TIP2023

Abstract:
In this paper, we propose a novel dehazing method based on self-distillation. In contrast to conventional knowledge distillation approaches that transfer large models (teacher networks) to small models (student networks), we introduce a single knowledge distillation network that transfers network parameters to itself for dehazing. In the early stages, the proposed network transfers scene content (identity) information to the next stage of itself using haze-free data. However, in the later stages, the network transfers haze information to itself using haze data, enabling the accurate dehazing of input images using scene information from the early stages. In a single network, parameters are seamlessly updated from extracting global scene features to dehazing the scene. During the training, forward propagation acts as a teacher network, whereas backward propagation acts as a student network. The experimental results demonstrate that the proposed method considerably outperforms other state-of-the-art dehazing methods.

Abstract:
Multi-Codebook Quantization (MCQ) is a generalized version of existing codebook-based quantizations for Approximate Nearest Neighbor (ANN) search. Specifically, MCQ picks one codeword from each sub-codebook independently and takes the sum of the picked codewords to approximate the original vector. Because the objective function involves no constraints, MCQ theoretically has the potential to achieve the best performance: the solutions of other codebook-based quantization methods are all covered by MCQ’s solution space under the same codebook size setting. However, finding the optimal solution to MCQ has been proved to be NP-hard due to its encoding process, i.e., converting an input vector to a binary code. To tackle this, researchers either apply constraints to find near-optimal solutions or employ heuristic algorithms that are still time-consuming for encoding. Unlike previous approaches, this paper makes the first attempt to find a deep solution to MCQ. The encoding network is designed to be as simple as possible, so the very complex encoding problem reduces to a single feed-forward pass. Compared with other methods on three datasets, our method shows state-of-the-art performance. Notably, our method is 11× to 38× faster than heuristic algorithms for encoding, which makes it more practical for real-world large-scale retrieval. Our code is publicly available: https://github.com/DeepMCQ/DeepQ.
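
The MCQ approximation described above (one codeword per sub-codebook, summed) can be illustrated with a small sketch. This is not the paper's deep encoder: the function names `decode`, `squared_error`, and `encode_exhaustive` are hypothetical, and the brute-force search is included only to show why encoding is expensive, since it enumerates every combination of codewords.

```python
from itertools import product
from typing import List, Sequence

Vector = List[float]


def decode(codebooks: Sequence[Sequence[Vector]], code: Sequence[int]) -> Vector:
    """Approximate a vector as the sum of one picked codeword per sub-codebook."""
    dim = len(codebooks[0][0])
    approx = [0.0] * dim
    for book, index in zip(codebooks, code):
        for d in range(dim):
            approx[d] += book[index][d]
    return approx


def squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode_exhaustive(codebooks: Sequence[Sequence[Vector]], x: Vector) -> List[int]:
    """Illustrative brute force: tries all K**M codes (M codebooks of K codewords).

    Exact, but exponential in the number of codebooks, which is why practical
    systems rely on constraints, heuristics, or a learned encoder instead.
    """
    candidates = product(*(range(len(book)) for book in codebooks))
    best = min(candidates, key=lambda code: squared_error(decode(codebooks, code), x))
    return list(best)


if __name__ == "__main__":
    # Two toy sub-codebooks with two 2-D codewords each (made-up values).
    toy_codebooks = [
        [[0.0, 0.0], [1.0, 0.0]],
        [[0.0, 0.0], [0.0, 1.0]],
    ]
    code = encode_exhaustive(toy_codebooks, [0.9, 1.1])
    print(code, decode(toy_codebooks, code))
```

With M sub-codebooks of K codewords each, the search above visits K**M candidates, whereas decoding is only M vector additions.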

Abstract:
Zero-shot learning (ZSL) aims to identify unseen classes with zero samples during training. Broadly speaking, present ZSL methods usually adopt class-level semantic labels and compare them with instance-level semantic predictions to infer unseen classes. However, we find that such existing models mostly produce imbalanced semantic predictions, i.e., these models may predict some semantics precisely but not others. To address this drawback, we aim to introduce an imbalanced learning framework into ZSL. However, we find that imbalanced ZSL has two unique challenges: (1) its imbalanced predictions are highly correlated with the value of semantic labels rather than the number of samples, as typically considered in traditional imbalanced learning; (2) different semantics follow quite different error distributions between classes. To mitigate these issues, we first formalize ZSL as an imbalanced regression problem, which offers empirical evidence to interpret how semantic labels lead to imbalanced semantic predictions. We then propose a re-weighted loss termed Re-balanced Mean-Squared Error (ReMSE), which tracks the mean and variance of error distributions, thus ensuring rebalanced learning across classes. As a major contribution, we conduct a series of analyses showing that ReMSE is theoretically well established. Extensive experiments demonstrate that the proposed method effectively alleviates the imbalance in semantic prediction and outperforms many state-of-the-art ZSL methods.

Abstract:
Event cameras, or dynamic vision sensors, have recently achieved success in applications ranging from fundamental vision tasks to high-level vision research. Owing to their ability to asynchronously capture light intensity changes, event cameras have an inherent advantage in capturing moving objects in challenging scenarios, including low light, high dynamic range, and fast motion. Event cameras are therefore a natural fit for visual object tracking. However, current event-based trackers derived from RGB trackers simply replace the input images with event frames and still follow the conventional tracking pipeline, which mainly focuses on object texture for target distinction. As a result, these trackers may not be robust in challenging scenarios such as moving cameras and cluttered foregrounds. In this paper, we propose a distractor-aware event-based tracker, named DANet, that introduces transformer modules into a Siamese network architecture. Specifically, our model is mainly composed of a motion-aware network and a target-aware network, which simultaneously exploit both motion cues and object contours from event data, so as to discover moving objects and identify the target object by removing dynamic distractors. Our DANet can be trained in an end-to-end manner without any post-processing and can run at over 80 FPS on a single V100. We conduct comprehensive experiments on two large event tracking datasets to validate the proposed model. We demonstrate that our tracker has superior performance against the state-of-the-art trackers in terms of both accuracy and efficiency.

Abstract:
Faster computation of a weighted median (WM) filter is impeded by the construction of a weighted histogram for every local window of data. Since the calculated weights vary for each local window, it is difficult, using a sliding window approach, to construct the weighted histogram efficiently. In this paper, we propose a novel WM filter that overcomes the difficulty of histogram construction. Our proposed method achieves real-time processing for higher resolution images and can be applied to multidimensional, multichannel, and high precision data. The weight kernel used in our WM filter is the pointwise guided filter, which is derived from the guided filter. The use of kernels based on the guided filter avoids gradient reversal artifacts and shows a higher denoising performance than the Gaussian kernel based on the color/intensity distance. The core idea of the proposed method is a formulation that allows the use of histogram updates with a sliding window approach to find the weighted median. For high precision data we propose an algorithm based on a linked list that can reduce the memory requirements of storing histograms and the computational cost of updating them. We present implementations of the proposed method that are suitable for both CPU and GPU. Experimental results show that the proposed method indeed realizes faster computation than conventional WM filters and is capable of filtering multidimensional, multichannel, and high precision data. This is an approach which is difficult to achieve with conventional methods.
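
To make the cost discussed above concrete, here is a minimal sketch of the textbook weighted median computed from a weighted histogram for a single window. It is not the proposed sliding-window formulation or the linked-list variant. The function name `weighted_median_hist` and the `levels` parameter are assumptions, and the input values are assumed to be integers in `[0, levels)`.

```python
from typing import Sequence


def weighted_median_hist(values: Sequence[int],
                         weights: Sequence[float],
                         levels: int = 256) -> int:
    """Weighted median of integer samples via a weighted histogram.

    Returns the smallest level whose cumulative weight reaches half of the
    total weight. Building `hist` from scratch like this for every window is
    the per-window cost that makes a naive weighted median filter slow.
    """
    hist = [0.0] * levels
    for value, weight in zip(values, weights):
        hist[value] += weight          # accumulate weight into the sample's bin

    half = sum(hist) / 2.0
    cumulative = 0.0
    for level, mass in enumerate(hist):
        cumulative += mass
        if mass > 0 and cumulative >= half:
            return level
    raise ValueError("weights must have a positive sum")


if __name__ == "__main__":
    # Toy window: the heavily weighted sample pulls the median towards it.
    print(weighted_median_hist([1, 2, 3, 4], [1.0, 1.0, 1.0, 5.0]))
    print(weighted_median_hist([10, 20, 30], [1.0, 1.0, 1.0]))
```

With uniform weights the result reduces to the ordinary median, which is the case where a sliding-window histogram update is straightforward.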

Abstract:
Domain adaptation methods typically reduce domain shift by learning domain-invariant features. Most existing methods are built on distribution matching, e.g., adversarial domain adaptation, which tends to corrupt feature discriminability. In this paper, we propose Discriminative Radial Domain Adaptation (DRDA), which bridges source and target domains via a shared radial structure. It is motivated by the observation that as the model is trained to be progressively discriminative, features of different categories expand outwards in different directions, forming a radial structure. We show that transferring such an inherently discriminative structure enhances feature transferability and discriminability simultaneously. Specifically, we represent each domain with a global anchor and each category with a local anchor to form a radial structure, and reduce domain shift via structure matching. The matching consists of two parts, namely an isometric transformation to align the structure globally and a local refinement to match each category. To enhance the discriminability of the structure, we further encourage samples to cluster close to the corresponding local anchors based on optimal-transport assignment. In extensive experiments on multiple benchmarks, our method consistently outperforms state-of-the-art approaches on varied tasks, including the typical unsupervised domain adaptation, multi-source domain adaptation, domain-agnostic learning, and domain generalization.

Abstract:
Recently, numerous human-designed and automatically searched neural networks have been applied to image denoising. However, previous works tend to handle all noisy images with a pre-defined static network architecture, which inevitably leads to high computational complexity for good denoising quality. Here, we present a dynamic slimmable denoising network (DDS-Net), a general method to achieve good denoising quality with less computational complexity by dynamically adjusting the channel configurations of networks at test time with respect to different noisy images. Our DDS-Net is empowered with the ability of dynamic inference by a dynamic gate, which can predictively adjust the channel configuration of networks with negligible extra computation cost. To ensure the performance of each candidate sub-network and the fairness of the dynamic gate, we propose a three-stage optimization scheme. In the first stage, we train a weight-shared slimmable super network. In the second stage, we evaluate the trained slimmable super network in an iterative way and progressively tailor the channel numbers of each layer with minimal denoising quality drop. In a single pass, we can obtain several sub-networks with good performance under different channel configurations. In the last stage, we identify easy and hard samples in an online way and train a dynamic gate to predictively select the corresponding sub-network with respect to different noisy images. Extensive experiments demonstrate that our DDS-Net consistently outperforms the state-of-the-art individually trained static denoising networks.

Abstract:
In this paper, we propose a semi-sparsity smoothing method based on a new sparsity-induced minimization scheme. The model is derived from the observation that semi-sparsity prior knowledge is universally applicable in situations where sparsity is not fully admitted, such as in polynomial-smoothing surfaces. We illustrate that such priors can be formulated as a generalized L_0-norm minimization problem in higher-order gradient domains, giving rise to a new “feature-aware” filter with a powerful simultaneous-fitting ability in both sparse singularities (corners and salient edges) and polynomial-smoothing surfaces. Note that a direct solver for the proposed model is not available due to the non-convexity and combinatorial nature of L_0-norm minimization. Instead, we propose to solve it approximately based on an efficient half-quadratic splitting technique. We demonstrate its versatility and many benefits in a series of signal/image processing and computer vision applications.

Abstract:
Using a sequence of discrete still images to tell a story or introduce a process has become a tradition in the field of digital visual media. With the surge in these media and the requirements of downstream tasks, acquiring their main topics or genres in a very short time is urgently needed. As a representative form of such media, comics have enjoyed a huge boom since going digital. However, unlike natural images, comic images are divided into panels, and the images are not visually consistent from page to page. Therefore, existing works tailored for natural images perform poorly in analyzing comics. Because the identification of comic genres is tied to the overall story plot, a long-term understanding that makes full use of the semantic interactions between multi-level comic fragments is needed. In this paper, we propose P^2Comic, a Panel-Page-aware Comic genre classification model, which takes page sequences of comics as the input and produces class-wise probabilities. P^2Comic utilizes detected panel boxes to extract panel representations and deploys self-attention to construct panel-page understanding, assisted by interdependent classifiers to model label correlation. We develop the first comic dataset for the task of comic genre classification with multi-genre labels. Experiments show that our approach outperforms state-of-the-art methods on related tasks. We also validate the extensibility of our network to the multi-modal scenario. Finally, we show the practicability of our approach by giving effective genre prediction results for whole comic books.

Abstract:
Conventional social media platforms usually downscale high-resolution (HR) images to restrict their resolution to a specific size for saving transmission/storage cost, which makes those visual details inaccessible to other users. To bypass this obstacle, recent invertible image downscaling methods jointly model the downscaling/upscaling problems and achieve impressive performance. However, they only consider fixed integer scale factors and may be inapplicable to generic downscaling tasks towards resolution restriction as posed by social media platforms. In this paper, we propose an effective and universal Scale-Arbitrary Invertible Image Downscaling Network (AIDN), to downscale HR images with arbitrary scale factors in an invertible manner. Particularly, the HR information is embedded in the downscaled low-resolution (LR) counterparts in a nearly imperceptible form such that our AIDN can further restore the original HR images solely from the LR images. The key to supporting arbitrary scale factors is our proposed Conditional Resampling Module (CRM) that conditions the downscaling/upscaling kernels and sampling locations on both scale factors and image content. Extensive experimental results demonstrate that our AIDN achieves top performance for invertible downscaling with both arbitrary integer and non-integer scale factors. Also, both quantitative and qualitative evaluations show our AIDN is robust to the lossy image compression standard. The source code and trained models are publicly available at https://github.com/Doubiiu/AIDN.

Abstract:
As an effective data augmentation method, Mixup synthesizes additional samples through linear interpolation. Despite its theoretical dependency on data properties, Mixup reportedly performs well as a regularizer and calibrator, contributing reliable robustness and generalization to deep model training. In this paper, inspired by Universum Learning, which uses out-of-class samples to assist the target tasks, we investigate Mixup from a largely under-explored perspective: its potential to generate in-domain samples that belong to none of the target classes, that is, universum. We find that in the framework of supervised contrastive learning, Mixup-induced universum can serve as surprisingly high-quality hard negatives, greatly relieving the need for large batch sizes in contrastive learning. With these findings, we propose Universum-inspired supervised Contrastive learning (UniCon), which incorporates the Mixup strategy to generate Mixup-induced universum as universum negatives and pushes them apart from anchor samples of the target classes. We extend our method to the unsupervised setting, proposing the Unsupervised Universum-inspired contrastive model (Un-Uni). Our approach not only improves Mixup with hard labels, but also introduces a novel way to generate universum data. With a linear classifier on the learned representations, UniCon shows state-of-the-art performance on various datasets. Notably, UniCon achieves 81.7% top-1 accuracy on CIFAR-100, surpassing the state of the art by a significant margin of 5.2% with a much smaller batch size, typically 256 in UniCon vs. 1024 in SupCon (Khosla et al., 2020), using ResNet-50. Un-Uni also outperforms SOTA methods on CIFAR-100. The code of this paper is released at https://github.com/hannaiiyanggit/UniCon.
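
The idea of a Mixup-induced universum can be sketched as follows. This is only an illustration under stated assumptions, not the released UniCon code: the helper names `mixup` and `universum_negatives` are hypothetical, the mixing coefficient of 0.5 is an assumed default, and the partner for each anchor is chosen deterministically (the first sample with a different label) purely to keep the example reproducible.

```python
from typing import List, Sequence

Vector = List[float]


def mixup(x_a: Sequence[float], x_b: Sequence[float], lam: float) -> Vector:
    """Standard Mixup interpolation: lam * x_a + (1 - lam) * x_b."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


def universum_negatives(samples: Sequence[Sequence[float]],
                        labels: Sequence[int],
                        lam: float = 0.5) -> List[Vector]:
    """For each anchor, mix it with a sample from a different class.

    The mixed point lies between two classes, so it is treated as belonging
    to neither (a universum sample) and can act as a hard negative.
    """
    negatives = []
    for i, (anchor, label) in enumerate(zip(samples, labels)):
        # Assumption: pick the first sample whose label differs from the anchor's.
        partner = next(j for j, other in enumerate(labels) if other != label)
        negatives.append(mixup(anchor, samples[partner], lam))
    return negatives


if __name__ == "__main__":
    toy_samples = [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]]   # made-up features
    toy_labels = [0, 0, 1]
    print(universum_negatives(toy_samples, toy_labels))
```

In a contrastive loss these mixed points would be added to the negative set of each anchor. A real implementation would typically sample partners at random within a batch.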

Affiliations: ReLER Lab, AAII, University of Technology Sydney, Ultimo, NSW, Australia; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Fusionopolis, Singapore; Shenzhen Key Laboratory of Visual Object Detection and Recognition, Harbin Institute of Technology (Shenzhen), Shenzhen, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; Data Dynamic Laboratory, School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW, Australia; ReLER Laboratory, Faculty of Engineering and Information Technology, AAII, University of Technology Sydney, Ultimo, NSW, Australia

Abstract:
The goal of camouflaged object detection (COD) is to detect objects that are visually embedded in their surroundings. Existing COD methods focus only on detecting camouflaged objects from seen classes and suffer from performance degradation when detecting unseen classes. However, in a real-world scenario, collecting sufficient data for seen classes is extremely difficult and labeling them requires highly professional skills, which makes these COD methods inapplicable. In this paper, we propose a new zero-shot COD framework (termed ZSCOD), which can effectively detect never-before-seen classes. Specifically, our framework includes a Dynamic Graph Searching Network (DGSNet) and a Camouflaged Visual Reasoning Generator (CVRG). In detail, DGSNet is proposed to adaptively capture more edge details to boost COD performance. CVRG is utilized to produce pseudo-features that are closer to the real features of the seen camouflaged objects, which transfers knowledge from seen classes to unseen classes to help detect unseen objects. In addition, our graph reasoning is built on a dynamic searching strategy, which pays more attention to object boundaries to reduce the influence of the background. More importantly, we construct the first zero-shot COD benchmark based on the COD10K dataset. Experimental results on public datasets show that our ZSCOD not only detects camouflaged objects of unseen classes but also achieves state-of-the-art performance in detecting seen classes.

Abstract:
Existing deep learning-based shadow removal methods still produce images with shadow remnants. These shadow remnants typically exist in homogeneous regions with low-intensity values, making them untraceable in the existing image-to-image mapping paradigm. We observe that shadows mainly degrade images at the image-structure level (at which humans perceive object shapes and continuous colors). Hence, in this paper, we propose to remove shadows at the image-structure level. Based on this idea, we propose a novel structure-informed shadow removal network (StructNet) that leverages image-structure information to address the shadow remnant problem. Specifically, StructNet first reconstructs the structure information of the input image without shadows and then uses the restored shadow-free structure prior to guide the image-level shadow removal. StructNet contains two main novel modules: 1) a mask-guided shadow-free extraction (MSFE) module to extract image structural features in a non-shadow-to-shadow directional manner; and 2) a multi-scale feature & residual aggregation (MFRA) module to leverage the shadow-free structure information to regularize feature consistency. In addition, we propose to extend StructNet to exploit multi-level structure information (MStructNet), to further boost the shadow removal performance with minimal computational overhead. Extensive experiments on three shadow removal benchmarks demonstrate that our method outperforms existing shadow removal methods, and that our StructNet can be integrated with existing methods to improve them further.

Abstract:
Automatic sketch colorization is a challenging task that aims to generate a color image from a sketch, primarily due to its inherently ill-posed nature. While many approaches have shown promising results, two significant challenges remain: limited color patterns and a wide range of artifacts such as color bleeding and semantic inconsistencies among relevant regions. These issues stem from the operation of traditional convolutional structures, which capture structural features in a pixel-wise manner, resulting in inadequate utilization of regional information within the sketch. Therefore, we propose the Region-Assisted Sketch Coloring (RASC) method, which introduces an intermediate representation called the ‘Region Map’ to explicitly characterize the regional information of the sketch. This Region Map is derived from the input sketch and is effectively formulated by our RASC architecture, enhancing the perception of region-wise features beyond the original pixel-wise features. Specifically, we start by employing the sketch encoder to extract hierarchical feature maps from the input sketches. Subsequently, we introduce a coarse-to-fine decoder comprising a series of Region-based Modulation (RM) blocks. This decoder modulates features that combine the modulation results of its previous block and the sketch features of the corresponding encoder block with our Region Formulation module. Each module explicitly formulates the sketch features in a region-wise manner. This accurately captures both the inner-region local style and inter-region global context dependency, resulting in various color patterns and fewer synthesis artifacts. Our experimental results show that our proposed method surpasses state-of-the-art methods in both synthetic and real sketch datasets.

Abstract:
Facial expression editing has attracted increasing attention with the advance of deep neural networks in recent years. However, most existing methods suffer from compromised editing fidelity and limited usability, as they either ignore pose variations (leading to unrealistic editing) or require paired training data (which is not easy to collect) for pose control. This paper presents POCE, an innovative pose-controllable expression editing network that can generate realistic facial expressions and head poses simultaneously with just unpaired training images. POCE achieves more accessible and realistic pose-controllable expression editing by mapping face images into UV space, where facial expressions and head poses can be disentangled and edited separately. POCE has two novel designs. The first is self-supervised UV completion, which completes UV maps sampled under different head poses that often suffer from self-occlusions and missing facial texture. The second is weakly-supervised UV editing, which generates new facial expressions with minimal modification of facial identity, where the synthesized expression can be controlled by an expression label or directly transplanted from a reference UV map via feature transfer. Extensive experiments show that POCE can learn from unpaired face images effectively, and that the learned model can generate realistic and high-fidelity facial expressions under various new poses.

Abstract:
Fully perceiving the surrounding world is a vital capability for autonomous robots. To achieve this goal, a multi-camera system is usually equipped on the data collecting platform and the structure from motion (SfM) technology is used for scene reconstruction. However, although incremental SfM achieves high-precision modeling, it is inefficient and prone to scene drift in large-scale reconstruction tasks. In this paper, we propose a tailored incremental SfM framework for multi-camera systems, where the internal relative poses between cameras can not only be calibrated automatically but also serve as an additional constraint to improve the system robustness. Previous multi-camera based modeling work has mainly focused on stereo setups or multi-camera systems with known calibration information, but we allow arbitrary configurations and only require images as input. First, one camera is selected as the reference camera, and the other cameras in the multi-camera system are denoted as non-reference cameras. Based on the pose relationship between the reference and non-reference camera, the non-reference camera pose can be derived from the reference camera pose and internal relative poses. Then, a two-stage multi-camera based camera registration module is proposed, where the internal relative poses are computed first by local motion averaging, and then the rigid units are registered incrementally. Finally, a multi-camera based bundle adjustment is put forth to iteratively refine the reference camera and the internal relative poses. Experiments demonstrate that our system achieves higher accuracy and robustness on benchmark data compared to the state-of-the-art SfM and SLAM (simultaneous localization and mapping) methods.
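
The derivation of a non-reference camera pose from the reference pose and an internal relative pose can be written as a composition of rigid transforms. The sketch below assumes a world-to-camera convention, x_cam = R x_world + t, which the abstract does not specify. Under that assumption, R_k = R_rel R_ref and t_k = R_rel t_ref + t_rel. The helper names are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matmul3(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> Matrix:
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec3(a: Sequence[Sequence[float]], v: Sequence[float]) -> Vector:
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def compose_pose(r_rel, t_rel, r_ref, t_ref) -> Tuple[Matrix, Vector]:
    """Pose of a non-reference camera from the reference pose.

    Assumed convention (world-to-camera):
        x_ref = r_ref @ x_world + t_ref
        x_k   = r_rel @ x_ref   + t_rel
    so  r_k = r_rel @ r_ref  and  t_k = r_rel @ t_ref + t_rel.
    """
    r_k = matmul3(r_rel, r_ref)
    t_k = [rotated + offset for rotated, offset in zip(matvec3(r_rel, t_ref), t_rel)]
    return r_k, t_k


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]   # 90 degrees about the z axis
    r_k, t_k = compose_pose(rot_z_90, [0, 0, 2], identity, [1, 0, 0])
    print(r_k, t_k)
```

Because every non-reference pose is a function of the reference pose and a fixed relative pose, a bundle adjustment only needs to refine the reference poses and one relative pose per non-reference camera, rather than an independent pose for every image.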

Abstract:
The quality of ICC profiles with embedded look-up tables (LUTs) depends on multiple factors: (1) the accuracy of the optical printer model, (2) the exploitation of the available gamut combined with the quality of the gamut mapping approach encoded in the B2A-LUTs (backwards LUTs), and (3) the tonal smoothness as well as the color accuracy of the backwards LUTs. It can be shown that optimizing the smoothness of the LUTs comes at the expense of color accuracy and requires gamut reduction because of internal tonal edges. We present a method to optimize backwards LUTs of existing ICC profiles with respect to accuracy, smoothness, gamut exploitation, and mapping, which can be extended beyond color, e.g., to joint color and translucency backward LUTs. The approach is based on a perceptual difference metric that is used to optimize the LUT’s tonal smoothness under the constraint of preserving both the accuracy of and the relationship between colors.

Abstract:
In practical media distribution systems, visual content usually undergoes multiple stages of quality degradation along the delivery chain, but the pristine source content is rarely available at most quality monitoring points along the chain to serve as a reference for quality assessment. As a result, full-reference (FR) and reduced-reference (RR) image quality assessment (IQA) methods are generally infeasible. Although no-reference (NR) methods are readily applicable, their performance is often not reliable. On the other hand, intermediate references of degraded quality are often available, e.g., at the input of video transcoders, but how to make the best use of them in proper ways has not been deeply investigated. Here we make one of the first attempts to establish a new paradigm named degraded-reference IQA (DR IQA). Specifically, by using a two-stage distortion pipeline we lay out the architectures of DR IQA and introduce a 6-bit code to denote the choices of configurations. We construct the first large-scale databases dedicated to DR IQA and have made them publicly available. We make novel observations on distortion behavior in multi-stage distortion pipelines by comprehensively analyzing five multiple distortion combinations. Based on these observations, we develop novel DR IQA models and make extensive comparisons with a series of baseline models derived from top-performing FR and NR models. The results suggest that DR IQA may offer significant performance improvement in multiple distortion environments, thereby establishing DR IQA as a valid IQA paradigm that is worth further exploration.

Abstract:
A superpixel is an over-segmented region of an image whose basic units, pixels, have similar properties. Although many popular seed-based algorithms have been proposed to improve the segmentation quality of superpixels, they still suffer from the seed initialization problem and the pixel assignment problem. In this paper, we propose Vine Spread for Superpixel Segmentation (VSSS) to form high-quality superpixels. First, we extract image color and gradient features to define the soil model, which establishes a “soil” environment for the vine, and then we define the vine state model by simulating the vine’s “physiological” state. Thereafter, to catch more image details and twigs of the object, we propose a new seed initialization strategy that perceives image gradients at the pixel level and involves no randomness. Next, to balance the boundary adherence and the regularity of the superpixels, we define a three-stage “parallel spreading” vine spread process as a novel pixel assignment scheme, in which the proposed nonlinear velocity for vines helps to form superpixels with regular shape and homogeneity, while the crazy spreading mode for vines and the soil averaging strategy help to enhance the boundary adherence of the superpixels. Finally, a series of experimental results demonstrate that our VSSS offers competitive performance among seed-based methods, especially in catching object details and twigs, balancing boundary adherence, and obtaining regularly shaped superpixels.

Abstract:
This paper studies the problem of unsupervised domain adaptive hashing, which is less explored but emerging for efficient image retrieval, particularly for cross-domain retrieval. This problem is typically tackled by learning hashing networks with pseudo-labeling and domain alignment techniques. Nevertheless, these approaches usually suffer from overconfident and biased pseudo-labels and from inefficient domain alignment that does not sufficiently explore semantics, and thus fail to achieve satisfactory retrieval performance. To tackle this issue, we present PEACE, a principled framework which holistically explores semantic information in both source and target data and extensively incorporates it for effective domain alignment. For comprehensive semantic learning, PEACE leverages label embeddings to guide the optimization of hash codes for source data. More importantly, to mitigate the effects of noisy pseudo-labels, we propose a novel method to holistically measure the uncertainty of pseudo-labels for unlabeled target data and progressively minimize it through alternative optimization under the guidance of the domain discrepancy. Additionally, PEACE effectively removes domain discrepancy in the Hamming space from two views. In particular, it not only introduces composite adversarial learning to implicitly explore semantic information embedded in hash codes, but also aligns cluster semantic centroids across domains to explicitly exploit label information. Experimental results on several popular domain adaptive retrieval benchmarks demonstrate the superiority of our proposed PEACE compared with various state-of-the-art methods on both single-domain and cross-domain retrieval tasks. Our source code is available at https://github.com/WillDreamer/PEACE.

Abstract:
The area of domain adaptation has been instrumental in addressing the domain shift problem encountered by many deep learning applications. This problem arises from the difference between the distribution of the source data used for training and that of the target data encountered in realistic testing scenarios. In this paper, we introduce a novel MultiScale Domain Adaptive YOLO (MS-DAYOLO) framework that employs multiple domain adaptation paths and corresponding domain classifiers at different scales of the YOLOv4 object detector. Building on our baseline multiscale DAYOLO framework, we introduce three novel deep learning architectures for a Domain Adaptation Network (DAN) that generates domain-invariant features. In particular, we propose a Progressive Feature Reduction (PFR) architecture, a Unified Classifier (UC) architecture, and an Integrated architecture. We train and test our proposed DAN architectures in conjunction with YOLOv4 using popular datasets. Our experiments show significant improvements in object detection performance when YOLOv4 is trained using the proposed MS-DAYOLO architectures and tested on target data for autonomous driving applications. Moreover, the MS-DAYOLO framework achieves an order-of-magnitude real-time speed improvement relative to Faster R-CNN solutions while providing comparable object detection performance.

Abstract:
Exploiting fine-grained correspondence and visual-semantic alignments has shown great potential in image-text matching. Generally, recent approaches first employ a cross-modal attention unit to capture latent region-word interactions, and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies with complex architectures or additional information, while ignoring the regulation ability of network feedback. In this paper, we develop two simple but quite effective regulators which efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Specifically, we propose (i) a Recurrent Correspondence Regulator (RCR) which facilitates the cross-modal attention unit progressively with adaptive attention factors to capture more flexible correspondence, and (ii) a Recurrent Aggregation Regulator (RAR) which adjusts the aggregation weights repeatedly to increasingly emphasize important alignments and dilute unimportant ones. Besides, it is interesting that RCR and RAR are “plug-and-play”: both of them can be incorporated into many frameworks based on cross-modal interaction to obtain significant benefits, and their cooperation achieves further improvements. Extensive experiments on MSCOCO and Flickr30K datasets validate that they can bring an impressive and consistent R@1 gain on multiple models, confirming the general effectiveness and generalization ability of the proposed methods.

Abstract:
Human-object relationship detection reveals the fine-grained relationship between humans and objects, helping the comprehensive understanding of videos. Previous human-object relationship detection approaches are mainly developed with object features and relation features without exploring the specific information of humans. In this paper, we propose a novel Relation-Pose Transformer (RPT) for human-object relationship detection. Inspired by the coordination of eye-head-body movements in cognitive science, we employ the head pose to find those crucial objects that humans focus on and use the body pose with skeleton information to represent multiple actions. Then, we utilize the spatial encoder to capture spatial contextualized information of the relation pair, which integrates the relation features and pose features. Next, the temporal decoder aims to model the temporal dependency of the relationship. Finally, we adopt multiple classifiers to predict different types of relationships. Extensive experiments on the benchmark Action Genome validate the effectiveness of our proposed method and show the state-of-the-art performance compared with related methods.

Abstract:
Estimating the 3D structure of the drivable surface and surrounding environment is a crucial task for assisted and autonomous driving. It is commonly solved either by using 3D sensors such as LiDAR or directly predicting the depth of points via deep learning. However, the former is expensive, and the latter lacks the use of geometry information for the scene. In this paper, instead of following existing methodologies, we propose Road Planar Parallax Attention Network (RPANet), a new deep neural network for 3D sensing from monocular image sequences based on planar parallax, which takes full advantage of the omnipresent road plane geometry in driving scenes. RPANet takes a pair of images aligned by the homography of the road plane as input and outputs a $\gamma$ map (the ratio of height to depth) for 3D reconstruction. The $\gamma$ map has the potential to construct a two-dimensional transformation between two consecutive frames. It implies planar parallax and can be combined with the road plane serving as a reference to estimate the 3D structure by warping the consecutive frames. Furthermore, we introduce a novel cross-attention module to make the network better perceive the displacements caused by planar parallax. To verify the effectiveness of our method, we sample data from the Waymo Open Dataset and construct annotations related to planar parallax. Comprehensive experiments are conducted on the sampled dataset to demonstrate the 3D reconstruction accuracy of our approach in challenging scenarios.

Abstract:
Network pruning is one of the chief means for improving the computational efficiency of Deep Neural Networks (DNNs). Pruning-based methods generally discard network kernels, channels, or layers, which inevitably disrupts the original well-learned network correlations and thus leads to performance degradation. In this work, we propose an Efficient Layer Compression (ELC) approach to efficiently compress serial layers by decoupling and merging rather than pruning. Specifically, we first propose a novel decoupling module to decouple the layers, enabling us to readily merge serial layers that include both nonlinear and convolutional layers. Then, the decoupled network is losslessly merged based on the equivalent conversion of the parameters. In this way, our ELC can effectively reduce the depth of the network without destroying the correlation of the convolutional layers. To the best of our knowledge, we are the first to exploit the mergeability of serial convolutional layers for lossless network layer compression. Experimental results on two datasets demonstrate that our method retains superior performance with a FLOPs reduction of 74.1% for VGG-16 and 54.6% for ResNet-56, respectively. In addition, our ELC improves the inference speed by 2× on the Jetson AGX Xavier edge device.
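
The idea of merging serial layers through an equivalent conversion of parameters can be illustrated in its simplest form with two affine maps that have no nonlinearity between them. The sketch below is an assumption-laden toy, not the ELC procedure itself: it uses plain matrices instead of convolutions, and the helper names `matvec`, `matmul`, `merge_linear`, and `affine` are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def affine(W, b, x):
    """Apply the affine map y = W x + b."""
    return [v + c for v, c in zip(matvec(W, x), b)]


def merge_linear(W1, b1, W2, b2):
    """Collapse y = W2 (W1 x + b1) + b2 into a single map y = W x + b."""
    W = matmul(W2, W1)                                # W = W2 W1
    b = [v + c for v, c in zip(matvec(W2, b1), b2)]   # b = W2 b1 + b2
    return W, b


# Toy parameters (hypothetical values chosen for illustration only).
W1, b1 = [[1, 2], [3, 4], [5, 6]], [1, 0, -1]
W2, b2 = [[1, 0, 2], [0, 1, -1]], [0.5, -0.5]
x = [2, -1]

two_layers = affine(W2, b2, affine(W1, b1, x))
W, b = merge_linear(W1, b1, W2, b2)
one_layer = affine(W, b, x)
```

Because both paths compute the same function, the merged layer halves the depth of this toy network without changing its output. A convolution is also a linear operator, which is why the same algebra can apply to serial convolutional layers once the nonlinearity between them has been dealt with.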

Abstract:
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms. Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner. Distortion type identification and degradation level determination are employed as auxiliary tasks to train a deep learning model containing a deep Convolutional Neural Network (CNN) that extracts spatial features, as well as a recurrent unit that captures temporal information. The model is trained using a contrastive loss and we therefore refer to this training framework and resulting model as CONtrastive VIdeo Quality EstimaTor (CONVIQT). During testing, the weights of the trained model are frozen, and a linear regressor maps the learned features to quality scores in a no-reference (NR) setting. We conduct comprehensive evaluations of the proposed model against leading algorithms on multiple VQA databases containing wide ranges of spatial and temporal distortions. We analyze the correlations between model predictions and ground-truth quality ratings, and show that CONVIQT achieves competitive performance when compared to state-of-the-art NR-VQA models, even though it is not trained on those databases. Our ablation experiments demonstrate that the learned representations are highly robust and generalize well across synthetic and realistic distortions. Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.

Abstract:
Deep learning (DL) based methods for motion deblurring, taking advantage of large-scale datasets and sophisticated network structures, have reported promising results. However, two challenges remain: existing methods usually perform well on synthetic datasets but cannot deal with complex real-world blur, and over- or under-estimation of the blur results in restored images that remain blurred or even contain unwanted distortion. We propose a motion deblurring framework that includes a Blur Space Disentangled Network (BSDNet) and a Hierarchical Scale-recurrent Deblurring Network (HSDNet) to address these issues. Specifically, we train an image blurring model to facilitate learning a better image deblurring model. Firstly, BSDNet learns how to separate the blur features from blurry images, which is useful for blur transfer, dataset augmentation, and ultimately guiding the deblurring model. Secondly, to gradually recover sharp information in a coarse-to-fine manner, HSDNet makes full use of the blur features acquired by BSDNet as a prior and breaks down the non-uniform deblurring task into various subtasks. Moreover, the motion blur dataset created by BSDNet also bridges the gap between training images and actual blur. Extensive experiments on real-world blur datasets demonstrate that our method works effectively in complex scenarios and significantly outperforms many state-of-the-art approaches.

Abstract:
Talking face generation is the process of synthesizing a lip-synchronized video when given a reference portrait and an audio clip. However, generating a fine-grained talking video is nontrivial due to several challenges: 1) capturing vivid facial expressions, such as muscle movements; 2) ensuring smooth transitions between consecutive frames; and 3) preserving the details of the reference portrait. Existing efforts have only focused on modeling rigid lip movements, resulting in low-fidelity videos with jerky facial muscle deformations. To address these challenges, we propose a novel Fine-gRained mOtioN moDel (FROND), consisting of three components. In the first component, we adopt a two-stream encoder to capture local facial movement keypoints and embed their overall motion context as the global code. In the second component, we design a motion estimation module to predict audio-driven movements. This enables the learning of local key point motion in the continuous trajectory space to achieve smooth temporal facial movements. Additionally, the local and global motions are fused to estimate a continuous dense motion field, resulting in spatially smooth movements. In the third component, we devise a novel implicit image decoder based on an implicit neural network. This decoder recovers high-frequency information from the input image, resulting in a high-fidelity talking face. In summary, the FROND refines the motion trajectories of facial keypoints into a continuous dense motion field, which is followed by a decoder that fully exploits the inherent smoothness of the motion. We conduct quantitative and qualitative model evaluations on benchmark datasets. The experimental results show that our proposed FROND significantly outperforms several state-of-the-art baselines.

Abstract:
Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to the given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pretraining), can address the limitation. However, CLIP falls short in capturing fine-grained information, thereby not fully leveraging its powerful capacity in TIReID. Besides, the popular explicit local matching paradigm for mining fine-grained information heavily relies on the quality of local parts and cross-modal inter-part interaction/guidance, leading to intra-modal information distortion and ambiguity problems. Accordingly, in this paper, we propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID. To transfer the multi-modal knowledge effectively, we conduct fine-grained information excavation to mine modality-shared discriminative details for global alignment. Specifically, we propose a multi-level global feature learning (MGF) module that fully mines the discriminative local information within each modality, thereby emphasizing identity-related discriminative clues through enhanced interaction between global image (text) and informative local patches (words). MGF generates a set of enhanced global features for later inference. Furthermore, we design cross-grained feature refinement (CFR) and fine-grained correspondence discovery (FCD) modules to establish cross-modal correspondence at both coarse and fine-grained levels (image-word, sentence-patch, word-patch), ensuring the reliability of informative local patches/words. CFR and FCD are removed during inference to optimize computational efficiency. Extensive experiments on multiple benchmarks demonstrate the superior performance of our method in TIReID.

Abstract:
Multi-view subspace clustering is an important topic in cluster analysis. Its aim is to utilize the complementary information conveyed by multiple views of objects to be clustered. Recently, view-shared anchor learning based multi-view clustering methods have been developed to speed up the learning of common data representation. Although widely applied to large-scale scenarios, most of the existing approaches still face two limitations. First, they do not give sufficient consideration to the negative impact caused by certain noisy views with unclear clustering structures. Second, many of them only focus on the multi-view consistency, yet are incapable of capturing the cross-view diversity. As a result, the learned complementary features may be inaccurate and adversely affect clustering performance. To solve these two challenging issues, we propose a Fast Self-guided Multi-view Subspace Clustering (FSMSC) algorithm which integrates the view-shared anchor learning and global-guided-local self-guidance learning into a unified model. Such an integration is inspired by the observation that the view with clean clustering structures will play a more crucial role in grouping the clusters when the features of all views are concatenated. Specifically, we first learn a locally-consistent data representation shared by all views in the local learning module, then we learn a globally-discriminative data representation from multi-view concatenated features in the global learning module. Afterwards, a feature selection matrix constrained by the $\ell_{2,1}$-norm is designed to construct a guidance from global learning to local learning. In this way, the multi-view consistent and diverse information can be simultaneously utilized and the negative impact caused by noisy views can be overcome to some extent. Extensive experiments on different datasets demonstrate the effectiveness of our proposed fast self-guided learning model, and its promising performance compared to both the state-of-the-art non-deep and deep multi-view clustering algorithms. The code of this paper is available at https://github.com/chenzhe207/FSMSC.
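
For readers unfamiliar with the norm used to constrain the feature selection matrix, the following minimal sketch computes it under the common convention that the $\ell_{2,1}$-norm is the sum of the Euclidean norms of the rows of a matrix. Whether the paper applies it over rows or columns is not stated in the abstract, so the row-wise choice here is an assumption, and the function name `l21_norm` is hypothetical.

```python
import math


def l21_norm(M):
    """Sum of the Euclidean (l2) norms of the rows of M (row-wise convention)."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in M)


# Rows have l2 norms 5, 0, and 13; an all-zero row contributes nothing.
M = [[3.0, 4.0],
     [0.0, 0.0],
     [5.0, 12.0]]
value = l21_norm(M)
```

Penalizing this quantity tends to drive entire rows to zero, which is why it is widely used for selecting or discarding whole features at once.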

Abstract:
Data in the real world are usually characterized in multiple views, including different types of features or different modalities. Multi-view learning has been popular in the past decades and has achieved significant improvements. In this paper, we investigate three challenging problems in the field of incomplete multi-view representation learning, namely, i) how to reduce the influence of missing views in a multi-view dataset, ii) how to learn a consistent and informative representation among different views and iii) how to alleviate the impact of the inherent noise in multi-view data caused by high-dimensional features or varied quality for different data points. To address these challenges, we integrate these three tasks into a single problem and propose a novel framework termed Noise-aware Incomplete Multi-view Learning Networks (NIM-Nets). NIM-Nets fully utilize incomplete data from different views to produce a multi-view shared representation which is consistent, informative and robust to noise. We model the inherent noise in data by defining the distribution $\Gamma$ and assuming that each observation in the incomplete dataset is sampled from the distribution $\Gamma$. To the best of our knowledge, this is the first work to unify learning the consistent and informative representation, alleviating the impact of noise in data and handling the view-missing patterns in multi-view learning into a single framework. We are also the first to give a definition of robustness and completeness for incomplete multi-view representation learning. Based on NIM-Nets, we present joint optimization models for classification and clustering, respectively. Extensive experiments on different datasets demonstrate the effectiveness of our method over the existing work based on classification and clustering tasks in terms of different metrics.

Abstract:
The local parts of the target are vitally important for robust object tracking. Nevertheless, existing excellent context regression methods involving siamese networks and discrimination correlation filters mostly represent the target appearance from the holistic model, showing high sensitivity in scenarios with partial occlusion and drastic appearance changes. In this paper, we address this issue by proposing a novel part-aware framework based on context regression, which simultaneously considers the global and local parts of the target and fully exploits their relationship to be collaboratively aware of the target state online. To this end, the spatial-temporal measure among context regressors corresponding to multiple parts is designed to evaluate the tracking quality of each part regressor by solving the imbalance among global and local parts. The coarse target locations provided by part regressors are further aggregated by treating their measures as weights to refine the final target location. Furthermore, the divergence of multiple part regressors in each frame reveals the interference degree of background noise, which is quantified to control the proposed combination window functions in part regressors to adaptively filter redundant noise. Besides, the spatial-temporal information among part regressors is also leveraged to assist in accurately estimating the target scale. Extensive evaluations demonstrate that the proposed framework helps many context regression trackers achieve performance improvements and performs favorably against state-of-the-art methods on the popular benchmarks: OTB, TC128, UAV, UAVDT, VOT, TrackingNet, GOT-10k, LaSOT.

Abstract:
Compared to color images captured by conventional RGB cameras, monochrome (mono) images usually have higher signal-to-noise ratios (SNR) and richer textures due to the lack of color filter arrays in mono cameras. Therefore, using a mono-color stereo dual-camera system, we can integrate the lightness information of target monochrome images with the color information of guidance RGB images to accomplish image enhancement in a colorization manner. In this work, based on two assumptions, we introduce a novel probabilistic-concept guided colorization framework. First, adjacent contents with similar luminance are likely to have similar colors. By lightness matching, we can utilize colors of the matched pixels to estimate the target color value. Second, by matching multiple pixels from the guidance image, if more of these matched pixels have similar luminance values to the target one, we can estimate colors with more confidence. Based on the statistical distribution of multiple matching results, we retain the reliable color estimates as initial dense scribbles and then propagate them to the rest of the mono image. However, for a target pixel, the color information provided by its matching results is quite redundant. Hence, we introduce a patch sampling strategy to accelerate the colorization process. Based on the analysis of the posterior probability distribution of the sampling results, we can use far fewer matches for color estimation and reliability assessment. To alleviate incorrect color propagation in the sparsely scribbled regions, we generate extra color seeds according to the existing scribbles to guide the propagation process. Experimental results show that our algorithm can efficiently and effectively restore color images with higher SNR and richer details from the mono-color image pairs, and achieves good performance in solving the color bleeding problem.

Abstract:
Crowd counting is the basic task of crowd analysis and it is of great significance in the field of public safety. Therefore, it has received more and more attention recently. The common idea is to combine the crowd counting task with convolutional neural networks to predict the corresponding density map, which is generated by filtering the dot labels with specific Gaussian kernels. Although the counting performance is improved by the newly proposed networks, they all suffer from one common problem: due to the perspective effect, there is significant scale contrast among targets in different positions within one scene, but the existing density maps cannot represent this scale change well. To address the prediction difficulties caused by target scale variation, we propose a scale-sensitive crowd density map estimation framework, which deals with target scale change at the density map generation, network design, and model training stages. It consists of the Adaptive Density Map (ADM), Deformable Density Map Decoder (DDMD), and Auxiliary Branch. To be specific, the Gaussian kernel size varies adaptively based on target size to generate an ADM that contains scale information for each specific target. DDMD introduces the deformable convolution to fit the Gaussian kernel variation and boosts the model’s scale sensitivity. The Auxiliary Branch guides the learning of deformable convolution offsets during the training phase. Finally, we conduct experiments on different large-scale datasets. The results show the effectiveness of the proposed ADM and DDMD. Furthermore, the visualization demonstrates that deformable convolution learns the target scale variation.
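
The density-map construction described above, filtering dot labels with Gaussian kernels whose size depends on the target, can be sketched as follows. This is a minimal illustration rather than the paper's ADM: the rule that maps target size to kernel width is not given in the abstract, so the per-point `sigmas` are simply supplied by the caller, and the function name `density_map` is hypothetical. Each kernel is normalized over the image so that every annotated head contributes a total mass of one.

```python
import math


def density_map(points, sigmas, height, width):
    """Place one normalized Gaussian per dot label; sigma may differ per point."""
    dmap = [[0.0] * width for _ in range(height)]
    for (px, py), sigma in zip(points, sigmas):
        # Unnormalized Gaussian centered on the dot label (px, py).
        kernel = [
            [math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(sum(row) for row in kernel)
        # Normalize so this target adds exactly one to the integral of the map.
        for y in range(height):
            for x in range(width):
                dmap[y][x] += kernel[y][x] / total
    return dmap


# Two annotated heads: a small (distant) one and a large (near) one.
points = [(2, 2), (6, 5)]   # (x, y) pixel coordinates, hypothetical values
sigmas = [1.0, 2.5]         # wider kernel for the larger target
dmap = density_map(points, sigmas, height=8, width=10)
count = sum(sum(row) for row in dmap)
```

Summing the map recovers the number of annotated targets, while the spread of each blob carries the per-target scale information that a fixed kernel would discard.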

Abstract:
Recently, learning-based algorithms have shown impressive performance in underwater image enhancement. Most of them resort to training on synthetic data and obtain outstanding performance. However, these deep methods ignore the significant domain gap between the synthetic and real data (i.e., inter-domain gap), and thus the models trained on synthetic data often fail to generalize well to real-world underwater scenarios. Moreover, the complex and changeable underwater environment also causes a great distribution gap among the real data itself (i.e., intra-domain gap). However, almost no research focuses on this problem, and thus existing techniques often produce visually unpleasing artifacts and color distortions on various real images. Motivated by these observations, we propose a novel Two-phase Underwater Domain Adaptation network (TUDA) to simultaneously minimize the inter-domain and intra-domain gap. Concretely, in the first phase, a new triple-alignment network is designed, including a translation part for enhancing realism of input images, followed by a task-oriented enhancement part. By performing image-level, feature-level and output-level adaptation in these two parts through joint adversarial learning, the network can better build invariance across domains and thus bridge the inter-domain gap. In the second phase, an easy-hard classification of real data according to the assessed quality of enhanced images is performed, in which a new rank-based underwater quality assessment method is embedded. By leveraging implicit quality information learned from rankings, this method can more accurately assess the perceptual quality of enhanced images. Using pseudo labels from the easy part, an easy-hard adaptation technique is then conducted to effectively decrease the intra-domain gap between easy and hard samples. Extensive experimental results demonstrate that the proposed TUDA is significantly superior to existing works in terms of both visual quality and quantitative metrics.

Abstract:
3D object detection algorithms for autonomous driving reason about 3D obstacles either from 3D bird's-eye view or perspective view or both. Recent works attempt to improve the detection performance via mining and fusing from multiple egocentric views. Although the egocentric perspective view alleviates some weaknesses of the bird's-eye view, the sectored grid partition becomes so coarse in the distance that the targets and surrounding context mix together, which makes the features less discriminative. In this paper, we generalize the research on 3D multi-view learning and propose a novel multi-view-based 3D detection method, named X-view, to overcome the drawbacks of the multi-view methods. Specifically, X-view breaks through the traditional limitation that the origin of the perspective view must coincide with that of the 3D Cartesian coordinate system. X-view is designed as a general paradigm that can be applied to almost any LiDAR-based 3D detector with only a small increase in running time, regardless of whether it is voxel/grid-based or raw-point-based. We conduct experiments on KITTI and NuScenes datasets to demonstrate the robustness and effectiveness of our proposed X-view. The results show that X-view obtains consistent improvements when combined with mainstream state-of-the-art 3D methods.

Abstract:
Most multi-exposure image fusion (MEF) methods perform unidirectional alignment within limited and local regions, ignoring the effects of augmented locations and preserving deficient global features. In this work, we propose a multi-scale bidirectional alignment network via deformable self-attention to perform adaptive image fusion. The proposed network exploits differently exposed images and aligns them to the normal exposure in varying degrees. Specifically, we design a novel deformable self-attention module that considers variant long-distance attention and interaction and implements the bidirectional alignment for image fusion. To realize adaptive feature alignment, we employ a learnable weighted summation of different inputs and predict the offsets in the deformable self-attention module, which helps the model generalize well in various scenes. In addition, the multi-scale feature extraction strategy makes the features across different scales complementary and provides fine details and contextual features. Extensive experiments demonstrate that our proposed algorithm performs favorably against state-of-the-art MEF methods.

Affiliations: School of Intelligence Science and Technology, Peking University, Beijing, China; School of Software, Tsinghua University, Beijing, China; Department of Precision Instrument, Tsinghua University, Beijing, China; Department of Psychology, Tsinghua University, Beijing, China; Advanced Computing and Storage Laboratory, Huawei Technologies Company Ltd., Beijing, China; School of Artificial Intelligence, Dalian University of Technology, Dalian, China; Intelligent Vision Department, Huawei Technologies Company Ltd., Beijing, China; Ascend Laboratory, Huawei Technologies Company Ltd., Beijing, China; Institute of Automation, Chinese Academy of Sciences, Beijing, China

Abstract:
In the past years, attention-based Transformers have swept across the field of computer vision, starting a new stage of backbones in semantic segmentation. Nevertheless, semantic segmentation under poor light conditions remains an open problem. Moreover, most papers about semantic segmentation work on images produced by commodity frame-based cameras with a limited framerate, hindering their deployment to auto-driving systems that require instant perception and response on a millisecond timescale. An event camera is a new sensor that generates event data on a microsecond timescale and can work in poor light conditions with a high dynamic range. It looks promising to leverage event cameras to enable perception where commodity cameras fall short, but algorithms for event data are far from mature. Pioneering researchers stack event data as frames so that event-based segmentation is converted to frame-based segmentation, but the characteristics of event data are not explored. Noticing that event data naturally highlight moving objects, we propose a posterior attention module that adjusts the standard attention by the prior knowledge provided by event data. The posterior attention module can be readily plugged into many segmentation backbones. Plugging the posterior attention module into a recently proposed SegFormer network, we get EvSegFormer (the event-based version of SegFormer) with state-of-the-art performance on two datasets (MVSEC and DDD-17) collected for event-based segmentation. Code is available at https://github.com/zexiJia/EvSegFormer to facilitate research on event-based vision.

Abstract:
Recent efforts on learning-based image denoising approaches use unrolled architectures with a fixed number of repeatedly stacked blocks. However, due to difficulties in training networks corresponding to deeper layers, simply stacking blocks may cause performance degradation, and the number of unrolled blocks needs to be manually tuned to find an appropriate value. To circumvent these problems, this paper describes an alternative approach with implicit models. To the best of our knowledge, our approach is the first attempt to model iterative image denoising through an implicit scheme. The model employs implicit differentiation to calculate gradients in the backward pass, thus avoiding the training difficulties of explicit models and the elaborate selection of the iteration number. Our model is parameter-efficient and has only one implicit layer, which is a fixed-point equation that casts the desired noise feature as its solution. By simulating infinite iterations of the model, the final denoising result is given by the equilibrium that is achieved through accelerated black-box solvers. The implicit layer not only captures the non-local self-similarity prior for image denoising, but also facilitates training stability and thereby boosts the denoising performance. Extensive experiments show that our model leads to better performance than state-of-the-art explicit denoisers with enhanced qualitative and quantitative results.
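
The notion of an implicit layer defined by a fixed-point equation can be made concrete with a toy example. The sketch below is not the paper's model: it uses a scalar contraction map in place of a learned network and naive forward iteration in place of the accelerated black-box solvers mentioned above. The names `solve_fixed_point` and `layer`, and the coefficient 0.5, are hypothetical choices for illustration.

```python
def solve_fixed_point(f, z0, tol=1e-10, max_iter=1000):
    """Iterate z <- f(z) until successive iterates differ by less than tol."""
    z = z0
    for _ in range(max_iter):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z  # best estimate if the tolerance was not reached


x = 2.0  # stands in for the (noisy) input that conditions the layer


def layer(z):
    """Toy implicit layer: a contraction whose equilibrium is z* = x / (1 - 0.5)."""
    return 0.5 * z + x


z_star = solve_fixed_point(layer, z0=0.0)
```

The output of the layer is the equilibrium `z_star` itself, independent of how many iterations the solver took to reach it, which is what removes the need to hand-tune a number of unrolled blocks.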

Abstract:
This paper proposes a new glass segmentation method utilizing paired RGB and thermal images. Because most glass is transparent to visible light but opaque to thermal energy, the transmission properties of the two through glass differ greatly, and glass regions of a scene are more distinguishable with a pair of RGB and thermal images than with an RGB image alone. To exploit this unique property, we propose a neural network architecture that effectively combines an RGB-thermal image pair with a new multi-modal fusion module based on attention, and integrates a CNN and a transformer to extract local features and non-local dependencies, respectively. We have also collected a new dataset containing 5551 RGB-thermal image pairs with ground-truth segmentation annotations. The qualitative and quantitative evaluations demonstrate the effectiveness of the proposed approach on fusing RGB and thermal data for glass segmentation. Our code and data are available at https://github.com/Dong-Huo/RGB-T-Glass-Segmentation.

Abstract:
Image dehazing is a representative low-level vision task that estimates latent haze-free images from hazy images. In recent years, convolutional neural network-based methods have dominated image dehazing. However, vision Transformers, which have recently made a breakthrough in high-level vision tasks, have not brought new dimensions to image dehazing. We start with the popular Swin Transformer and find that several of its key designs are unsuitable for image dehazing. To this end, we propose DehazeFormer, which consists of various improvements, such as the modified normalization layer, activation function, and spatial information aggregation scheme. We train multiple variants of DehazeFormer on various datasets to demonstrate its effectiveness. Specifically, on the most frequently used SOTS indoor set, our small model outperforms FFA-Net with only 25% #Param and 5% computational cost. To the best of our knowledge, our large model is the first method with a PSNR over 40 dB on the SOTS indoor set, dramatically outperforming the previous state-of-the-art methods. We also collect a large-scale realistic remote sensing dehazing dataset for evaluating the method’s capability to remove highly non-homogeneous haze. We share our code and dataset at https://github.com/IDKiro/DehazeFormer.

Abstract:
Typical methods for pedestrian detection focus on either tackling mutual occlusions between crowded pedestrians, or dealing with the various scales of pedestrians. Detecting pedestrians with substantial appearance diversities such as different pedestrian silhouettes, different viewpoints or different dressing, remains a crucial challenge. Instead of learning each of these diverse pedestrian appearance features individually as most existing methods do, we propose to perform contrastive learning to guide the feature learning in such a way that the semantic distance between pedestrians with different appearances in the learned feature space is minimized to eliminate the appearance diversities, whilst the distance between pedestrians and background is maximized. To facilitate the efficiency and effectiveness of contrastive learning, we construct an exemplar dictionary with representative pedestrian appearances as prior knowledge to construct effective contrastive training pairs and thus guide contrastive learning. Besides, the constructed exemplar dictionary is further leveraged to evaluate the quality of pedestrian proposals during inference by measuring the semantic distance between the proposal and the exemplar dictionary. Extensive experiments on both daytime and nighttime pedestrian detection validate the effectiveness of the proposed method.

Abstract:
Self-supervised video-based action recognition is a challenging task that requires extracting the principal information characterizing an action from content-diversified videos over large unlabeled datasets. However, most existing methods exploit the natural spatio-temporal properties of video to obtain effective action representations from a visual perspective, while ignoring semantic information that is closer to human cognition. To address this, we propose a self-supervised Video-based Action Recognition method with Disturbances, called VARD, which extracts the principal information of the action in both visual and semantic terms. Specifically, according to cognitive neuroscience research, the recognition ability of humans is activated by visual and semantic attributes. Intuitively, minor changes of the actor or scene in a video do not affect a person’s recognition of the action. On the other hand, different people consistently reach the same conclusion when they recognize the same action video. In other words, for an action video, the necessary information that remains constant despite disturbances in the visual video or the semantic encoding process is sufficient to represent the action. Therefore, to learn such information, we construct a positive clip/embedding for each action video. Compared to the original video clip/embedding, the positive clip/embedding is disturbed visually/semantically by Video Disturbance and Embedding Disturbance. Our objective is to pull the positive closer to the original clip/embedding in the latent space. In this way, the network is driven to focus on the principal information of the action while the impact of sophisticated details and inconsequential variations is weakened. It is worth mentioning that the proposed VARD does not require optical flow, negative samples, or pretext tasks. Extensive experiments conducted on the UCF101 and HMDB51 datasets demonstrate that the proposed VARD effectively improves the strong baseline and outperforms multiple classical and advanced self-supervised action recognition methods.

Abstract:
It is challenging to characterize the intrinsic geometry of high-degree algebraic curves with lower-degree algebraic curves. The reduction in the curve’s degree implies lower computation costs, which is crucial for various practical computer vision systems. In this paper, we develop a characteristic mapping (CM) to recursively degenerate 3n points on a planar curve of nth order to 3(n-1) points on a curve of (n-1)th order. The proposed characteristic mapping enables curve grouping on a line, a curve of the lowest order, that preserves the intrinsic geometric properties of a higher-order curve (ellipse). We prove a necessary condition and derive an efficient arc grouping module that finds valid elliptical arc segments by determining whether the mapped three points are collinear, invoking minimal computation. We embed the module into two recent arc-based ellipse detection methods, which reduces their running time by 25% and 50% on average over five widely used data sets. This yields faster detection than the state-of-the-art algorithms while keeping their precision comparable or even higher. The two CM-embedded methods also significantly surpass a deep learning method on all evaluation metrics.
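
The final step of the arc grouping module, checking whether three mapped points lie on one line, can be sketched as follows. This is only a minimal illustration of a standard collinearity test via the signed triangle area; it does not implement the characteristic mapping itself, and the function name `are_collinear` and the tolerance `tol` are assumptions made for this example, not part of the paper.

```python
def are_collinear(p, q, r, tol=1e-9):
    """Return True if the 2D points p, q, r lie on one line (within tol).

    Hypothetical helper: the name and the tolerance handling are
    illustrative assumptions, not taken from the paper.
    """
    # Twice the signed area of triangle pqr; zero means the points are collinear.
    area2 = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    # Scale the tolerance by the largest coordinate difference so the test
    # behaves consistently for small and large coordinates.
    scale = max(abs(q[0] - p[0]), abs(q[1] - p[1]),
                abs(r[0] - p[0]), abs(r[1] - p[1]), 1.0)
    return abs(area2) <= tol * scale * scale


if __name__ == "__main__":
    print(are_collinear((0, 0), (1, 1), (2, 2)))  # points on the line y = x
    print(are_collinear((0, 0), (1, 1), (2, 3)))  # third point is off the line
```

The test costs two multiplications and a few subtractions per candidate triple, which is consistent with the abstract's emphasis on minimal computation.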

Abstract:
Background cues play an accompanying role in most regression trackers, which directly learn a mapping from dense sampling to soft labels given a search area. In essence, the trackers need to identify a large amount of background information (i.e., other objects and distractor objects) under extreme target-background data imbalance. Therefore, we believe that it is more worthwhile to perform regression tracking based on the informative background cues, using target cues as a supplement. To do this, we propose a capsule-based approach, referred to as CapsuleBI, which performs regression tracking based on a background inpainting network and a target-aware network. The background inpainting network explores the background representations by restoring the region of the target with all available scenes, and the target-aware network captures the target representations by focusing only on the target itself. To explore the subjects/distractors in the whole scene, we propose a global-guided feature construction module, which helps enhance the local features with global information. Both the background and target are encoded in capsules, which can model the relationships between objects or object parts in the background scene. Apart from this, the target-aware network assists the background inpainting network with a novel background-target routing algorithm that guides the background and target capsules to precisely estimate the target location with multi-video relationship information. Extensive experimental results show that the proposed tracker performs favorably against state-of-the-art methods.

Abstract:
Not everybody has professional photography skills and sufficient shooting time, so captured images are occasionally tilted. In this paper, we propose a new and practical task, named Rotation Correction, to automatically correct the tilt with high content fidelity when the rotation angle is unknown. This task can be easily integrated into image editing applications, allowing users to correct rotated images without any manual operations. To this end, we leverage a neural network to predict the optical flows that can warp the tilted images to be perceptually horizontal. Nevertheless, pixel-wise optical flow estimation from a single image is severely unstable, especially for images with large tilt angles. To enhance its robustness, we propose a simple but effective prediction strategy to form a robust elastic warp. In particular, we first regress the mesh deformation, which can be transformed into robust initial optical flows. Then we estimate residual optical flows to give our network the flexibility of pixel-wise deformation, further correcting the details of the tilted images. To establish an evaluation benchmark and train the learning framework, a comprehensive rotation correction dataset is presented with a large diversity in scenes and rotation angles. Extensive experiments demonstrate that even in the absence of the angle prior, our algorithm can outperform other state-of-the-art solutions requiring this prior. The code and dataset are available at https://github.com/nie-lang/RotationCorrection.

Abstract:
Video summarization aims to generate a compact summary of the original video for efficient video browsing. To provide video summaries that are consistent with human perception and contain important content, supervised learning-based video summarization methods have been proposed. These methods aim to learn important content based on continuous frame information of human-created summaries. However, recent methods rarely consider both the inter-frame correlations among non-adjacent frames and the intra-frame attention that attracts humans when representing frame importance. To address these issues, we propose a novel transformer-based method named spatiotemporal vision transformer (STVT) for video summarization. The STVT is composed of three main components: the embedded sequence module, the temporal inter-frame attention (TIA) encoder, and the spatial intra-frame attention (SIA) encoder. The embedded sequence module generates the embedded sequence by fusing the frame embedding, index embedding and segment class embedding to represent the frames. The temporal inter-frame correlations among non-adjacent frames are learned by the TIA encoder with the multi-head self-attention scheme. Then, the spatial intra-frame attention of each frame is learned by the SIA encoder. Finally, a multi-frame loss is computed to drive the learning of the network in an end-to-end trainable manner. By simultaneously using both inter-frame and intra-frame information, our method outperforms state-of-the-art methods on both the SumMe and TVSum datasets. The source code of the spatiotemporal vision transformer will be available at https://github.com/nchucvml/STVT.

Abstract:
The light absorption and scattering of underwater impurities lead to poor underwater imaging quality. Existing data-driven underwater image enhancement (UIE) techniques suffer from the lack of a large-scale dataset containing various underwater scenes and high-fidelity reference images. Besides, the inconsistent attenuation in different color channels and spatial areas is not fully considered for boosted enhancement. In this work, we built a large-scale underwater image (LSUI) dataset, which covers more abundant underwater scenes and provides reference images of better visual quality than existing underwater datasets. The dataset contains 4279 real-world underwater image groups, in which each raw image is paired with its clear reference images, semantic segmentation map and medium transmission map. We also report a U-shape Transformer network in which the transformer model is introduced to the UIE task for the first time. The U-shape Transformer is integrated with a channel-wise multi-scale feature fusion transformer (CMSFFT) module and a spatial-wise global feature modeling transformer (SGFMT) module specially designed for the UIE task, which reinforce the network’s attention to the color channels and spatial areas with more serious attenuation. Meanwhile, in order to further improve the contrast and saturation, a novel loss function combining the RGB, LAB and LCH color spaces is designed following the human vision principle. Extensive experiments on available datasets validate the state-of-the-art performance of the reported technique, with more than 2 dB superiority. The dataset and demo code are available at https://bianlab.github.io/.

Abstract:
Benefiting from the intuitiveness and naturalness of sketch interaction, sketch-based video retrieval (SBVR) has received considerable attention in the video retrieval research area. However, most existing SBVR research still lacks the capability of accurate video retrieval with fine-grained scene content. To address this problem, in this paper we investigate a new task, which focuses on retrieving the target video by utilizing a fine-grained storyboard sketch depicting the scene layout and major foreground instances’ visual characteristics (e.g., appearance, size, pose, etc.) of video; we call such a task “fine-grained scene-level SBVR”. The most challenging issue in this task is how to perform scene-level cross-modal alignment between sketch and video. Our solution consists of two parts. First, we construct a scene-level sketch-video dataset called SketchVideo, in which sketch-video pairs are provided and each pair contains a clip-level storyboard sketch and several keyframe sketches (corresponding to video frames). Second, we propose a novel deep learning architecture called Sketch Query Graph Convolutional Network (SQ-GCN). In SQ-GCN, we first adaptively sample the video frames to improve video encoding efficiency, and then construct appearance and category graphs to jointly model visual and semantic alignment between sketch and video. Experiments show that our fine-grained scene-level SBVR framework with SQ-GCN architecture outperforms the state-of-the-art fine-grained retrieval methods. The SketchVideo dataset and SQ-GCN code are available in the project webpage https://iscas-mmsketch.github.io/FG-SL-SBVR/.

Abstract:
By introducing parameters with local information, several types of orthogonal moments have recently been developed for the extraction of local features in an image. However, with the existing orthogonal moments, local features cannot be well controlled with these parameters. The reason is that the distribution of zeros of these moments’ basis functions cannot be well adjusted by the introduced parameters. To overcome this obstacle, a new framework, transformed orthogonal moment (TOM), is set up. Most existing continuous orthogonal moments, such as Zernike moments and fractional-order orthogonal moments (FOOMs), are special cases of TOM. To control the distribution of zeros of the basis function, a novel local constructor is designed, and the local orthogonal moment (LOM) is proposed. The distribution of zeros of LOM’s basis function can be adjusted with parameters introduced by the designed local constructor. Consequently, the locations from which LOM extracts local features are more accurate than those of FOOMs. In comparison with Krawtchouk moments, Hahn moments, etc., the range over which LOM extracts local features is insensitive to the order. Experimental results demonstrate that LOM can be utilized to extract local features in an image.

Abstract:
Blind image super-resolution (blind SR) aims to generate high-resolution (HR) images from low-resolution (LR) input images with unknown degradations. To enhance the performance of SR, the majority of blind SR methods introduce an explicit degradation estimator, which helps the SR model adjust to unknown degradation scenarios. Unfortunately, it is impractical to provide concrete labels for the multiple combinations of degradations (e.g., blurring, noise, or JPEG compression) to guide the training of the degradation estimator. Moreover, the special designs for certain degradations hinder the models from generalizing to other degradations. Thus, it is imperative to devise an implicit degradation estimator that can extract discriminative degradation representations for all types of degradations without requiring the supervision of degradation ground-truth. To this end, we propose a Meta-Learning based Region Degradation Aware SR Network (MRDA), including a Meta-Learning Network (MLN), a Degradation Extraction Network (DEN), and a Region Degradation Aware SR Network (RDAN). To handle the lack of ground-truth degradation, we use the MLN to rapidly adapt to the specific complex degradation after several iterations and extract implicit degradation information. Subsequently, a teacher network MRDAT is designed to further utilize the degradation information extracted by MLN for SR. However, MLN requires iterating on paired LR and HR images, which are unavailable in the inference phase. Therefore, we adopt knowledge distillation (KD) to make the student network learn to directly extract the same implicit degradation representation (IDR) as the teacher from LR images. Furthermore, we introduce an RDAN module that is capable of discerning regional degradations, allowing IDR to adaptively influence various texture patterns. Extensive experiments under classic and real-world degradation settings show that MRDA achieves SOTA performance and can generalize to various degradation processes.

Abstract:
Establishing reliable correspondences between two views is one of the most important components of various vision tasks. This paper proposes a novel sparse-to-local-dense (S2LD) matching method to conduct fully differentiable correspondence estimation with the prior from epipolar geometry. The sparse-to-local-dense matching asymmetrically establishes correspondences with consistent sub-pixel coordinates while reducing the computation of matching. The salient features are explicitly located, and the description is conditioned on both views with the global receptive field provided by the attention mechanism. The correspondences are progressively established in multiple levels to reduce the underlying re-projection error. We further propose a 3D noise-aware regularizer with differentiable triangulation. Additional guidance from 3D space is encoded by the regularizer in training to handle the supervision noise caused by the errors in camera poses and depth maps. The proposed method demonstrates outstanding matching accuracy and geometric estimation capability on multiple datasets and tasks.

Abstract:
In image processing, images are often composed of partial views due to the uncertainty of collection. How to efficiently process such images, known as incomplete multi-view learning, has attracted widespread attention. The incompleteness and diversity of multi-view data increase the difficulty of annotation, resulting in a divergence of label distribution between the training and testing data, known as label shift. However, existing incomplete multi-view methods generally assume that the label distribution is consistent and rarely consider the label shift scenario. To address this new but important challenge, we propose a novel framework termed Incomplete Multi-view Learning under Label Shift (IMLLS). In this framework, we first give the formal definitions of IMLLS and of the bidirectional complete representation, which describes the intrinsic and common structure. Then, a multilayer perceptron that combines the reconstruction and classification losses is employed to learn the latent representation, whose existence, consistency and universality are proved with the theoretical satisfaction of the label shift assumption. After that, to align the label distribution, the learned representation and trained source classifier are used to estimate the importance weight by designing a new estimation scheme that balances the error generated by finite samples in theory. Finally, the trained classifier reweighted by the estimated weight is fine-tuned to reduce the gap between the source and target representations. Extensive experimental results validate the effectiveness of our algorithm over existing state-of-the-art methods in various aspects, together with its effectiveness in discriminating schizophrenic patients from healthy controls.

Abstract:
Improving boundary segmentation results has recently attracted increasing attention in the field of semantic segmentation. Since existing popular methods usually exploit the long-range context, the boundary cues are obscure in the feature space, leading to poor boundary results. In this paper, we propose a novel conditional boundary loss (CBL) for semantic segmentation to improve the performance of the boundaries. The CBL creates a unique optimization goal for each boundary pixel, conditioned on its surrounding neighbors. The conditional optimization of the CBL is easy yet effective. In contrast, most previous boundary-aware methods have difficult optimization goals or may cause potential conflicts with the semantic segmentation task. Specifically, the CBL enhances the intra-class consistency and inter-class difference, by pulling each boundary pixel closer to its unique local class center and pushing it away from its different-class neighbors. Moreover, the CBL filters out noisy and incorrect information to obtain precise boundaries, since only surrounding neighbors that are correctly classified participate in the loss calculation. Our loss is a plug-and-play solution that can be used to improve the boundary segmentation performance of any semantic segmentation network. We conduct extensive experiments on ADE20K, Cityscapes, and Pascal Context, and the results show that applying the CBL to various popular segmentation networks can significantly improve the mIoU and boundary F-score performance.

Abstract:
Spectral Embedding (SE) has often been used to map data points from non-linear manifolds to linear subspaces for the purpose of classification and clustering. Despite significant advantages, the subspace structure of data in the original space is not preserved in the embedding space. To address this issue, subspace clustering has been proposed, which replaces the SE graph affinity with a self-expression matrix. It works well if the data lies in a union of linear subspaces; however, the performance may degrade in real-world applications where data often spans non-linear manifolds. To address this problem, we propose a novel structure-aware deep spectral embedding by combining a spectral embedding loss and a structure preservation loss. To this end, a deep neural network architecture is proposed that simultaneously encodes both types of information and aims to generate a structure-aware spectral embedding. The subspace structure of the input data is encoded by using attention-based self-expression learning. The proposed algorithm is evaluated on six publicly available real-world datasets. The results demonstrate the excellent clustering performance of the proposed algorithm compared to the existing state-of-the-art methods. The proposed algorithm also exhibits better generalization to unseen data points, and it is scalable to larger datasets without requiring significant computational resources.

Abstract:
Projected clustering is the foundation of deep clustering models. Aiming to capture the essence of deep clustering, we propose a novel projected clustering framework by summarizing the core properties of prevalent powerful models, especially deep models. First, we introduce the aggregated mapping, consisting of projection learning and neighbor estimation, to obtain a clustering-friendly representation. Importantly, we theoretically prove that simple clustering-friendly representation learning may suffer from severe degeneration, which can be regarded as over-fitting. Roughly speaking, the well-trained model would group neighboring points into many sub-clusters. These small sub-clusters may scatter randomly because there is no connection between them. The degeneration may occur more frequently as model capacity increases. We accordingly develop a self-evolution mechanism that implicitly aggregates the sub-clusters; the proposed method can alleviate the potential risk of over-fitting and obtain prominent improvement. The ablation experiments support the theoretical analysis and verify the effectiveness of the neighbor-aggregation mechanism. Finally, we show how to choose the unsupervised projection function through two specific examples, including a linear method (namely locality analysis) and a non-linear model.

Abstract:
Vision Transformers (ViTs) split an image into fixed-size patches as tokens. This strategy has succeeded in computer vision tasks, but introduces many tokens that are similar in semantics and appearance. This work proposes Token Merger to spot redundant tokens and merge them into a compact representation to accelerate ViTs. For each forward inference, the Token Merger first identifies meta tokens to represent meaningful cues of the image content, then adaptively merges similar tokens into a uniform one by referring to the meta tokens. To pursue a reasonable tradeoff between accuracy and efficiency, we further introduce learnable gates to adaptively decide the token merge ratios of different layers. As a generalizable module, Token Merger can be easily plugged into different layers of ViTs to boost their efficiency. Visualizations show that Token Merger progressively merges tokens and finally learns a compact set of tokens representing clear semantics. Compared with token pruning methods, Token Merger is more effective in preserving meaningful contextual cues, and thus performs and generalizes substantially better in different vision tasks. Extensive experiments and comparisons with other state-of-the-art downsampling methods also demonstrate its promising performance. For instance, it reduces 95% tokens and accelerates the inference speed by 62%. Meanwhile, the ImageNet classification accuracy only drops by 0.4%. The code will be available.
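
The idea of merging similar tokens with reference to meta tokens can be illustrated with a small sketch. The abstract does not specify the similarity measure or the merge rule, so the choices below (cosine similarity, hard assignment to the nearest meta token, and averaging within each group) are assumptions made purely for illustration, and the names `cosine` and `merge_tokens` are hypothetical. The learnable gates that set per-layer merge ratios are not modeled.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_tokens(tokens, meta_tokens):
    """Assign each token to its most similar meta token and average each group.

    Assumed merge rule for illustration only; returns one merged token per
    meta token, so the sequence length drops from len(tokens) to
    len(meta_tokens).
    """
    groups = [[] for _ in meta_tokens]
    for t in tokens:
        best = max(range(len(meta_tokens)),
                   key=lambda i: cosine(t, meta_tokens[i]))
        groups[best].append(t)
    merged = []
    for group, meta in zip(groups, meta_tokens):
        if group:
            dim = len(group[0])
            merged.append([sum(t[d] for t in group) / len(group)
                           for d in range(dim)])
        else:
            # No token was assigned: fall back to the meta token itself.
            merged.append(list(meta))
    return merged


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    meta = [[1.0, 0.0], [0.0, 1.0]]
    print(merge_tokens(tokens, meta))  # four tokens reduced to two
```

Unlike pruning, which discards tokens outright, every input token here still contributes to some output token, which is the property the abstract credits for better preservation of contextual cues.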

Abstract:
In this paper, we address the problem of multi-view clustering (MVC), integrating the close relationships among views to learn a consistent clustering result, via triplex information maximization (TIM). TIM works by proposing three essential principles, each of which is realized by a formulation of maximization of mutual information. 1) Principle 1: Contained. The first and foremost thing for MVC is to fully employ the self-contained information in each view. 2) Principle 2: Complementary. The feature-level complementary information across pairwise views should be first quantified and then integrated for improving clustering. 3) Principle 3: Compatible. The rich cluster-level shared compatible information among the individual clusterings of each view is significant for ensuring a better final consistent result. Following these principles, TIM can enjoy the best of view-specific, cross-view feature-level, and cross-view cluster-level information within/among views. For principle 2, we design an automatic view correlation learning (AVCL) mechanism to quantify how much complementary information exists across views by automatically learning the cross-view weights between pairwise views, instead of view-specific weights as most existing MVC methods do. Specifically, we propose two different strategies for AVCL, i.e., feature-based and cluster-based strategies, for effective cross-view weight learning, thus leading to two versions of our method, TIM-F and TIM-C, respectively. We further present a two-stage method for optimization of the proposed methods, followed by the theoretical convergence and complexity analysis. Extensive experimental results suggest the effectiveness and superiority of our methods over many state-of-the-art methods.

Abstract:
The visual feature pyramid has shown its superiority in both effectiveness and efficiency in a variety of applications. However, current methods overly focus on inter-layer feature interactions while disregarding the importance of intra-layer feature regulation. Despite some attempts to learn a compact intra-layer feature representation with the use of attention mechanisms or vision transformers, they overlook the crucial corner regions that are essential for dense prediction tasks. To address this problem, we propose a Centralized Feature Pyramid (CFP) network for object detection, which is based on a globally explicit centralized feature regulation. Specifically, we first propose a spatial explicit visual center scheme, where a lightweight MLP is used to capture the global long-range dependencies, and a parallel learnable visual center mechanism is used to capture the local corner regions of the input images. Based on this, we then propose a globally centralized regulation for the commonly used feature pyramid in a top-down fashion, where the explicit visual center information obtained from the deepest intra-layer feature is used to regulate frontal shallow features. Compared to the existing feature pyramids, CFP not only has the ability to capture the global long-range dependencies but also efficiently obtains an all-round yet discriminative feature representation. Experimental results on the challenging MS-COCO dataset validate that our proposed CFP can achieve consistent performance gains on the state-of-the-art YOLOv5 and YOLOX object detection baselines.

Abstract:
Surface-defect detection aims to accurately locate and classify defect areas in images via pixel-level annotations. Different from the objects in traditional image segmentation, defect areas comprise a small group of pixels with random shapes, characterized by uncommon textures and edges that are inconsistent with the normal surface patterns of industrial products. This task-specific knowledge is hardly considered in the current methods. Therefore, we propose a two-stage “promotion-suppression” transformer (PST) framework, which explicitly adopts the wavelet features to guide the network to focus on the detailed features in the images. Specifically, in the promotion stage, we propose the Haar augmentation module to improve the backbone’s sensitivity to high-frequency details. However, the background noise is inevitably amplified as well because it also constitutes high-frequency information. Therefore, a quadratic feature-fusion module (QFFM) is proposed in the suppression stage, which exploits the two properties of noise: independence and attenuation. The QFFM analyzes the similarities and differences between noise and defect features to achieve noise suppression. Compared with the traditional linear-fusion approach, the QFFM is more sensitive to high-frequency details; thus, it can afford highly discriminative features. Extensive experiments are conducted on three datasets, namely DAGM, MT, and CRACK500, which demonstrate the superiority of the proposed PST framework.

Abstract:
Multiple-choice visual question answering (VQA) is a challenging task due to the requirement of thorough multimodal understanding and complicated inter-modality relationship reasoning. To solve the challenge, previous approaches usually resort to different multimodal interaction modules. Despite their effectiveness, we find that existing methods may exploit a newly discovered bias (vision-answer bias) to make answer predictions, leading to suboptimal VQA performance and poor generalization. To solve these issues, we propose a Causality-based Multimodal Interaction Enhancement (CMIE) method, which is model-agnostic and can be seamlessly incorporated into a wide range of VQA approaches in a plug-and-play manner. Specifically, our CMIE contains two key components: a causal intervention module and a counterfactual interaction learning module. The former is devoted to removing the spurious correlation between the visual content and the answer caused by the vision-answer bias, and the latter helps capture discriminative inter-modality relationships by directly supervising multimodal interaction training via an interactive loss. Extensive experimental results on three public benchmarks and one reorganized dataset show that the proposed method can significantly improve seven representative VQA models, demonstrating the effectiveness and generalizability of the CMIE.

Abstract:
Source-Free Domain Adaptation (SFDA) is becoming topical to address the challenge of distribution shift between training and deployment data, while also relaxing the requirement of source data availability during target domain adaptation. In this paper, we focus on SFDA for semantic segmentation, in which pseudo-labeling-based target domain self-training is a common solution. However, pseudo labels generated by the source models are particularly unreliable on the target domain data due to the domain shift issue. Therefore, we propose to use a Bayesian Neural Network (BNN) to improve the target self-training by better estimating and exploiting pseudo-label uncertainty. With the uncertainty estimation of BNNs, we introduce two novel self-training based components: Uncertainty-aware Online Teacher-Student Learning (UOTSL) and Uncertainty-aware FeatureMix (UFM). Extensive experiments on two popular benchmarks, GTA5 → Cityscapes and SYNTHIA → Cityscapes, show the superiority of our proposed method with mIoU gains of 3.6% and 5.7% over the state-of-the-art, respectively.
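
To make the notion of pseudo-label uncertainty concrete, the sketch below shows one common way to derive it from several stochastic forward passes: average the per-class probabilities and compute the entropy of the mean. The abstract does not state which uncertainty measure the method uses, so predictive entropy, the threshold `max_entropy`, and the function names are all assumptions for illustration; UOTSL and UFM are not modeled.

```python
import math


def predictive_entropy(prob_samples):
    """Mean class probabilities and their entropy over stochastic passes.

    prob_samples: list of class-probability vectors for one pixel, one
    vector per sampled forward pass (hypothetical input format).
    """
    n = len(prob_samples)
    num_classes = len(prob_samples[0])
    mean = [sum(s[c] for s in prob_samples) / n for c in range(num_classes)]
    # Entropy in nats; zero-probability classes contribute nothing.
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    return mean, entropy


def select_pseudo_label(prob_samples, max_entropy):
    """Return the argmax class, or None if the prediction is too uncertain."""
    mean, entropy = predictive_entropy(prob_samples)
    if entropy > max_entropy:
        return None  # pixel is excluded from self-training
    return max(range(len(mean)), key=lambda c: mean[c])


if __name__ == "__main__":
    consistent = [[0.9, 0.1], [0.95, 0.05]]
    conflicting = [[0.9, 0.1], [0.1, 0.9]]
    print(select_pseudo_label(consistent, max_entropy=0.5))
    print(select_pseudo_label(conflicting, max_entropy=0.5))
```

When the sampled passes disagree, the averaged distribution flattens and its entropy rises, so the pixel is dropped rather than trained on with an unreliable label.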

Abstract:
Standard convolution applied to image inpainting leads to color discrepancy and blurriness because it treats valid and invalid (hole) regions without distinction, a problem that was partially amended by partial convolution (PConv). In PConv, a binary (hard) mask is maintained as an indicator of valid and invalid pixels, so that valid and invalid pixels are treated differently. However, a binary mask cannot describe the degree of validity of an impaired pixel. In addition, the mask and image paths are separated, without sharing a convolution kernel or exchanging information mutually, which reduces data utilization efficiency. In this paper, a mask-guided convolution (MagConv) is proposed for image inpainting. In MagConv, the mask and image paths share a convolution kernel to interact with each other and form a joint optimization scheme. In addition, a learnable piecewise activation function is proposed to replace the reciprocal function of PConv, providing more flexible and adaptable compensation to convolution contaminated by invalid pixels. It also results in a soft mask of floating-point coefficients from 0 to 1 capable of indicating the degree of validity of each pixel. Last but not least, MagConv splits the convolution kernel into positive and negative weights so that they can evaluate the validity of each pixel faithfully. Qualitative and quantitative experiments on the CelebA, Paris StreetView and Places2 datasets demonstrate that our method achieves favorable visual quality against state-of-the-art approaches.
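
As an illustration of the PConv baseline that this abstract builds on, the following is a minimal single-channel sketch of partial convolution with a hard mask and reciprocal re-scaling. It is not MagConv: the function name `partial_conv2d`, the "valid" border handling and the omission of a bias term are simplifying assumptions made for this sketch only.

```python
def partial_conv2d(image, mask, kernel):
    """Single-channel partial convolution with 'valid' borders (no padding).

    image, mask: lists of rows; mask entries are 1 (valid) or 0 (hole).
    kernel: k x k list of weights.
    Returns (output, updated_mask).
    """
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    window_size = k * k
    output, new_mask = [], []
    for i in range(rows - k + 1):
        out_row, mask_row = [], []
        for j in range(cols - k + 1):
            acc = 0.0
            valid = 0
            for di in range(k):
                for dj in range(k):
                    m = mask[i + di][j + dj]
                    # Hole pixels are zeroed out before weighting.
                    acc += kernel[di][dj] * image[i + di][j + dj] * m
                    valid += m
            if valid > 0:
                # Reciprocal re-scaling compensates for the missing pixels.
                out_row.append(acc * window_size / valid)
                mask_row.append(1)
            else:
                # No valid pixel in the window: the output stays a hole.
                out_row.append(0.0)
                mask_row.append(0)
        output.append(out_row)
        new_mask.append(mask_row)
    return output, new_mask


if __name__ == "__main__":
    flat = [[2.0] * 3 for _ in range(3)]
    hole_in_centre = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
    mean_kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
    print(partial_conv2d(flat, hole_in_centre, mean_kernel))
```

The updated mask is still binary here, which is the limitation the abstract points out: it marks a window as valid or invalid but cannot express how reliable the result is.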

Abstract:
Considering the spectral properties of images, we propose a new self-attention mechanism with highly reduced computational complexity, up to a linear rate. To better preserve edges while promoting similarity within objects, we propose individualized processes over different frequency bands. In particular, we study a case where the process is merely over low-frequency components. By ablation study, we show that low-frequency self-attention can achieve very close or better performance relative to full frequency even without retraining the network. Accordingly, we design and embed novel plug-and-play modules to the head of a CNN network that we refer to as FsaNet. The frequency self-attention 1) requires only a few low-frequency coefficients as input, 2) can be mathematically equivalent to spatial domain self-attention with linear structures, and 3) simplifies the token mapping (1×1 convolution) stage and the token mixing stage simultaneously. We show that frequency self-attention requires 87.29% ~ 90.04% less memory, 96.13% ~ 98.07% fewer FLOPs, and 97.56% ~ 98.18% less run time than regular self-attention. Compared to other ResNet101-based self-attention networks, FsaNet achieves a new state-of-the-art result (83.0% mIoU) on the Cityscapes test dataset and competitive results on ADE20k and VOCaug. FsaNet can also enhance Mask R-CNN for instance segmentation on COCO. In addition, utilizing the proposed module, Segformer can be boosted on a series of models with different scales, and Segformer-B5 can be improved even without retraining. Code is accessible at https://github.com/zfy-csu/FsaNet.

Abstract:
Light field (LF) cameras suffer from a fundamental trade-off between spatial and angular resolutions. Additionally, due to the significant amount of data that needs to be recorded, the Lytro ILLUM, a modern LF camera, can only capture three frames per second. In this paper, we consider space-time super-resolution (SR) for LF videos, aiming at generating high-resolution and high-frame-rate LF videos from low-resolution and low-frame-rate observations. Extending existing space-time video SR methods to this task directly will meet two key challenges: 1) how to re-organize sub-aperture images (SAIs) efficiently and effectively given highly redundant LF videos, and 2) how to aggregate complementary information between multiple SAIs and frames considering the coherence in LF videos. To address the above challenges, we propose a novel framework for space-time super-resolving LF videos for the first time. First, we propose a novel Multi-Scale Dilated SAI Re-organization strategy for re-organizing SAIs into auxiliary view stacks with decreasing resolution as the Chebyshev distance in the angular dimension increases. In particular, the auxiliary view stack with original resolution preserves essential visual details, while the down-scaled view stacks capture long-range contextual information. Second, we propose the Multi-Scale Aggregated Feature extractor and the Angular-Assisted Feature Interpolation module to utilize and aggregate information from the spatial, angular, and temporal dimensions in LF videos. The former aggregates similar contents from different SAIs and frames for subsequent reconstruction in a disparity-free manner at the feature level, whereas the latter interpolates intermediate frames temporally by implicitly aggregating geometric information. Compared to other potential approaches, experimental results demonstrate that the reconstructed LF videos generated by our framework achieve higher reconstruction quality and better preserve the LF parallax structure and temporal consistency. The implementation code is available at https://github.com/zeyuxiao1997/LFSTVSR.

Abstract:
In this paper, deep learning-based techniques for film grain removal and synthesis that can be applied in video coding are proposed. Film grain is inherent in analog film content because of the physical process of capturing images and video on film. It can also be present in digital content where it is purposely added to reflect the era of analog film and to evoke certain emotions in the viewer or enhance the perceived quality. In the context of video coding, the random nature of film grain makes it both difficult to preserve and very expensive to compress. To better preserve it while compressing the content efficiently, film grain is removed and modeled before video encoding and then restored after video decoding. In this paper, a film grain removal model based on an encoder-decoder architecture and a film grain synthesis model based on a conditional generative adversarial network (cGAN) are proposed. Both models are trained on a large dataset of pairs of clean (grain-free) and grainy images. Quantitative and qualitative evaluations of the developed solutions were conducted and showed that the proposed film grain removal model is effective in filtering film grain at different intensity levels using two configurations: 1) a non-blind configuration where the film grain level of the grainy input is known and provided as input; and 2) a blind configuration where the film grain level is unknown. As for the film grain synthesis task, the experimental results show that the proposed model is able to reproduce realistic film grain with a controllable intensity level specified as input.

Abstract:
Video frame interpolation is an important low-level vision task, which can increase frame rate for a more fluent visual experience. Existing methods have achieved great success by employing advanced motion models and synthesis networks. However, the spatial redundancy when synthesizing the target frame has not been fully explored, which can result in a lot of inefficient computation. On the other hand, the computation compression degree in frame interpolation is highly dependent on both texture distribution and scene motion, which requires understanding the spatial-temporal information of each input frame pair for a better compression degree selection. In this work, we propose a novel two-stage frame interpolation framework termed WaveletVFI to address the above problems. It first estimates intermediate optical flow with a lightweight motion perception network, and then a wavelet synthesis network uses flow-aligned context features to predict multi-scale wavelet coefficients with sparse convolution for efficient target frame reconstruction, where the sparse valid masks that control computation in each scale are determined by a crucial threshold ratio. Instead of setting a fixed value like previous methods, we find that embedding a classifier in the motion perception network to learn a dynamic threshold for each sample can achieve more computation reduction with almost no loss of accuracy. On the common high-resolution and animation frame interpolation benchmarks, the proposed WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.

Abstract:
Halftoning aims to reproduce a continuous-tone image with pixels whose intensities are constrained to two discrete levels. This technique has been deployed on every printer, and the majority of them adopt fast methods (e.g., ordered dithering, error diffusion) that fail to render structural details, which determine halftone’s quality. Other prior methods of pursuing visual pleasure by searching for the optimal halftone solution, on the contrary, suffer from their high computational cost. In this paper, we propose a fast and structure-aware halftoning method via a data-driven approach. Specifically, we formulate halftoning as a reinforcement learning problem, in which each binary pixel’s value is regarded as an action chosen by a virtual agent with a shared fully convolutional neural network (CNN) policy. In the offline phase, an effective gradient estimator is utilized to train the agents in producing high-quality halftones in one action step. Then, halftones can be generated online by one fast CNN inference. Besides, we propose a novel anisotropy suppressing loss function, which brings the desirable blue-noise property. Finally, we find that optimizing SSIM could result in holes in flat areas, which can be avoided by weighting the metric with the contone’s contrast map. Experiments show that our framework can effectively train a light-weight CNN, which is 15x faster than previous structure-aware methods, to generate blue-noise halftones with satisfactory visual quality. We also present a prototype of deep multitoning to demonstrate the extensibility of our method.
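
For context, the fast baseline family mentioned above (ordered dithering) can be sketched in a few lines. This is the classical technique, not the reinforcement-learning method of the paper; the function name `ordered_dither` and the choice of a 4×4 Bayer threshold matrix are assumptions made for this illustration.

```python
# Classical 4x4 Bayer index matrix (values 0..15).
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]


def ordered_dither(image):
    """Halftone a continuous-tone image with ordered dithering.

    image: list of rows with intensities in [0, 1].
    Returns a list of rows containing only 0 or 1.
    """
    halftone = []
    for y, row in enumerate(image):
        out_row = []
        for x, value in enumerate(row):
            # The threshold depends only on the pixel position, so the
            # method is fast but blind to image structure.
            threshold = (BAYER_4X4[y % 4][x % 4] + 0.5) / 16.0
            out_row.append(1 if value > threshold else 0)
        halftone.append(out_row)
    return halftone


if __name__ == "__main__":
    mid_gray = [[0.5] * 4 for _ in range(4)]
    for row in ordered_dither(mid_gray):
        print(row)
```

Because every pixel is compared against a fixed, position-dependent threshold, the result reproduces average tone but cannot adapt to edges or texture, which is the weakness that structure-aware methods target.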

Abstract:
Camera lenses often suffer from optical aberrations, causing radial distortion in the captured images. In those images, there exists a clear and general physical distortion model. However, in existing solutions, such rich geometric prior is under-utilized, and the formulation of an effective prediction target is under-explored. To this end, we introduce Radial Distortion TRansformer (RDTR), a new framework for radial distortion rectification. Our RDTR includes a model-aware pre-training stage for distortion feature extraction and a deformation estimation stage for distortion rectification. Technically, on the one hand, we formulate the general radial distortion (i.e., barrel distortion and pincushion distortion) in camera-captured images with a shared geometric distortion model and perform a unified model-aware pre-training for its learning. With the pre-training, the network is capable of encoding the specific distortion pattern of a radially distorted image. After that, we transfer the learned representations to the learning of distortion rectification. On the other hand, we introduce a new prediction target called backward warping flow for rectifying images with any resolution while avoiding image defects. Extensive experiments are conducted on our synthetic dataset, and the results demonstrate that our method achieves state-of-the-art performance while operating in real-time. Besides, we also validate the generalization of RDTR on real-world images. Our source code and the proposed dataset are publicly available at https://github.com/wwd-ustc/RDTR.

Abstract:
Interactive object segmentation aims to produce object masks with user interactions, such as clicks, bounding boxes, and scribbles. The click point is the most popular interactive cue because of its efficiency, and related deep learning methods have attracted a lot of interest in recent years. Most works encode click points as Gaussian maps and concatenate them with images as the model's input. However, the spatial and semantic information of Gaussian maps becomes noisy after passing through multiple convolution layers and is not fully exploited by top layers for mask prediction. To pass click information to top layers exactly and efficiently, we propose a coarse mask guided model (CMG), which predicts coarse masks with a coarse module to guide the object mask prediction. Specifically, the coarse module encodes user clicks as query features and enriches their semantic information with backbone features through transformer layers; coarse masks are then generated based on the enriched query features and fed into CMG's decoder. Benefiting from the efficiency of the transformer, CMG's coarse module and decoder module are lightweight and computationally efficient, making the interaction process smoother. Experiments on several segmentation benchmarks demonstrate the effectiveness of our method, which achieves new state-of-the-art results compared with previous works.

Abstract:
It is desirable to develop efficient image rescaling methods to transmit digital images with different resolutions between devices and assure visual quality. In image downscaling, the inevitable loss of high-frequency information makes the reverse upscaling highly ill-posed. Recent approaches focus on joint learning of image downscaling and upscaling (i.e., rescaling). However, existing methods still fail to recover satisfactory high-frequency signals when upscaling. To solve this problem, we propose high-frequency flow (HfFlow), which learns the distribution of high-frequency signals during rescaling. HfFlow is an overall invertible framework with a conditional flow on the high-frequency space to compensate for the information lost during downscaling. To facilitate finding the optimal upscaling solution, we introduce a reference low-resolution (LR) manifold and propose a cross-entropy Gaussian loss (CGloss) to force the downscaled manifold closer to the reference LR manifold while simultaneously recovering missing details. HfFlow can be generalized to other scale transformation tasks such as image colorization with its excellent rescaling capacity. Qualitative and quantitative experimental evaluations demonstrate that HfFlow restores rich high-frequency details and outperforms state-of-the-art rescaling methods in PSNR, SSIM, and perceptual quality metrics.

Abstract:
In a typical image inpainting task, the location and shape of the damaged or masked area are often random and irregular. The vanilla convolutions widely used in learning-based inpainting models treat all spatial features as valid and share parameters across regions, making it difficult for them to cope with such irregular damage, so models tend to produce inpainting results with color discrepancy and blurriness. In this paper, we propose a novel Context Adaptive Network (CANet) to address this issue. The main idea of the proposed CANet is to generate different weights depending on the input, which may help to complete images with multiple forms of damage in a flexible way. Specifically, the proposed CANet has two novel context adaptive modules, namely, the context adaptive block (CAB) and the cross-scale contextual attention (CSCA), which utilize attention mechanisms to cope with diverse content breakdowns. The proposed CAB, during the forward propagation, uses an adaptive term to determine the relative importance of the adaptive term and the convolution kernel, so as to dynamically balance features based on the degree of breakage (confidence level or soft mask), and the overall calculation is formulated as a classic convolution implementation with an additional attention term to describe local structure. Besides, the proposed CSCA not only takes advantage of the contextual attention module, but also considers cross-scale information transfer to generate reasonable features for damaged areas, thus alleviating the limited long-range modeling capability of convolutional neural networks. Qualitative and quantitative experiments show that our method performs better than state-of-the-art methods, producing clearer, more coherent and visually plausible inpainting results. The code can be found at github.com/dengyecode/CANet_image_inpainting

Abstract:
Existing graph clustering networks heavily rely on a predefined yet fixed graph, which can lead to failures when the initial graph fails to accurately capture the data topology structure of the embedding space. In order to address this issue, we propose a novel clustering network called Embedding-Induced Graph Refinement Clustering Network (EGRC-Net), which effectively utilizes the learned embedding to adaptively refine the initial graph and enhance the clustering performance. To begin, we leverage both semantic and topological information by employing a vanilla auto-encoder and a graph convolution network, respectively, to learn a latent feature representation. Subsequently, we utilize the local geometric structure within the feature embedding space to construct an adjacency matrix for the graph. This adjacency matrix is dynamically fused with the initial one using our proposed fusion architecture. To train the network in an unsupervised manner, we minimize the Jeffreys divergence between multiple derived distributions. Additionally, we introduce an improved approximate personalized propagation of neural predictions to replace the standard graph convolution network, enabling EGRC-Net to scale effectively. Through extensive experiments conducted on nine widely-used benchmark datasets, we demonstrate that our proposed methods consistently outperform several state-of-the-art approaches. Notably, EGRC-Net achieves an improvement of more than 11.99% in Adjusted Rand Index (ARI) over the best baseline on the DBLP dataset. Furthermore, our scalable approach exhibits a 10.73% gain in ARI while reducing memory usage by 33.73% and decreasing running time by 19.71%. The code for EGRC-Net will be made publicly available at https://github.com/ZhihaoPENG-CityU/EGRC-Net.
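
The training objective above is built on the Jeffreys divergence, which is the symmetrized Kullback-Leibler divergence, J(P, Q) = KL(P || Q) + KL(Q || P). A minimal sketch for two discrete distributions follows; the function names and the small `eps` smoothing constant are assumptions of this example, and the abstract does not state which distributions EGRC-Net actually compares.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    # eps guards against log(0); terms with p_i == 0 contribute nothing.
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def jeffreys_divergence(p, q):
    """Symmetrized KL divergence: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(jeffreys_divergence(p, q))
    print(jeffreys_divergence(q, p))  # same value: the measure is symmetric
```

Unlike plain KL divergence, this quantity does not depend on the order of its arguments, so neither distribution is privileged as the target.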

Abstract:
To robustly detect arbitrary-shaped scene texts, bottom-up methods are widely explored for their flexibility. Due to the highly homogeneous texture and cluttered distribution of scene texts, it is nontrivial for segmentation-based methods to discover the separatrixes between adjacent instances. To effectively separate nearby texts, many methods adopt the seed expansion strategy that segments shrunken text regions as seed areas, and then iteratively expands the seed areas into intact text regions. In search of a more straightforward way that does not rely on seed area segmentation and avoids the possible error accumulation brought by iterative processing, we propose a redundancy removal strategy. In this work, we directly explore two types of fuzzy semantics, text and separatrix, that do not possess specific boundaries, and separate cluttered instances by excluding the separatrix pixels from text regions. To deal with the fuzzy semantic boundaries, we also conduct reliability analysis in both the optimization and inference stages to suppress false positive pixels at ambiguous locations. Experiments on benchmark datasets demonstrate the effectiveness of our method.

Abstract:
Unsupervised person re-identification (re-ID) remains a challenging task. While extensive research has focused on the framework design and loss function, this paper shows that sampling strategy plays an equally important role. We analyze the reasons for the performance differences between various sampling strategies under the same framework and loss function. We suggest that deteriorated over-fitting is an important factor causing poor performance, and enhancing statistical stability can rectify this problem. Inspired by that, a simple yet effective approach is proposed, termed group sampling, which gathers samples from the same class into groups. The model is thereby trained using normalized group samples, which helps alleviate the negative impact of individual samples. Group sampling updates the pipeline of pseudo-label generation by guaranteeing that samples are more efficiently classified into the correct classes. It regulates the representation learning process, enhancing statistical stability for feature representation in a progressive fashion. Extensive experiments on Market-1501, DukeMTMC-reID and MSMT17 show that group sampling achieves performance comparable to state-of-the-art methods and outperforms the current techniques under purely camera-agnostic settings. Code is available at https://github.com/ucas-vg/GroupSampling.
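
The core idea of gathering samples from the same class into groups can be sketched as an ordering of dataset indices. This is a hypothetical illustration, not the authors' implementation: the function name `group_sampling_order`, the `group_size` parameter and the shuffling scheme are assumptions, and the pseudo labels are taken as given (for example from a clustering step).

```python
import random
from collections import defaultdict


def group_sampling_order(pseudo_labels, group_size, seed=0):
    """Return dataset indices ordered so that same-class samples are adjacent.

    pseudo_labels: list where pseudo_labels[i] is the class of sample i.
    group_size: maximum number of same-class samples per group.
    """
    rng = random.Random(seed)

    # Bucket sample indices by their pseudo label.
    by_class = defaultdict(list)
    for index, label in enumerate(pseudo_labels):
        by_class[label].append(index)

    # Split each class into groups of at most `group_size` samples.
    groups = []
    for indices in by_class.values():
        rng.shuffle(indices)
        for start in range(0, len(indices), group_size):
            groups.append(indices[start:start + group_size])

    # Shuffle whole groups so classes are mixed across an epoch,
    # while samples inside a group still share one class.
    rng.shuffle(groups)
    return [index for group in groups for index in group]


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 0, 0]
    print(group_sampling_order(labels, group_size=2))
```

Feeding the model consecutive same-class samples means each update is driven by a group statistic rather than by a single, possibly mislabeled, sample.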

Abstract:
While deep models have proved successful in learning rich knowledge from massive well-annotated data, they may pose a privacy leakage risk in practical deployment. It is necessary to find an effective trade-off between high utility and strong privacy. In this work, we propose a discriminative-generative distillation approach to learn privacy-preserving deep models. Our key idea is to take models as a bridge to distill knowledge from private data and then transfer it to learn a student network via two streams. First, the discriminative stream trains a baseline classifier on private data and an ensemble of teachers on multiple disjoint private subsets, respectively. Then, the generative stream takes the classifier as a fixed discriminator and trains a generator in a data-free manner. After that, the generator is used to generate massive synthetic data, which are further applied to train a variational autoencoder (VAE). Among these synthetic data, a few are fed into the teacher ensemble to query labels via differentially private aggregation, while most are embedded into the trained VAE for reconstructing synthetic data. Finally, semi-supervised student learning is performed to simultaneously handle two tasks: knowledge transfer from the teachers with distillation on a few privately labeled synthetic data, and knowledge enhancement with tangent-normal adversarial regularization on many triples of reconstructed synthetic data. In this way, our approach can control query cost over private data and mitigate accuracy degradation in a unified manner, leading to a privacy-preserving student model. Extensive experiments and analysis clearly show the effectiveness of the proposed approach.

Abstract:
Pooling layers are essential building blocks of convolutional neural networks (CNNs), used to reduce computational overhead and increase the receptive fields of subsequent convolutional operations. Their goal is to produce downsampled volumes that closely resemble the input volume while, ideally, also being computationally and memory efficient. Meeting both these requirements remains a challenge. To this end, we propose an adaptive and exponentially weighted pooling method: adaPool. Our method learns a regional-specific fusion of two sets of pooling kernels that are based on the exponent of the Dice-Sørensen coefficient and the exponential maximum, respectively. AdaPool improves the preservation of detail on a range of tasks including image and video classification and object detection. A key property of adaPool is its bidirectional nature. In contrast to common pooling methods, the learned weights can also be used to upsample activation maps. We term this method adaUnPool. We evaluate adaUnPool on image and video super-resolution and frame interpolation. For benchmarking, we introduce Inter4K, a novel high-quality, high frame-rate video dataset. Our experiments demonstrate that adaPool systematically achieves better results across tasks and backbones, while introducing a minor additional computational and memory overhead.
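
One of the two kernel families named above, the exponential maximum, can be illustrated as softmax-weighted pooling over non-overlapping regions. The sketch below covers only that component under stated assumptions: the function name `exp_max_pool2d` and the fixed non-overlapping windows are choices made for this example, and the Dice-Sørensen-based kernel and the learned fusion of the two are not shown.

```python
import math


def exp_max_pool2d(activation, size=2):
    """Exponentially weighted pooling over non-overlapping size x size regions.

    activation: list of rows of floats.
    Each output is sum_i softmax(a)_i * a_i over the region, which lies
    between the region's mean and its maximum.
    """
    rows, cols = len(activation), len(activation[0])
    pooled = []
    for i in range(0, rows - size + 1, size):
        out_row = []
        for j in range(0, cols - size + 1, size):
            region = [
                activation[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            # Subtract the peak before exponentiating for numerical stability.
            peak = max(region)
            weights = [math.exp(v - peak) for v in region]
            total = sum(weights)
            out_row.append(sum(w * v for w, v in zip(weights, region)) / total)
        pooled.append(out_row)
    return pooled


if __name__ == "__main__":
    feature_map = [
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 10.0, 1.0, 1.0],
    ]
    print(exp_max_pool2d(feature_map))
```

Large activations dominate the weighted sum without the smaller ones being discarded, which is why such kernels retain more detail than max pooling while remaining differentiable everywhere.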

Abstract:
Recent studies of video action recognition can be classified into two categories: appearance-based methods and pose-based methods. Appearance-based methods generally cannot model the temporal dynamics of large motion well through optical flow estimation, while pose-based methods ignore visual context information such as typical scenes and objects, which are also important cues for action understanding. In this paper, we tackle these problems by proposing a Pose-Appearance Relational Network (PARNet), which models the correlation between human pose and image appearance, and combines the benefits of these two modalities to improve the robustness towards unconstrained real-world videos. There are three network streams in our model, namely the pose stream, the appearance stream and the relation stream. For the pose stream, a Temporal Multi-Pose RNN module is constructed to obtain the dynamic representations through temporal modeling of 2D poses. For the appearance stream, a Spatial Appearance CNN module is employed to extract the global appearance representation of the video sequence. For the relation stream, a Pose-Aware RNN module is built to connect the pose and appearance streams by modeling action-sensitive visual context information. Through jointly optimizing the three modules, PARNet achieves superior performance compared with state-of-the-art methods on both the pose-complete datasets (KTH, Penn-Action, UCF11) and the challenging pose-incomplete datasets (UCF101, HMDB51, JHMDB), demonstrating its robustness towards complex environments and noisy skeletons. Its effectiveness on the NTU-RGBD dataset is also validated, even compared with 3D skeleton-based methods. Furthermore, an appearance-enhanced PARNet equipped with an RGB-based I3D stream is proposed, which outperforms the Kinetics pre-trained competitors on UCF101 and HMDB51. These improved experimental results verify the potential of our framework for integrating various modules.

Abstract:
Few-shot object detection (FSOD), which aims at learning a generic detector that can adapt to unseen tasks with scarce training samples, has witnessed consistent improvement recently. However, most existing methods ignore the efficiency issues, e.g., high computational complexity and slow adaptation speed. Notably, efficiency has become an increasingly important evaluation metric for few-shot techniques due to an emerging trend toward embedded AI. To this end, we present an efficient pretrain-transfer framework (PTF) baseline with no computational increment, which achieves comparable results with previous state-of-the-art (SOTA) methods. Upon this baseline, we devise an initializer named knowledge inheritance (KI) to reliably initialize the novel weights for the box classifier, which effectively facilitates the knowledge transfer process and boosts the adaptation speed. Within the KI initializer, we propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights. Finally, our approach not only achieves the SOTA results across three public benchmarks, i.e., PASCAL VOC, COCO and LVIS, but also exhibits high efficiency with 1.8-100× faster adaptation speed against the other methods on COCO/LVIS benchmark during few-shot transfer. To our best knowledge, this is the first work to consider the efficiency problem in FSOD. We hope to motivate a trend toward powerful yet efficient few-shot technique development. The codes are publicly available at https://github.com/Ze-Yang/Efficient-FSOD.

Abstract:
Human motion segmentation (HMS) aims to segment a long human action video into a set of short and meaningful action clips. Existing supervised learning approaches need a large amount of training data, which may be costly in real-world scenarios, while most unsupervised clustering methods cannot fully explore the temporal correlations among human motions and struggle to achieve promising performance. In this paper, we design a novel unsupervised framework, called Velocity-Sensitive Dual-Side Auto-Encoder (VSDA), for the HMS task. Specifically, a multi-neighbor auto-encoder (MNA) is proposed to extract informative temporal features, which fully explores the local temporal patterns of human motions. In addition, a long-short distance encoding (LSE) strategy is designed. It constrains the encoded representations of close (short-distance) frames to be similar and the representations of far-away (long-distance) frames to be distinctive. Similarly, this strategy is also deployed on the decoded outputs as the long-short distance decoding (LSD) module. The LSE/LSD guides the learning process explicitly and implicitly to achieve the dual-side structure. Moreover, we consider the energy variations during human motion to propose the velocity-sensitive (VS) guidance mechanism for further model improvement. VSDA leverages the temporal characteristics of human motion and achieves promising HMS performance. Comprehensive experiments on six real-world human motion datasets illustrate the effectiveness of our proposed model.

Abstract:
Copy prediction is a renowned category of prediction techniques in video coding where the current block is predicted by copying the samples from a similar block that is present somewhere in the already decoded stream of samples. Motion-compensated prediction, intra block copy, template matching prediction etc. are examples. While the displacement information of the similar block is transmitted to the decoder in the bit-stream in the first two approaches, it is derived at the decoder in the last one by repeating the same search algorithm which was carried out at the encoder. Region-based template matching is a recently developed prediction algorithm that is an advanced form of standard template matching. In this method, the reference area is partitioned into multiple regions and the region to be searched for the similar block(s) is conveyed to the decoder in the bit-stream. Further, its final prediction signal is a linear combination of already decoded similar blocks from the given region. It was demonstrated in previous publications that region-based template matching is capable of achieving coding efficiency improvements for intra as well as inter-picture coding with considerably less decoder complexity than conventional template matching. In this paper, a theoretical justification for region-based template matching prediction subject to experimental data is presented. Additionally, the test results of the aforementioned method on the latest H.266/Versatile Video Coding (VVC) test model (version VTM-14.0) yield an average Bjøntegaard-Delta (BD) bit-rate savings of -0.75% using all intra (AI) configuration with 130% encoder run-time and 104% decoder run-time for a particular parameter selection.

Abstract:
Characterized by tremendous spectral information, hyperspectral images are able to detect subtle changes and discriminate various change classes for change detection. Recent research, however, is dominated by hyperspectral binary change detection, which cannot provide fine change class information. Moreover, most methods that incorporate spectral unmixing for hyperspectral multiclass change detection (HMCD) neglect temporal correlation and suffer from error accumulation. In this study, we propose an unsupervised Binary Change Guided hyperspectral multiclass change detection Network (BCG-Net) for HMCD, which aims at boosting the multiclass change detection result and the unmixing result with mature binary change detection approaches. In BCG-Net, a novel partial-siamese united-unmixing module is designed for multi-temporal spectral unmixing, and a groundbreaking temporal correlation constraint directed by the pseudo-labels of the binary change detection result is developed to guide the unmixing process from the perspective of change detection, encouraging the abundance of the unchanged pixels to be more coherent and that of the changed pixels to be more accurate. Moreover, an innovative binary change detection rule is put forward to deal with the problem that the traditional rule is susceptible to numerical values. The iterative optimization of the spectral unmixing process and the change detection process is proposed to eliminate the accumulated errors and bias propagated from the unmixing result to the change detection result. The experimental results demonstrate that our proposed BCG-Net achieves comparable or even superior multiclass change detection performance relative to state-of-the-art approaches and obtains better spectral unmixing results at the same time.

Abstract:
Anomaly detection is important in many real-life applications. Recently, self-supervised learning has greatly helped deep anomaly detection by recognizing several geometric transformations. However, these methods lack finer features, usually depend heavily on the anomaly type, and do not perform well on fine-grained problems. To address these issues, we first introduce three novel and efficient discriminative and generative tasks with complementary strengths: (i) a piece-wise jigsaw puzzle task that focuses on structure cues; (ii) a tint rotation recognition task applied within each piece, which takes colorimetry information into account; and (iii) a partial re-colorization task that considers the image texture. To make the re-colorization task more object-oriented than background-oriented, we propose to include the contextual color information of the image border via an attention mechanism. We then present a new out-of-distribution detection function and highlight its better stability compared to existing methods. Along with it, we also experiment with different score fusion functions. Finally, we evaluate our method on an extensive protocol composed of various anomaly types, ranging from object anomalies and style anomalies with fine-grained classification to local anomalies with face anti-spoofing datasets. Our model significantly outperforms the state of the art, with up to 36% relative error improvement on object anomalies and 40% on face anti-spoofing problems.

Abstract:
Unsupervised feature selection chooses a subset of discriminative features to reduce feature dimension under the unsupervised learning paradigm. Although considerable effort has been made so far, existing solutions perform feature selection either without any label guidance or with only single pseudo label guidance. This may cause significant information loss and leave the selected features semantically deficient, because many real-world data, such as images and videos, are generally annotated with multiple labels. In this paper, we propose a new Unsupervised Adaptive Feature Selection with Binary Hashing (UAFS-BH) model, which learns binary hash codes as weakly-supervised multi-labels and simultaneously exploits the learned labels to guide feature selection. Specifically, to exploit the discriminative information in unsupervised scenarios, the weakly-supervised multi-labels are learned automatically by imposing binary hash constraints on the spectral embedding process to guide the ultimate feature selection. The number of weakly-supervised multi-labels (the number of 1s in the binary hash codes) is adaptively determined according to the specific data content. Further, to enhance the discriminative capability of the binary labels, we model the intrinsic data structure by adaptively constructing a dynamic similarity graph. Finally, we extend UAFS-BH to the multi-view setting as Multi-view Feature Selection with Binary Hashing (MVFS-BH) to handle the multi-view feature selection problem. An effective binary optimization method based on the Augmented Lagrangian Multiplier (ALM) is derived to iteratively solve the formulated problem. Extensive experiments on widely tested benchmarks demonstrate the state-of-the-art performance of the proposed method on both single-view and multi-view feature selection tasks. For the purpose of reproducibility, we provide the source code and testing datasets at https://github.com/shidan0122/UMFS.git.

Abstract:
Few-shot Class-Incremental Learning (FSCIL) aims at learning new concepts continually with only a few samples, which makes it prone to catastrophic forgetting and overfitting. The inaccessibility of old classes and the scarcity of novel samples make it difficult to balance retaining old knowledge against learning novel concepts. Inspired by the observation that different models memorize different knowledge when learning novel concepts, we propose a Memorizing Complementation Network (MCNet) that ensembles multiple models so that their differently memorized knowledge complements each other in novel tasks. Additionally, to update the model with few novel samples, we develop a Prototype Smoothing Hard-mining Triplet (PSHT) loss that pushes the novel samples away not only from each other in the current task but also from the old distribution. Extensive experiments on three benchmark datasets, namely CIFAR100, miniImageNet and CUB200, demonstrate the superiority of our proposed method.

Abstract:
Transformers, which treat an image as a sequence of patches and learn robust global features from the sequence, are increasingly popular in computer vision. However, pure transformers are not entirely suitable for vehicle re-identification, because vehicle re-identification requires both robust global features and discriminative local features. To this end, a graph interactive transformer (GiT) is proposed in this paper. In the macro view, a list of GiT blocks is stacked to build a vehicle re-identification model, in which graphs extract discriminative local features within patches and transformers extract robust global features among patches. In the micro view, graphs and transformers interact with each other, bringing effective cooperation between local and global features. Specifically, the current graph is embedded after the former level’s graph and transformer, while the current transformer is embedded after the current graph and the former level’s transformer. In addition to the interaction between graphs and transformers, the graph is a newly designed local correction graph, which learns discriminative local features within a patch by exploring nodes’ relationships. Extensive experiments on three large-scale vehicle re-identification datasets demonstrate that our GiT method is superior to state-of-the-art vehicle re-identification approaches.

Abstract:
This paper addresses the problem of face video inpainting. Existing video inpainting methods primarily target natural scenes with repetitive patterns. They do not make use of any prior knowledge of the face to help retrieve correspondences for the corrupted face. They therefore achieve only sub-optimal results, particularly for faces under large pose and expression variations, where face components appear very different across frames. In this paper, we propose a two-stage deep learning method for face video inpainting. We employ 3DMM as our 3D face prior to transform a face between the image space and the UV (texture) space. In Stage I, we perform face inpainting in the UV space. This largely removes the influence of face poses and expressions and makes the learning task much easier with well-aligned face features. We introduce a frame-wise attention module to fully exploit correspondences in neighboring frames to assist the inpainting task. In Stage II, we transform the inpainted face regions back to the image space and perform face video refinement, which inpaints any background regions not covered in Stage I and also refines the inpainted face regions. Extensive experiments show that our method can significantly outperform methods based merely on 2D information, especially for faces under large pose and expression variations. Project page: https://ywq.github.io/FVIP.

Abstract:
In the past several years, various adversarial training (AT) approaches have been invented to make deep learning models robust against adversarial attacks. However, mainstream AT methods assume that the training and testing data are drawn from the same distribution and that the training data are annotated. When these two assumptions are violated, existing AT methods fail because either they cannot pass knowledge learnt from a source domain to an unlabeled target domain or they are confused by the adversarial samples in that unlabeled space. In this paper, we first point out this new and challenging problem: adversarial training in an unlabeled target domain. We then propose a novel framework named Unsupervised Cross-domain Adversarial Training (UCAT) to address this problem. UCAT effectively leverages the knowledge of the labeled source domain to prevent the adversarial samples from misleading the training process, under the guidance of automatically selected high-quality pseudo labels of the unannotated target domain data together with the discriminative and robust anchor representations of the source domain data. Experiments on four public benchmarks show that models trained with UCAT can achieve both high accuracy and strong robustness. The effectiveness of the proposed components is demonstrated through a large set of ablation studies. The source code is publicly available at https://github.com/DIAL-RPI/UCAT.

Abstract:
Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-resolution (LR) scene text images while simultaneously boosting the performance of text recognition. However, most existing STISR methods regard text images as natural scene images, ignoring the categorical information of text. In this paper, we attempt to embed a text recognition prior into the STISR model. Specifically, we adopt the predicted character recognition probability sequence as the text prior, which can be obtained conveniently from a text recognition model. The text prior provides categorical guidance to recover high-resolution (HR) text images. In turn, the reconstructed HR image can refine the text prior. Finally, we present a multi-stage text prior guided super-resolution (TPGSR) framework for STISR. Our experiments on the benchmark TextZoom dataset show that TPGSR not only effectively improves the visual quality of scene text images but also significantly improves text recognition accuracy over existing STISR methods. Our model trained on TextZoom also demonstrates a certain generalization capability to LR images in other datasets. The source code of our work is available at: https://github.com/mjq11302010044/TPGSR.

Abstract:
Full-reference image quality measures are a fundamental tool to approximate the human visual system in various applications for digital data management: from retrieval to compression to detection of unauthorized uses. Inspired by both the effectiveness and the simplicity of hand-crafted Structural Similarity Index Measure (SSIM), in this work, we present a framework for the formulation of SSIM-like image quality measures through genetic programming. We explore different terminal sets, defined from the building blocks of structural similarity at different levels of abstraction, and we propose a two-stage genetic optimization that exploits hoist mutation to constrain the complexity of the solutions. Our optimized measures are selected through a cross-dataset validation procedure, which results in superior performance against different versions of structural similarity, measured as correlation with human mean opinion scores. We also demonstrate how, by tuning on specific datasets, it is possible to obtain solutions that are competitive with (or even outperform) more complex image quality measures.

Abstract:
Domain generalization (DG) aims to learn transferable knowledge from multiple source domains and generalize it to an unseen target domain. To this end, the intuitive solution is to seek domain-invariant representations via a generative adversarial mechanism or the minimization of cross-domain discrepancy. However, the widespread problem of imbalanced data scale across source domains and categories in real-world applications becomes the key bottleneck for improving the generalization ability of a model, owing to its negative effect on learning a robust classification model. Motivated by this observation, we first formulate a practical and challenging imbalance domain generalization (IDG) scenario, and then propose a straightforward but effective method, the generative inference network (GINet), which augments reliable samples for minority domains/categories to promote the discriminative ability of the learned model. Concretely, GINet uses the available cross-domain images from the same category and estimates their common latent variable, which helps discover domain-invariant knowledge for the unseen target domain. According to these latent variables, GINet further generates novel samples under an optimal transport constraint and deploys them to enhance the robustness and generalization ability of the desired model. Extensive empirical analysis and ablation studies on three popular benchmarks under normal DG and IDG setups suggest the advantage of our method over other DG methods in improving model generalization. The source code is available on GitHub: https://github.com/HaifengXia/IDG.

Affiliations: Center for Applied Mathematics, Tianjin University, Tianjin, China; Department of Mathematical Sciences, Liverpool Centre of Mathematics for Healthcare and Centre for Mathematical Imaging Techniques, University of Liverpool, Liverpool, U.K.; Department of Computer Science and Engineering, Guangdong Key Laboratory of Brain-Inspired Intelligent Computation, Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology, Shenzhen, China

Abstract:
Geometric high-order regularization methods, such as mean curvature and Gaussian curvature, have been intensively studied during the last decades due to their ability to preserve geometric properties including image edges, corners, and contrast. However, the dilemma between restoration quality and computational efficiency is an essential roadblock for high-order methods. In this paper, we propose fast multi-grid algorithms for minimizing both mean curvature and Gaussian curvature energy functionals without sacrificing accuracy for efficiency. Unlike the existing approaches based on operator splitting and the Augmented Lagrangian method (ALM), no artificial parameters are introduced in our formulation, which guarantees the robustness of the proposed algorithm. Meanwhile, we adopt the domain decomposition method to promote parallel computing and use the fine-to-coarse structure to accelerate convergence. Numerical experiments are presented on image denoising, CT, and MRI reconstruction problems to demonstrate the superiority of our method in preserving geometric structures and fine details. The proposed method is also shown to be effective in dealing with large-scale image processing problems, recovering an image of size 1024×1024 within 40 s, while the ALM-based method requires around 200 s.

Abstract:
With the development of video networks, image set classification (ISC) has received a lot of attention and can be used for various practical applications, such as video-based recognition and action recognition. Although the existing ISC methods have obtained promising performance, they often have extremely high complexity. Owing to its advantages in storage space and computational cost, learning to hash has become a powerful solution. However, existing hashing methods often ignore the complex structural information and hierarchical semantics of the original features. They usually adopt a single-layer hashing strategy to transform high-dimensional data into short binary codes in one step. This sudden drop in dimension could result in the loss of advantageous discriminative information. In addition, they do not take full advantage of the intrinsic semantic knowledge of the whole gallery sets. To tackle these problems, in this paper, we propose a novel Hierarchical Hashing Learning (HHL) method for ISC. Specifically, a coarse-to-fine hierarchical hashing scheme is proposed that uses a two-layer hash function to gradually refine the beneficial discriminative information in a layer-wise fashion. Besides, to alleviate the effects of redundant and corrupted features, we impose the $\ell_{2,1}$ norm on the layer-wise hash function. Moreover, we adopt a bidirectional semantic representation with an orthogonal constraint to adequately preserve the intrinsic semantic information of all samples in the whole image sets. Comprehensive experiments demonstrate that HHL achieves significant improvements in accuracy and running time. We will release the demo code on https://github.com/sunyuan-cs.
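
As a brief illustration of the sparsity-inducing norm mentioned above, the following minimal sketch computes the $\ell_{2,1}$ norm of a matrix as the sum of the Euclidean norms of its rows. The function name `l21_norm` and the row-wise convention are assumptions made for illustration (some works apply it column-wise); this is not the authors' code.

```python
import math

def l21_norm(matrix):
    """Sum of the Euclidean (l2) norms of the rows of `matrix`.

    `matrix` is a list of equal-length lists of numbers. Rows that are
    entirely zero contribute nothing, which is why penalizing this norm
    tends to switch off whole rows (e.g. redundant features).
    """
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)

# Hypothetical 3x2 projection matrix: the middle row is already "switched off".
W = [[3.0, 4.0],
     [0.0, 0.0],
     [5.0, 12.0]]
print(l21_norm(W))
```

Under this convention, the example matrix has row norms 5, 0 and 13.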

Affiliations: National Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University, Beijing, China; Pattern Recognition and Intelligent System Laboratory, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing, China; Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan

Abstract:
Photos taken with digital cameras often contain saturated pixels due to the limited dynamic range of the sensor, and restoring them is a highly ill-posed problem. Capturing multiple low dynamic range images with bracketed exposures can make the problem less ill-posed; however, it is prone to ghosting artifacts caused by spatial misalignment among images. A polarization camera can capture four spatially-aligned and temporally-synchronized polarized images with different polarizer angles in a single shot, which can be used for ghost-free high dynamic range (HDR) reconstruction. However, real-world scenarios remain challenging, since existing polarization-based HDR reconstruction methods treat all pixels in the same manner and only utilize the spatially-variant exposures of the polarized images. They do not fully exploit the degree of polarization (DoP) and the angle of polarization (AoP) of the light incident on the sensor, which encode abundant structural and contextual information about the scene, and therefore still handle the problem in an ill-posed manner. In this paper, we propose a pixel-wise depolarization strategy to solve the polarization-guided HDR reconstruction problem by classifying the pixels based on their levels of ill-posedness in the HDR reconstruction procedure and applying different solutions to different classes. To apply the strategy with better generalization ability and higher robustness, we propose a network-physics-hybrid polarization-based HDR reconstruction pipeline along with a neural network tailored to it, fully exploiting the DoP and AoP. Experimental results show that our approach achieves state-of-the-art performance on both synthetic and real-world images.

Abstract:
Due to the difficulty of collecting paired Low-Resolution (LR) and High-Resolution (HR) images, recent research on single image Super-Resolution (SR) has often been criticized for the data bottleneck of synthetic image degradation between LR and HR images. Recently, the emergence of real-world SR datasets, e.g., RealSR and DRealSR, has promoted the exploration of Real-World image Super-Resolution (RWSR). RWSR exposes a more practical image degradation, which greatly challenges the capacity of deep neural networks to reconstruct high-quality images from low-quality images collected in realistic scenarios. In this paper, we explore Taylor series approximation in prevalent deep neural networks for image reconstruction, and propose a very general Taylor architecture to derive Taylor Neural Networks (TNNs) in a principled manner. Our TNN builds Taylor Modules with Taylor Skip Connections (TSCs) to approximate the feature projection functions, following the spirit of Taylor series. TSCs connect the input directly to each layer in order to sequentially produce different high-order Taylor maps that attend to more image details, and then aggregate the different high-order information from different layers. Using only simple skip connections, TNN is compatible with various existing neural networks and effectively learns high-order components of the input image with little increase in parameters. Furthermore, we have conducted extensive experiments to evaluate our TNNs with different backbones on two RWSR benchmarks, where they achieve superior performance in comparison with existing baseline methods.

Abstract:
Knowledge amalgamation (KA) is a novel deep model reuse task that aims to transfer knowledge from several well-trained teachers to a multi-talented and compact student. Currently, most of these approaches are tailored for convolutional neural networks (CNNs). However, Transformers, with a completely different architecture, are starting to challenge the domination of CNNs in many computer vision tasks, and directly applying the previous KA methods to Transformers leads to severe performance degradation. In this work, we explore a more effective KA scheme for Transformer-based object detection models. Specifically, considering the architectural characteristics of Transformers, we propose to decompose KA into two aspects: sequence-level amalgamation (SA) and task-level amalgamation (TA). In particular, a hint is generated within the sequence-level amalgamation by concatenating teacher sequences instead of redundantly aggregating them into a fixed-size one as in previous KA approaches. Besides, the student efficiently learns heterogeneous detection tasks through soft targets in the task-level amalgamation. Extensive experiments on PASCAL VOC and COCO show that the sequence-level amalgamation significantly boosts the performance of students, while the previous methods impair the students. Moreover, the Transformer-based students excel in learning amalgamated knowledge, as they master heterogeneous detection tasks rapidly and achieve performance superior or at least comparable to that of the teachers in their specializations.

Abstract:
Visual intention understanding is the task of exploring the potential and underlying meaning expressed in images. Simply modeling the objects or backgrounds within the image content leads to unavoidable comprehension bias. To alleviate this problem, this paper proposes a Cross-modality Pyramid Alignment with Dynamic optimization (CPAD) to enhance the global understanding of visual intention with hierarchical modeling. The core idea is to exploit the hierarchical relationship between visual content and textual intention labels. For visual hierarchy, we formulate the visual intention understanding task as a hierarchical classification problem, capturing multiple granular features in different layers, which corresponds to hierarchical intention labels. For textual hierarchy, we directly extract the semantic representation from intention labels at different levels, which supplements the visual content modeling without extra manual annotations. Moreover, to further narrow the domain gap between different modalities, a cross-modality pyramid alignment module is designed to dynamically optimize the performance of visual intention understanding in a joint learning manner. Comprehensive experiments intuitively demonstrate the superiority of our proposed method, outperforming existing visual intention understanding methods.

Abstract:
We present Twist, a simple and theoretically explainable self-supervised representation learning method by classifying large-scale unlabeled datasets in an end-to-end way. We employ a siamese network terminated by a softmax operation to produce twin class distributions of two augmented images. Without supervision, we enforce the class distributions of different augmentations to be consistent. However, simply minimizing the divergence between augmentations will generate collapsed solutions, i.e., outputting the same class distribution for all images. In this case, little information about the input images is preserved. To solve this problem, we propose to maximize the mutual information between the input image and the output class predictions. Specifically, we minimize the entropy of the distribution for each sample to make the class prediction assertive, and maximize the entropy of the mean distribution to make the predictions of different samples diverse. In this way, Twist can naturally avoid the collapsed solutions without specific designs such as asymmetric network, stop-gradient operation, or momentum encoder. As a result, Twist outperforms previous state-of-the-art methods on a wide range of tasks. Specifically on the semi-supervised classification task, Twist achieves 61.2% top-1 accuracy with 1% ImageNet labels using a ResNet-50 as backbone, surpassing previous best results by an improvement of 6.2%. Codes and pre-trained models are available at https://github.com/bytedance/TWIST
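
The three terms described above (consistency between the twin distributions, low per-sample entropy, high entropy of the mean distribution) can be sketched in a few lines. This is a minimal illustration on plain Python lists, assuming equal weights for the three terms and a symmetric KL divergence as the consistency measure; the function names are hypothetical and the released Twist code may differ in weighting and details.

```python
import math

EPS = 1e-12  # small constant to avoid log(0)

def entropy(p):
    """Shannon entropy of one class distribution."""
    return -sum(x * math.log(x + EPS) for x in p)

def kl(p, q):
    """KL divergence KL(p || q) between two class distributions."""
    return sum(x * math.log((x + EPS) / (y + EPS)) for x, y in zip(p, q))

def mean_distribution(batch):
    """Average class distribution over a batch of distributions."""
    n = len(batch)
    return [sum(col) / n for col in zip(*batch)]

def twin_loss(p1, p2):
    """p1, p2: softmax outputs (lists of distributions) for two augmentations."""
    n = len(p1)
    # 1) the two views of the same image should agree
    consistency = sum(0.5 * (kl(a, b) + kl(b, a)) for a, b in zip(p1, p2)) / n
    # 2) each prediction should be assertive (low per-sample entropy)
    sharpness = sum(0.5 * (entropy(a) + entropy(b)) for a, b in zip(p1, p2)) / n
    # 3) predictions should be diverse across samples (high entropy of the mean)
    diversity = 0.5 * (entropy(mean_distribution(p1)) + entropy(mean_distribution(p2)))
    return consistency + sharpness - diversity

collapsed = [[1.0, 0.0], [1.0, 0.0]]  # every image mapped to the same class
diverse = [[1.0, 0.0], [0.0, 1.0]]    # assertive and spread over classes
print(twin_loss(collapsed, collapsed), twin_loss(diverse, diverse))
```

In this toy batch the collapsed solution is penalized relative to the diverse one because its mean distribution has no entropy, which is the mechanism the abstract credits for avoiding collapse.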

Abstract:
Effective assisted living environments must be able to infer how their occupants interact in a variety of scenarios. Gaze direction provides strong indications of how a person engages with the environment and its occupants. In this paper, we investigate the problem of gaze tracking in multi-camera assisted living environments. We propose a gaze tracking method based on predictions generated by a neural network regressor that relies only on the relative positions of facial keypoints to estimate gaze. For each gaze prediction, our regressor also provides an estimate of its own uncertainty, which is used to weigh the contribution of previously estimated gazes within a tracking framework based on an angular Kalman filter. Our gaze estimation neural network uses confidence gated units to alleviate keypoint prediction uncertainties in scenarios involving partial occlusions or unfavorable views of the subjects. We evaluate our method using videos from the MoDiPro dataset, which we acquired in a real assisted living facility, and on the publicly available MPIIFaceGaze, GazeFollow, and Gaze360 datasets. Experimental results show that our gaze estimation network outperforms sophisticated state-of-the-art methods, while additionally providing uncertainty predictions that are highly correlated with the actual angular error of the corresponding estimates. Finally, an analysis of the temporal integration performance of our method demonstrates that it generates accurate and temporally stable gaze predictions.
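
To make the idea of weighting measurements by their predicted uncertainty concrete, here is a heavily simplified sketch of one update step of a scalar Kalman filter on an angle. It assumes a one-dimensional gaze angle in radians and a constant-state model; the function names and parameters are illustrative and do not reproduce the tracking framework of the paper.

```python
import math

def wrap_angle(a):
    """Map an angle in radians to the interval (-pi, pi]."""
    return math.atan2(math.sin(a), math.cos(a))

def angular_kalman_update(theta, var, z, z_var):
    """Fuse the tracked angle `theta` (variance `var`) with a new
    measurement `z` whose variance `z_var` comes from the regressor's
    own uncertainty estimate. Returns the updated (angle, variance).
    """
    innovation = wrap_angle(z - theta)   # shortest signed angular difference
    gain = var / (var + z_var)           # uncertain measurements get a small gain
    new_theta = wrap_angle(theta + gain * innovation)
    new_var = (1.0 - gain) * var
    return new_theta, new_var

# A confident measurement moves the estimate more than an uncertain one.
print(angular_kalman_update(0.0, 1.0, 1.0, 0.1))
print(angular_kalman_update(0.0, 1.0, 1.0, 10.0))
```

Wrapping the innovation keeps the filter from taking the long way around the circle when the estimate and the measurement lie on opposite sides of the ±π boundary.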

Abstract:
Few-shot learning is proposed to tackle the problem of scarce training data in novel classes. However, prior works in instance-level few-shot learning have paid less attention to effectively utilizing the relationships between categories. In this paper, we exploit hierarchical information to leverage discriminative and relevant features of base classes to effectively classify novel objects. These features are extracted from the abundant data of base classes and can be used to reasonably describe classes with scarce data. Specifically, we propose a novel superclass approach that automatically creates a hierarchy treating base and novel classes as fine-grained classes for few-shot instance segmentation (FSIS). Based on the hierarchical information, we design a novel framework called Soft Multiple Superclass (SMS) to extract relevant features or characteristics of classes in the same superclass. A new class assigned to the superclass is easier to classify by leveraging these relevant features. Besides, to effectively train the hierarchy-based detector in FSIS, we apply label refinement to further describe the associations between fine-grained classes. Extensive experiments demonstrate the effectiveness of our method on FSIS benchmarks. The source code is available here: https://github.com/nvakhoa/superclass-FSIS

Abstract:
Night-Time Scene Parsing (NTSP) is essential to many vision applications, especially for autonomous driving. Most of the existing methods are proposed for day-time scene parsing. They rely on modeling pixel intensity-based spatial contextual cues under even illumination. Hence, these methods do not perform well in night-time scenes as such spatial contextual cues are buried in the over-/under-exposed regions in night-time scenes. In this paper, we first conduct an image frequency-based statistical experiment to interpret the day-time and night-time scene discrepancies. We find that image frequency distributions differ significantly between day-time and night-time scenes, and understanding such frequency distributions is critical to NTSP problem. Based on this, we propose to exploit the image frequency distributions for night-time scene parsing. First, we propose a Learnable Frequency Encoder (LFE) to model the relationship between different frequency coefficients to measure all frequency components dynamically. Second, we propose a Spatial Frequency Fusion module (SFF) that fuses both spatial and frequency information to guide the extraction of spatial context features. Extensive experiments show that our method performs favorably against the state-of-the-art methods on the NightCity, NightCity+ and BDD100K-night datasets. In addition, we demonstrate that our method can be applied to existing day-time scene parsing methods and boost their performance on night-time scenes. The code is available at https://github.com/wangsen99/FDLNet.

Abstract:
We study the use of predictive approaches alongside the region-adaptive hierarchical transform (RAHT) in attribute compression of dynamic point clouds. The use of intra-frame prediction with RAHT was shown to improve attribute compression performance over pure RAHT and represents the state-of-the-art in attribute compression of point clouds, being part of MPEG’s geometry-based test model. We studied a combination of inter-frame and intra-frame prediction for RAHT for the compression of dynamic point clouds. An adaptive zero-motion-vector (ZMV) scheme and an adaptive motion-compensated scheme are developed. The simple adaptive ZMV approach is able to achieve sizable gains over pure RAHT and over the intra-frame predictive RAHT (I-RAHT) for point clouds with little or no motion while ensuring similar compression performance to I-RAHT for point clouds with intense motion. The motion-compensated approach, more complex and more powerful, is able to achieve large gains across all of the tested dynamic point clouds.

Abstract:
Non-maximum suppression (NMS) is a post-processing step in almost every visual object detector. NMS aims to prune the overlapping detected candidate regions-of-interest (RoIs) on an image, in order to assign a single, spatially accurate detection to each object. The default NMS algorithm (GreedyNMS) is fairly simple and suffers from severe drawbacks due to its need for manual tuning. A typical failure case with high application relevance is pedestrian/person detection in the presence of occlusions, where GreedyNMS does not provide accurate results. This paper proposes an efficient deep neural architecture for NMS in the person detection scenario, which captures relations between neighboring RoIs and aims to assign precisely one detection per person. The presented Seq2Seq-NMS architecture assumes a sequence-to-sequence formulation of the NMS problem, exploits the Multihead Scaled Dot-Product Attention mechanism, and jointly processes both geometric and visual properties of the input candidate RoIs. Thorough experimental evaluation on three public person detection datasets shows favourable results against competing methods, with acceptable inference runtime requirements.
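
For reference, the baseline that the abstract calls GreedyNMS can be sketched as follows. This is a minimal, unoptimized version of the standard greedy procedure with axis-aligned boxes given as `(x1, y1, x2, y2)`; the function names and the default threshold of 0.5 are illustrative choices, and the threshold is exactly the manually tuned parameter the abstract refers to.

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    A box is kept only if it does not overlap any already-kept box
    by more than `iou_threshold`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, iou_threshold=0.5))
print(greedy_nms(boxes, scores, iou_threshold=0.7))
```

The first two boxes overlap with an IoU of about 0.68, so the result flips between suppressing and keeping the second box depending on the threshold. With two occluding pedestrians, no single threshold reliably separates "duplicate detection" from "second person", which is the failure case motivating a learned alternative.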

Abstract:
Robust keypoint detection on omnidirectional images under large perspective variations is a key problem in many computer vision tasks. In this paper, we propose a perspectively equivariant keypoint learning framework named OmniKL to address this problem. Specifically, the framework is composed of a perspective module and a spherical module, each including a keypoint detector specific to the type of the input image and a shared descriptor providing a uniform description for omnidirectional and perspective images. In these detectors, we propose a differentiable candidate position sorting operation for localizing keypoints, which directly sorts the scores of the candidate positions in a differentiable manner and returns the globally top-K keypoints on the image. This approach does not break the differentiability of the two modules, so they are end-to-end trainable. Moreover, we design a novel training strategy combining self-supervised and co-supervised methods to train the framework without any labeled data. Extensive experiments on synthetic and real-world 360° image datasets demonstrate the effectiveness of OmniKL in detecting perspectively equivariant keypoints on omnidirectional images. Our source code is available online at https://github.com/vandeppce/sphkpt.

Abstract:
Change captioning aims to describe the fine-grained change between a pair of images. The pseudo changes caused by viewpoint changes are the most typical distractors in this task, because they lead to feature perturbation and shift for the same objects and thus overwhelm the real change representation. In this paper, we propose a viewpoint-adaptive representation disentanglement network to distinguish real and pseudo changes, and explicitly capture the features of change to generate accurate captions. Concretely, a position-embedded representation learning is devised to help the model adapt to viewpoint changes by mining the intrinsic properties of two image representations and modeling their position information. To learn a reliable change representation for decoding into a natural language sentence, an unchanged representation disentanglement is designed to identify and disentangle the unchanged features between the two position-embedded representations. Extensive experiments show that the proposed method achieves state-of-the-art performance on four public datasets. The code is available at https://github.com/tuyunbin/VARD.

Abstract:
This paper presents a matching network to establish point correspondence between images. We propose a Multi-Arm Network (MAN) capable of learning region overlap and depth, which can greatly improve keypoint matching robustness while adding an extra 50% of computation time during the inference stage. By adopting a different design from the state-of-the-art learning-based SuperGlue pipeline, which requires retraining when a different keypoint detector is adopted, our network can directly work with different keypoint detectors without time-consuming retraining processes. Comprehensive experiments conducted on four public benchmarks involving both outdoor and indoor scenarios demonstrate that our proposed MAN outperforms state-of-the-art methods.

Affiliations: National Engineering Research Center for Multimedia Software, School of Computer Science, Institute of Artificial Intelligence, Wuhan University, Wuhan, China; Hubei Key Laboratory of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan, China; Department of Information Engineering and Computer Science, University of Trento, Trento, Italy; Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan

Abstract:
Recent person Re-IDentification (ReID) systems have been challenged by changes in people's clothing, leading to the study of Cloth-Changing person ReID (CC-ReID). Commonly used techniques involve incorporating auxiliary information (e.g., body masks, gait, skeleton, and keypoints) to accurately identify the target pedestrian. However, the effectiveness of these methods heavily relies on the quality of auxiliary information and comes at the cost of additional computational resources, ultimately increasing system complexity. This paper focuses on achieving CC-ReID by effectively leveraging the information concealed within the image. To this end, we introduce an Auxiliary-free Competitive IDentification (ACID) model. It achieves a win-win situation by enriching the identity (ID)-preserving information conveyed by the appearance and structure features while maintaining holistic efficiency. In detail, we build a hierarchical competitive strategy that progressively accumulates meticulous ID cues with discriminating feature extraction at the global, channel, and pixel levels during model inference. After mining the hierarchical discriminative clues for appearance and structure features, these enhanced ID-relevant features are crosswise integrated to reconstruct images for reducing intra-class variations. Finally, by combining with self- and cross-ID penalties, the ACID is trained under a generative adversarial learning framework to effectively minimize the distribution discrepancy between the generated data and real-world data. Experimental results on four public cloth-changing datasets (i.e., PRCC-ReID, VC-Cloth, LTCC-ReID, and Celeb-ReID) demonstrate that the proposed ACID can achieve superior performance over state-of-the-art methods. The code will be available soon at: https://github.com/BoomShakaY/Win-CCReID.

Affiliations: State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen Campus, Shenzhen, China; College of Engineering and Computer Science, Australian National University, Canberra, ACT, Australia; Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, China

Abstract:
In this paper, we address the problem of video-based rain streak removal by developing an event-aware multi-patch progressive neural network. Rain streaks in video exhibit correlations in both temporal and spatial dimensions. Existing methods have difficulties in modeling these characteristics. Based on this observation, we propose to develop a module encoding events from neuromorphic cameras to facilitate deraining. Events are captured asynchronously at pixel-level only when intensity changes by a margin exceeding a certain threshold. Due to this property, events contain considerable information about moving objects, including rain streaks passing through the camera's view across adjacent frames. Thus we suggest that utilizing them properly improves deraining performance non-trivially. In addition, we develop a multi-patch progressive neural network. The multi-patch manner enables various receptive fields by partitioning patches, and the progressive learning in different patch levels makes the model emphasize each patch level to a different extent. Extensive experiments show that our method guided by events outperforms the state-of-the-art methods by a large margin on synthetic and real-world datasets.
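
The thresholded event generation described above can be illustrated with a frame-based simplification. This is only a sketch: real neuromorphic sensors fire asynchronously per pixel and typically respond to changes in log intensity, whereas the hypothetical `events_between` helper below compares two frames of plain intensities with an assumed threshold of 0.2.

```python
def events_between(prev_frame, next_frame, threshold=0.2):
    """Emit (x, y, polarity) for pixels whose intensity change exceeds threshold.

    Frames are lists of rows of float intensities. Polarity is +1 for a
    brightness increase and -1 for a decrease.
    """
    events = []
    for y, (prev_row, next_row) in enumerate(zip(prev_frame, next_frame)):
        for x, (p, n) in enumerate(zip(prev_row, next_row)):
            delta = n - p
            if abs(delta) > threshold:
                events.append((x, y, 1 if delta > 0 else -1))
    return events


# Hypothetical 2x2 frames: a bright streak appears at (1, 0) and leaves (0, 1).
prev_frame = [[0.5, 0.5], [0.5, 0.5]]
next_frame = [[0.5, 0.9], [0.1, 0.55]]
events = events_between(prev_frame, next_frame)
```

Static background pixels produce no events, so the event stream is dominated by moving content such as rain streaks.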

Abstract:
Although deep learning-based (DL-based) image processing algorithms have achieved superior performance, they are still difficult to apply on mobile devices (e.g., smartphones and cameras) for two reasons: 1) high memory demand and 2) large model size. To adapt DL-based methods to mobile devices, motivated by the characteristics of image signal processors (ISPs), we propose a novel algorithm named LineDL. In LineDL, the default mode of whole-image processing is reformulated as a line-by-line mode, eliminating the need to store large amounts of intermediate data for the whole image. An information transmission module (ITM) is designed to extract and convey the interline correlation and integrate the interline features. Furthermore, we develop a model compression method to reduce the model size while maintaining competitive performance; that is, knowledge is redefined, and compression is performed in two directions. We evaluate LineDL on general image processing tasks, including denoising and superresolution. The extensive experimental results demonstrate that LineDL achieves image quality comparable to that of state-of-the-art (SOTA) DL-based algorithms with a much smaller memory demand and competitive model size.

Abstract:
Saturation information in hazy images is conducive to effective haze removal. However, existing saturation-based dehazing methods focus only on the saturation value of each pixel itself, while the higher-level distribution characteristic between pixels regarding saturation remains to be harnessed. In this paper, we observe that the pixels which share the same surface reflectance coefficient in the local patches of haze-free images exhibit a linear relationship between their saturation component and the reciprocal of their brightness component in the corresponding hazy images normalized by atmospheric light. Furthermore, the intercept of the line described by this linear relationship on the saturation axis is exactly the saturation value of these pixels in the haze-free images. Using this characteristic of saturation, termed saturation line prior (SLP), the transmission estimation is translated into the construction of saturation lines. Accordingly, a new dehazing framework using SLP is proposed, which employs the intrinsic relevance between pixels to achieve a reliable saturation line construction for transmission estimation. This approach can recover the fine details and attain realistic colors from hazy scenes, resulting in a remarkable visibility improvement. Extensive experiments on real-world and synthetic hazy images show that the proposed method performs favorably against state-of-the-art dehazing methods. Code is available at https://github.com/LPengYang/Saturation-Line-Prior.
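
The linear relationship stated above can be checked numerically under a few simplifying assumptions that are not taken from the paper: the standard haze model normalized by atmospheric light, I = J t + (1 - t); a transmission t that is constant within the patch; haze-free pixels that differ only by a shading factor applied to a shared reflectance; and HSV-style definitions (brightness = max channel, saturation = 1 - min/max). Under these assumptions, S = S_J - S_J (1 - t) / V, so the intercept is the haze-free saturation S_J and t = 1 + slope / intercept. All names below are hypothetical.

```python
def hazy_pixel(reflectance, shading, t):
    """Haze model normalized by atmospheric light: I = J * t + (1 - t)."""
    return [shading * r * t + (1.0 - t) for r in reflectance]


def saturation_and_inverse_value(pixel):
    """HSV-style saturation and reciprocal of brightness (max channel)."""
    v = max(pixel)
    return 1.0 - min(pixel) / v, 1.0 / v


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical patch: one reflectance, several shading levels, one transmission.
reflectance = (0.8, 0.5, 0.2)
t_true = 0.6
shadings = [0.4, 0.6, 0.8, 1.0]

points = [saturation_and_inverse_value(hazy_pixel(reflectance, s, t_true))
          for s in shadings]
sats = [s for s, _ in points]
inv_vals = [iv for _, iv in points]

slope, intercept = fit_line(inv_vals, sats)
haze_free_saturation = 1.0 - min(reflectance) / max(reflectance)
t_est = 1.0 + slope / intercept  # transmission recovered from the line
```

In this synthetic patch the fitted intercept matches the haze-free saturation and the recovered transmission matches `t_true`, which is the sense in which transmission estimation reduces to constructing saturation lines.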

Abstract:
Generalized Few-shot Semantic Segmentation (GFSS) aims to segment each image pixel into either base classes with abundant training examples or novel classes with only a handful of (e.g., 1-5) training images per class. Compared to the widely studied Few-shot Semantic Segmentation (FSS), which is limited to segmenting novel classes only, GFSS is much under-studied despite being more practical. The existing approach to GFSS is based on classifier parameter fusion, whereby a newly trained novel class classifier and a pre-trained base class classifier are combined to form a new classifier. As the training data is dominated by base classes, this approach is inevitably biased towards the base classes. In this work, we propose a novel Prediction Calibration Network (PCN) to address this problem. Instead of fusing the classifier parameters, we fuse the scores produced separately by the base and novel classifiers. To ensure that the fused scores are not biased to either the base or novel classes, a new Transformer-based calibration module is introduced. It is known that lower-level features are more useful for detecting edge information in an input image than higher-level features. Thus, we build a cross-attention module that guides the classifier’s final prediction using the fused multi-level features. However, transformers are computationally demanding. Crucially, to make the proposed cross-attention module training tractable at the pixel level, this module is designed based on feature-score cross-covariance and episodically trained to be generalizable at inference time. Extensive experiments on PASCAL-5^i and COCO-20^i show that our PCN outperforms the state-of-the-art alternatives by large margins.

Abstract:
Unsupervised person re-identification is a challenging and promising task in computer vision. Recent unsupervised person re-identification methods have achieved great progress by training with pseudo labels. However, how to purify feature and label noise has been less explicitly studied in the unsupervised setting. To purify the features, we take into account two types of additional features from different local views to enrich the feature representation. The proposed multi-view features are carefully integrated into our cluster contrast learning to leverage more discriminative cues that the global feature easily ignores or represents with bias. To purify the label noise, we propose to take advantage of the knowledge of a teacher model in an offline scheme. Specifically, we first train a teacher model from noisy pseudo labels, and then use the teacher model to guide the learning of our student model. In our setting, the student model can converge fast under the supervision of the teacher model, thus reducing the interference of the noisy labels from which the teacher model greatly suffered. After carefully handling the noise and bias in the feature learning, our purification modules prove to be very effective for unsupervised person re-identification. Extensive experiments on two popular person re-identification datasets demonstrate the superiority of our method. In particular, our approach achieves state-of-the-art accuracy of 85.8% mAP and 94.5% Rank-1 on the challenging Market-1501 benchmark with ResNet-50 under the fully unsupervised setting. Code is available at: https://github.com/tengxiao14/Purification_ReID.

Abstract:
Uncertainty is inherent in machine learning methods, especially those for camouflaged object detection, which aims to finely segment objects concealed in the background. The strong “center bias” of the training dataset leads to models with poor generalization ability, as the models learn to find camouflaged objects around the image center, which we define as “model bias”. Further, due to the similar appearance of a camouflaged object and its surroundings, it is difficult to label the accurate scope of the camouflaged object, especially along object boundaries, which we term “data bias”. To effectively model the two types of biases, we resort to uncertainty estimation and introduce a predictive uncertainty estimation technique, where predictive uncertainty is the sum of model uncertainty and data uncertainty, to estimate the two types of biases simultaneously. Specifically, we present a predictive uncertainty estimation network (PUENet) that consists of a Bayesian conditional variational auto-encoder (BCVAE) to achieve predictive uncertainty estimation, and a predictive uncertainty approximation (PUA) module to avoid the expensive sampling process at test time. Experimental results show that our PUENet achieves both highly accurate prediction and reliable uncertainty estimation representing the biases within both model parameters and the datasets.

Abstract:
Perspective distortions and crowd variations make crowd counting a challenging task in computer vision. To tackle it, many previous works have used multi-scale architecture in deep neural networks (DNNs). Multi-scale branches can be either directly merged (e.g. by concatenation) or merged through the guidance of proxies (e.g. attentions) in the DNNs. Despite their prevalence, these combination methods are not sophisticated enough to deal with the per-pixel performance discrepancy over multi-scale density maps. In this work, we redesign the multi-scale neural network by introducing a hierarchical mixture of density experts, which hierarchically merges multi-scale density maps for crowd counting. Within the hierarchical structure, an expert competition and collaboration scheme is presented to encourage contributions from all scales; pixel-wise soft gating nets are introduced to provide pixel-wise soft weights for scale combinations in different hierarchies. The network is optimized using both the crowd density map and the local counting map, where the latter is obtained by local integration on the former. Optimizing both can be problematic because of their potential conflicts. We introduce a new relative local counting loss based on relative count differences among hard-predicted local regions in an image, which proves to be complementary to the conventional absolute error loss on the density map. Experiments show that our method achieves the state-of-the-art performance on five public datasets, i.e. ShanghaiTech, UCF_CC_50, JHU-CROWD++, NWPU-Crowd and Trancos. Our codes will be available at https://github.com/ZPDu/Redesigning-Multi-Scale-Neural-Network-for-Crowd-Counting.
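
The relation between the two supervision targets mentioned above, where the local counting map is obtained by local integration of the density map, can be sketched as a block-wise sum. This is a minimal illustration assuming non-overlapping square blocks; the block size and the `local_count_map` name are hypothetical and not taken from the paper.

```python
def local_count_map(density, block):
    """Sum a 2D density map over non-overlapping block x block regions."""
    rows, cols = len(density), len(density[0])
    return [
        [
            sum(
                density[y][x]
                for y in range(by, min(by + block, rows))
                for x in range(bx, min(bx + block, cols))
            )
            for bx in range(0, cols, block)
        ]
        for by in range(0, rows, block)
    ]


# Hypothetical 4x4 density map containing two people (total mass 2.0).
density = [
    [0.0, 0.5, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]
counts = local_count_map(density, 2)
total = sum(sum(row) for row in counts)
```

Because each local count is an integral of the density, the total count is preserved, while the two maps can still penalize a prediction differently, which is the potential conflict the abstract mentions.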

Abstract:
Denoising is one of the most significant procedures in the image processing pipeline. Nowadays, deep-learning-based algorithms achieve better denoising quality than traditional algorithms. However, the noise becomes severe in dark environments, where even the SOTA algorithms fail to achieve satisfactory performance. Besides, the high computational complexity of deep-learning-based denoising algorithms makes them hardware unfriendly and makes it difficult to process high-resolution images in real time. To address these issues, a novel low-light RAW denoising algorithm, Two-Stage-Denoising (TSDN), is proposed in this paper. In TSDN, denoising consists of two procedures: noise removal and image restoration. First, in the noise-removal stage, most noise is removed from the image, yielding an intermediate image from which the network can more easily recover the clean image. Then, in the restoration stage, the clean image is restored from the intermediate image. The TSDN is designed to be lightweight so that it is real-time and hardware friendly. However, the tiny network will be insufficient for satisfactory performance if directly trained from scratch. Therefore, we present an Expand-Shrink-Learning (ESL) method to train the TSDN. In the ESL method, the tiny network is first expanded to a larger one with a similar architecture but more channels and layers, which enhances the learning ability of the network because of the additional parameters. Second, the larger network is shrunk and restored to the original small network in fine-grained learning procedures, including Channel-Shrink-Learning (CSL) and Layer-Shrink-Learning (LSL). Experimental results demonstrate that the proposed TSDN achieves better performance (PSNR and SSIM) than other SOTA algorithms in dark environments. Besides, the model size of TSDN is one-eighth of that of the U-Net for denoising (a classical denoising network).

Abstract:
In this paper, we propose a discrepancy-aware meta-learning approach for zero-shot face manipulation detection, which aims to learn a discriminative model maximizing the generalization to unseen face manipulation attacks with the guidance of the discrepancy map. Unlike existing face manipulation detection methods that usually present algorithmic solutions to the known face manipulation attacks, where the same types of attacks are used to train and test the models, we define the detection of face manipulation as a zero-shot problem. We formulate the learning of the model as a meta-learning process and generate zero-shot face manipulation tasks for the model to learn the meta-knowledge shared by diversified attacks. We utilize the discrepancy map to keep the model focused on generalized optimization directions during the meta-learning process. We further incorporate a center loss to better guide the model to explore more effective meta-knowledge. Experimental results on the widely used face manipulation datasets demonstrate that our proposed approach achieves very competitive performance under the zero-shot setting.

Abstract:
When characterising a digital camera spectrally or colourimetrically, the camera response to a generally diffusely reflecting colour chart is often employed. The recorded responses to the light incident from each colour patch are typically not linearly related to the power of the irradiance on the chart, and the irradiance varies with position on the chart. This necessitates a linearisation of the responses. We present a new single image colour chart-based estimation method of responses, that are linearly related to camera response values known as ground truth. The method estimates the spatial geometry of the irradiance incident on the chart attenuated by lens vignetting and compensates individually for volumetric and per colour channel non-linearities, including compensation for physical scene and camera properties in a pipeline of successive signal transformations between the estimated linear and the given recorded responses. The estimation is controlled by introducing a novel Additivity Principle of linear responses, which is derived from the spectral reflectances of the coloured surfaces on the colour chart, observing that linear relations of the spectral reflectances are equal to the relations of the corresponding linear responses. Crucially, the additivity principle is not subject to metamerism. The method is fundamentally solely reliant on a one-shot set of one triplet of response values sampled from each patch of a colour chart with known spectral reflectances, where rendition level, gray scale, illuminant, camera sensor curves, irradiance geometry, vignetting, moderate specular reflection, colour space, colour correction, gamut correction and noise level are unknown.

Abstract:
Modern deep neural networks have made numerous breakthroughs in real-world applications, yet they remain vulnerable to some imperceptible adversarial perturbations. These tailored perturbations can severely disrupt the inference of current deep learning-based methods and may induce potential security hazards to artificial intelligence applications. So far, adversarial training methods have achieved excellent robustness against various adversarial attacks by involving adversarial examples during the training stage. However, existing methods primarily rely on optimizing injective adversarial examples correspondingly generated from natural examples, ignoring potential adversaries in the adversarial domain. This optimization bias can induce the overfitting of the suboptimal decision boundary, which heavily jeopardizes adversarial robustness. To address this issue, we propose Adversarial Probabilistic Training (APT) to bridge the distribution gap between the natural and adversarial examples via modeling the latent adversarial distribution. Instead of tedious and costly adversary sampling to form the probabilistic domain, we estimate the adversarial distribution parameters in the feature level for efficiency. Moreover, we decouple the distribution alignment based on the adversarial probability model and the original adversarial example. We then devise a novel reweighting mechanism for the distribution alignment by considering the adversarial strength and the domain uncertainty. Extensive experiments demonstrate the superiority of our adversarial probabilistic training method against various types of adversarial attacks in different datasets and scenarios.

Abstract:
A novel statistical ink drop displacement (IDD) printer model for the direct binary search (DBS) halftoning algorithm is proposed. It is intended primarily for pagewide inkjet printers that exhibit dot displacement errors. The tabular approach in the literature predicts the gray value of a printed pixel based on the halftone pattern in some neighborhood of that pixel. However, memory retrieval time and the complexity of memory requirements hamper its feasibility in printers that have a very large number of nozzles and produce ink drops that affect a large neighborhood. To avoid this problem, our IDD model embodies dot displacements by moving each perceived ink drop in the image from its nominal location to its actual location, rather than manipulating the average gray values. This enables DBS to directly compute the appearance of the final printout without retrieving values from a table. In so doing, the memory issue is eliminated and the computation efficiency is enhanced. The deterministic cost function of DBS is replaced by the expectation over the ensemble of the displacements for the proposed model such that the statistical behavior of the ink drops is accounted for. Experimental results show significant improvement in the quality of the printed image over the original DBS. Besides, the image quality obtained by the proposed approach appears to be slightly better than that obtained by the tabular approach.

Abstract:
Video-language pre-training has attracted considerable attention recently for its promising performance on various downstream tasks. Most existing methods utilize modality-specific or modality-joint representation architectures for cross-modality pre-training. Different from previous methods, this paper presents a novel architecture named Memory-augmented Inter-Modality Bridge (MemBridge), which uses learnable intermediate modality representations as the bridge for the interaction between videos and language. Specifically, in the transformer-based cross-modality encoder, we introduce learnable bridge tokens as the interaction approach, which means the video and language tokens can only perceive information from bridge tokens and themselves. Moreover, a memory bank is proposed to store abundant modality interaction information for adaptively generating bridge tokens according to different cases, enhancing the capacity and robustness of the inter-modality bridge. Through pre-training, MemBridge explicitly models the representations for more sufficient inter-modality interaction. Comprehensive experiments show that our approach achieves performance competitive with previous methods on various downstream tasks, including video-text retrieval, video captioning, and video question answering on multiple datasets, demonstrating the effectiveness of the proposed method. The code is available at https://github.com/jahhaoyang/MemBridge.
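
One way to picture the restricted interaction described above is as an attention mask. The sketch below rests on two assumptions that the abstract does not confirm: that "themselves" means tokens of the same modality, and that bridge tokens may attend to every token. The function name and the token ordering (video, bridge, text) are hypothetical.

```python
def bridge_attention_mask(n_video, n_bridge, n_text):
    """Boolean mask where mask[q][k] is True if query q may attend to key k."""
    kinds = ["video"] * n_video + ["bridge"] * n_bridge + ["text"] * n_text
    return [
        # Allowed: anything to or from a bridge token, or within one modality.
        [q == "bridge" or k == "bridge" or q == k for k in kinds]
        for q in kinds
    ]


# Hypothetical sequence: 2 video tokens, 1 bridge token, 2 text tokens.
mask = bridge_attention_mask(2, 1, 2)
```

Under this mask, video and text tokens never attend to each other directly, so any cross-modal information has to pass through the bridge tokens.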

Abstract:
Occluded person re-identification (ReID) aims to match person images captured in severely occluded environments. Current occluded ReID works mostly rely on auxiliary models or employ a part-to-part matching strategy. However, these methods may be sub-optimal since the auxiliary models are constrained by occlusion scenes and the matching strategy will deteriorate when both query and gallery set contain occlusion. Some methods attempt to solve this problem by applying image occlusion augmentation (OA) and have proven both effective and lightweight. However, previous OA-based methods have two defects: 1) The occlusion policy is fixed throughout the entire training and cannot be dynamically adjusted based on the current training status of the ReID network. 2) The position and area of the applied OA are completely random, without reference to the image content to choose the most suitable policy. To address these challenges, we propose a novel Content-Adaptive Auto-Occlusion Network (CAAO) that is able to dynamically select the proper occlusion region of an image based on its content and the current training status. Specifically, CAAO consists of two parts: the ReID network and the Auto-Occlusion Controller (AOC) module. AOC automatically generates the optimal OA policy based on the feature map extracted from the ReID network and applies occlusion on the images for ReID network training. An on-policy reinforcement learning based alternating training paradigm is proposed to iteratively update the ReID network and AOC module. Comprehensive experiments on occluded and holistic person ReID benchmarks demonstrate the superiority of CAAO.

Abstract:
Deep neural networks suffer from significant performance deterioration when there is a distribution shift between deployment and training. Domain Generalization (DG) aims to safely transfer a model to unseen target domains by only relying on a set of source domains. Although various DG approaches have been proposed, a recent study named DomainBed (Gulrajani and Lopez-Paz, 2020) reveals that most of them do not beat simple empirical risk minimization (ERM). To this end, we propose a general framework that is orthogonal to existing DG algorithms and can improve their performance consistently. Unlike previous DG works that rely on a static source model in the hope that it is universal, our proposed AdaODM adaptively modifies the source model at test time for different target domains. Specifically, we create multiple domain-specific classifiers upon a shared domain-generic feature extractor. The feature extractor and classifiers are trained in an adversarial way, where the feature extractor embeds the input samples into a domain-invariant space, and the multiple classifiers capture distinct decision boundaries, each of which relates to a specific source domain. During testing, distribution differences between target and source domains can be effectively measured by leveraging prediction disagreement among source classifiers. By fine-tuning source models to minimize the disagreement at test time, target-domain features are well aligned to the invariant feature space. We verify AdaODM on two popular DG methods, namely ERM and CORAL, and four DG benchmarks, namely VLCS, PACS, OfficeHome, and TerraIncognita. The results show that AdaODM stably improves the generalization capacity on unseen domains and achieves state-of-the-art performance.
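
The notion of prediction disagreement among source classifiers can be made concrete with a simple measure. The abstract does not specify the exact form, so the sketch below assumes the mean pairwise L1 distance between the classifiers' probability vectors for one sample; the `disagreement` name and the example values are hypothetical.

```python
from itertools import combinations


def disagreement(probs):
    """Mean pairwise L1 distance between per-classifier probability vectors."""
    pairs = list(combinations(probs, 2))
    return sum(
        sum(abs(a - b) for a, b in zip(p, q)) for p, q in pairs
    ) / len(pairs)


# Hypothetical outputs of three source-domain classifiers on one target sample.
aligned = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
conflicting = [[1.0, 0.0], [0.0, 1.0]]

low = disagreement(aligned)       # classifiers agree: sample looks in-distribution
high = disagreement(conflicting)  # classifiers conflict: sample looks shifted
```

A test-time adaptation step would then update the model to reduce this quantity on unlabeled target samples, in the spirit of the fine-tuning the abstract describes.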

Abstract:
Density-based and classification-based methods have dominated unsupervised anomaly detection in recent years, while reconstruction-based methods are rarely mentioned because of their poor reconstruction ability and low performance. However, the latter require no costly extra training samples for unsupervised training, which is more practical, so this paper focuses on improving reconstruction-based methods and proposes a novel Omni-frequency Channel-selection Reconstruction (OCR-GAN) network to handle the sensory anomaly detection task from the perspective of frequency. Concretely, we propose a Frequency Decoupling (FD) module to decouple the input image into different frequency components and model the reconstruction process as a combination of parallel omni-frequency image restorations, as we observe a significant difference in the frequency distribution of normal and abnormal images. Given the correlation among multiple frequencies, we further propose a Channel Selection (CS) module that performs frequency interaction among different encoders by adaptively selecting different channels. Abundant experiments demonstrate the effectiveness and superiority of our approach over different kinds of methods, e.g., achieving a new state-of-the-art 98.3 detection AUC on the MVTec AD dataset without extra training data, which markedly surpasses the reconstruction-based baseline by +38.1 and the current SOTA method by +0.3. The source code is available in the additional materials.

Abstract:
The Markov random field (MRF) for stereo matching can be solved using belief propagation (BP). However, the solution space grows significantly with the introduction of high-resolution stereo images and 3D plane labels, making the traditional BP algorithms impractical in inference time and convergence. We present an accurate and efficient hierarchical BP framework using the representation of the image segmentation pyramid (ISP). The pixel-level MRF can be solved by a top-down inference on the ISP. We design a hierarchy of MRF networks using the graph of superpixels at each ISP level. From the highest/image to the lowest/pixel level, the MRF models can be efficiently inferred with constant global guidance using the optimal labels of the previous level. The large texture-less regions can be handled effectively by the MRF model on a high level. The advanced 3D continuous labels and a novel support-points regularization are integrated into our framework for stereo matching. We provide a data-level parallelism implementation which is orders of magnitude faster than the best graph cuts (GC) algorithm. The proposed framework, HBP-ISP, outperforms the best GC algorithm on the Middlebury stereo matching benchmark.

Affiliations: School of Communication and Electronic Engineering, East China Normal University, Shanghai, China; School of Automation, Xi’an Jiao Tong University, Xi’an, China; School of Computing and Mathematical Sciences, University of Leicester, Leicester, U.K.; Faculty of Engineering, School of Computer Science, The University of Sydney, Darlington, NSW, Australia; Key Laboratory of Intelligent Interaction and Applications, Ministry of Industry and Information Technology, and the School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, Xi’an, China

Abstract:
Deep generative models have demonstrated successful applications in learning non-linear data distributions through a number of latent variables and these models use a non-linear function (generator) to map latent samples into the data space. On the other hand, the non-linearity of the generator implies that the latent space shows an unsatisfactory projection of the data space, which results in poor representation learning. This weak projection, however, can be addressed by a Riemannian metric, and we show that geodesics computation and accurate interpolations between data samples on the Riemannian manifold can substantially improve the performance of deep generative models. In this paper, a Variational spatial-Transformer AutoEncoder (VTAE) is proposed to minimize geodesics on a Riemannian manifold and improve representation learning. In particular, we carefully design the variational autoencoder with an encoded spatial-Transformer to explicitly expand the latent variable model to data on a Riemannian manifold, and obtain global context modelling. Moreover, to have smooth and plausible interpolations while traversing between two different objects’ latent representations, we propose a geodesic interpolation network different from the existing models that use linear interpolation with inferior performance. Experiments on benchmarks show that our proposed model can improve predictive accuracy and versatility over a range of computer vision tasks, including image interpolations, and reconstructions.

Abstract:
Person re-identification (re-ID) aims to match the same person across different cameras. However, most existing re-ID methods assume that people wear the same clothes in different views, which limits their performance in identifying target pedestrians who change clothes. Cloth-changing re-ID is a challenging problem because clothes, which occupy a large number of pixels in an image, become invalid or even misleading information. To tackle this problem, we propose a novel Multi-biometric Unified Network (MBUNet) for learning a robust cloth-changing re-ID model by exploiting clothing-independent cues. Specifically, we first introduce a multi-biological feature branch to extract a variety of biological features, such as the head, neck, and shoulders, to resist cloth-changing. Then, a differential feature attention module (DFAM) is embedded in this branch, which can extract discriminative fine-grained biological features. Besides, we design a differential recombination on max pooling (DRMP) strategy and simultaneously apply a direction-adaptive graph convolutional layer to mine more robust global and pose features. Finally, we propose a Lightweight Domain Adaptation Module (LDAM) that combines the attention mechanism before and after the waveblock to capture and enhance transferable features across scenarios. To further improve the performance of the model, we also integrate mAP optimization into the objective function of our model for joint training to solve the discrete optimization problem of mAP. Extensive experiments on five cloth-changing re-ID datasets demonstrate the advantages of our proposed MBUNet. The code is available at https://github.com/liyeabc/MBUNet.

Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; School of Computer Science, The University of Adelaide, Adelaide, SA, Australia; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; School of Engineering, Electrical, Electronic and Computer Engineering, The University of Western Australia, Perth, WA, Australia; Computer Science and Software Engineering, The University of Western Australia, Perth, WA, Australia; Department of Computer Science, Swansea University, Swansea, U.K.

Abstract:
Cross-resolution person re-identification (CRReID) is a challenging and practical problem that involves matching low-resolution (LR) query identity images against high-resolution (HR) gallery images. Query images often suffer from resolution degradation due to the different capturing conditions from real-world cameras. State-of-the-art solutions for CRReID either learn a resolution-invariant representation or adopt a super-resolution (SR) module to recover the missing information from the LR query. In this paper, we propose an alternative SR-free paradigm to directly compare HR and LR images via a dynamic metric that is adaptive to the resolution of a query image. We realize this idea by learning resolution-adaptive representations for cross-resolution comparison. We propose two resolution-adaptive mechanisms to achieve this. The first mechanism encodes the resolution specifics into different subvectors in the penultimate layer of the deep neural network, creating a varying-length representation. To better extract resolution-dependent information, we further propose to learn resolution-adaptive masks for intermediate residual feature blocks. A novel progressive learning strategy is proposed to train those masks properly. These two mechanisms are combined to boost the performance of CRReID. Experimental results show that the proposed method outperforms existing approaches and achieves state-of-the-art performance on multiple CRReID benchmarks.

Affiliations: Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China; Guangdong Key Laboratory of Intelligent Information Processing and the Shenzhen Key Laboratory of Advanced Machine Learning and Applications, College of Electronics and Information Engineering, Institute of Artificial Intelligence and Advanced Communication, Shenzhen University, Shenzhen, China; College of Mathematics and Statistics, Shenzhen University, Shenzhen, China

Abstract:
Semantic segmentation assigns a category to each pixel and has achieved great success in a supervised manner. However, it fails to generalize well to new domains due to the domain gap. Domain adaptation is a popular way to solve this issue, but it needs target data and cannot handle unavailable domains. In domain generalization (DG), the model is trained without the target data, and DG aims to generalize well to new, unavailable domains. Recent works reveal that shape recognition is beneficial for generalization, but it remains underexplored in semantic segmentation. Meanwhile, object shapes also exhibit discrepancies across domains, which is often ignored by existing works. Thus, we propose a Shape-Invariant Learning (SIL) framework to focus on learning shape-invariant representation for better generalization. Specifically, we first define the structural edge, which considers both the object boundary and the inner structure of the object to provide more discrimination cues. Then, a shape perception learning strategy including a texture feature discrepancy reduction loss and a structural feature discrepancy enlargement loss is proposed to enhance the shape perception ability of the model by embedding the structural edge as a shape prior. Finally, we use shape deformation augmentation to generate samples with the same content and different shapes. Essentially, our SIL framework performs implicit shape distribution alignment at the domain level to learn shape-invariant representation. Extensive experiments show that our SIL framework achieves state-of-the-art performance.

Abstract:
Multi-modal reasoning, which aims to capture logical and causal structures in visual content and associate them with cues from other modality inputs (e.g., texts) to perform various types of reasoning, is an important research topic in artificial intelligence (AI). Existing works for multi-modal reasoning mainly exploit offline learning, where the training samples of all types of reasoning tasks are assumed to be available at once. Here we focus on continual learning for multi-modal reasoning (i.e., continual multi-modal reasoning), where the model is required to continuously learn to solve novel types of multi-modal reasoning tasks in a lifelong fashion. Continual multi-modal reasoning is challenging since the model needs to effectively learn various types of new reasoning tasks while avoiding forgetting. Here we propose a novel brain-inspired experts collaboration network (Expo), which incorporates multiple learning blocks (experts). When encountering a new task, our network dynamically assembles and updates a set of task-specific experts that are most relevant to learning the current task, by either utilizing learned experts or exploring new experts. This enables effective learning of new tasks while consolidating previously learned reasoning skills. Moreover, to automatically find optimal task-specific experts, an effective expert selection strategy is designed. Extensive experiments demonstrate the efficacy of our model for continual multi-modal reasoning.

Abstract:
Tensor Robust Principal Component Analysis (TRPCA), which aims to recover the low-rank and sparse components from their sum, has drawn intensive interest in recent years. Most existing TRPCA methods adopt the tensor nuclear norm (TNN) and the tensor $\ell_1$ norm as the regularization terms for the low-rank and sparse components, respectively. However, TNN treats each singular value of the low-rank tensor $\boldsymbol{\mathcal{L}}$ equally, and the tensor $\ell_1$ norm shrinks each entry of the sparse tensor $\boldsymbol{\mathcal{S}}$ with the same strength. It has been shown that larger singular values generally correspond to prominent information of the data and should be less penalized. The same goes for entries of $\boldsymbol{\mathcal{S}}$ that are large in absolute value. In this paper, we propose a Double Auto-weighted TRPCA (DATRPCA) method. Instead of using predefined and manually set weights merely for the low-rank tensor as in previous works, DATRPCA automatically and adaptively assigns smaller weights and applies lighter penalization to significant singular values of the low-rank tensor and large entries of the sparse tensor simultaneously. We have further developed an efficient algorithm to implement DATRPCA based on the Alternating Direction Method of Multipliers (ADMM) framework. In addition, we have also established the convergence analysis of the proposed algorithm. The results on both synthetic and real-world data demonstrate the effectiveness of DATRPCA for low-rank tensor recovery, color image recovery and background modelling.
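
The idea of penalizing large values less can be illustrated with a weighted soft-thresholding step, the kind of shrinkage that typically appears inside ADMM updates. The following is only a minimal sketch: the weighting rule `1 / (|v| + eps)` and the helper names are assumptions chosen for illustration, not the weights derived in DATRPCA.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau (proximal operator of tau * |x|)."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0


def auto_weights(values, eps=1e-3):
    """Hypothetical rule: the weight is inversely proportional to magnitude,
    so large (significant) values receive small weights."""
    return [1.0 / (abs(v) + eps) for v in values]


def weighted_shrink(values, lam):
    """Apply soft-thresholding with a per-value threshold lam * weight."""
    weights = auto_weights(values)
    return [soft_threshold(v, lam * w) for v, w in zip(values, weights)]


if __name__ == "__main__":
    values = [10.0, 1.0, 0.1]  # e.g. singular values or sparse entries
    print("uniform :", [soft_threshold(v, 0.5) for v in values])
    print("weighted:", weighted_shrink(values, 0.5))
```

With uniform shrinkage every value loses the same amount, whereas the weighted variant barely changes the largest value and removes the smallest one entirely.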

Abstract:
Multiview clustering (MVC) aims to partition data into different groups by taking full advantage of the complementary information from multiple views. Most existing MVC methods fuse information of multiple views at the raw data level. They may suffer from performance degradation due to the redundant information contained in the raw data. Graph learning-based methods often heavily depend on one specific graph construction, which limits their practical applications. Moreover, they often require a computational complexity of $\mathcal{O}(n^3)$ because of matrix inversion or eigenvalue decomposition for each iterative computation. In this paper, we propose a consensus spectral rotation fusion (CSRF) method to learn a fused affinity matrix for MVC at the spectral embedding feature level. Specifically, we first introduce a CSRF model to learn a consensus low-dimensional embedding, which explores the complementary and consistent information across multiple views. We develop an alternating iterative optimization algorithm to solve the CSRF optimization problem, where a computational complexity of $\mathcal{O}(n^2)$ is required during each iterative computation. Then, the sparsity policy is introduced to design two different graph construction schemes, which are effectively integrated with the CSRF model. Finally, a multiview fused affinity matrix is constructed from the consensus low-dimensional embedding in spectral embedding space. We analyze the convergence of the alternating iterative optimization algorithm and provide an extension of CSRF for incomplete MVC. Extensive experiments on multiview datasets demonstrate the effectiveness and efficiency of the proposed CSRF method.

Abstract:
There has been a growing interest in counting crowds through computer vision and machine learning techniques in recent years. Although significant progress has been made, most existing methods rely heavily on fully-supervised learning and require a lot of labeled data. To alleviate this reliance, we focus on the semi-supervised learning paradigm. Usually, crowd counting is converted to a density estimation problem. The model is trained to predict a density map and obtains the total count by accumulating densities over all the locations. In particular, we find that there can be multiple density map representations for a given image that differ in the form of their probability distributions but agree on the total count. Therefore, we propose multiple representation learning to train several models. Each model focuses on a specific density representation and utilizes the count consistency between models to supervise unlabeled data. To bypass the explicit density regression problem, which makes a strong parametric assumption on the underlying density distribution, we propose an implicit density representation method based on the kernel mean embedding. Extensive experiments demonstrate that our approach outperforms state-of-the-art semi-supervised methods significantly.
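
The count-consistency idea can be sketched in a few lines: two density maps with different shapes can still sum to the same total, and the gap between their totals can serve as a supervision signal on unlabeled images. This is an illustrative sketch only; the function names and the absolute-difference penalty are assumptions, not the loss used in the paper.

```python
def total_count(density_map):
    """Accumulate densities over all locations to obtain the crowd count."""
    return sum(sum(row) for row in density_map)


def count_consistency_loss(map_a, map_b):
    """Hypothetical penalty: absolute gap between the two predicted counts."""
    return abs(total_count(map_a) - total_count(map_b))


if __name__ == "__main__":
    # Two representations of the same scene: a sharp one and a smooth one.
    sharp = [[0.0, 0.0, 0.0],
             [0.0, 2.0, 0.0],
             [0.0, 0.0, 0.0]]
    smooth = [[0.1, 0.2, 0.1],
              [0.2, 0.8, 0.2],
              [0.1, 0.2, 0.1]]
    print(total_count(sharp), total_count(smooth))
    print(count_consistency_loss(sharp, smooth))
```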

Affiliations: School of Engineering Science, University of Science and Technology of China, Hefei, China; School of Cyber Science and Engineering, University of Science and Technology of China, Hefei, China; Department of Precision Machinery and Precision Instruments and the Innovation Laboratory of WuHu State-Owned Factory of Machining, University of Science and Technology of China, Hefei, China; Department of Computer Science, University of Science and Technology of China, Hefei, China

Abstract:
Recently, learning-based multi-exposure fusion (MEF) methods have made significant improvements. However, these methods mainly focus on static scenes and are prone to generate ghosting artifacts when tackling a more common scenario, i.e., the input images include motion, due to the lack of a benchmark dataset and solution for dynamic scenes. In this paper, we fill this gap by creating an MEF dataset of dynamic scenes, which contains multi-exposure image sequences and their corresponding high-quality reference images. To construct such a dataset, we propose a ‘static-for-dynamic’ strategy to obtain multi-exposure sequences with motions and their corresponding reference images. To the best of our knowledge, this is the first MEF dataset of dynamic scenes. Correspondingly, we propose a deep dynamic MEF (DDMEF) framework to reconstruct a ghost-free high-quality image from only two differently exposed images of a dynamic scene. DDMEF is achieved through two steps: pre-enhancement-based alignment and privilege-information-guided fusion. The former pre-enhances the input images before alignment, which helps to address the misalignments caused by the significant exposure difference. The latter introduces a privilege distillation scheme with an information attention transfer loss, which effectively improves the deghosting ability of the fusion network. Extensive qualitative and quantitative experimental results show that the proposed method outperforms state-of-the-art dynamic MEF methods. The source code and dataset are released at https://github.com/Tx000/Deep_dynamicMEF.

Abstract:
Multi-shot coded aperture snapshot spectral imaging (CASSI) uses multiple measurement snapshots to encode the three-dimensional hyperspectral image (HSI). Increasing the number of snapshots multiplies the number of measurements, making CASSI systems more appropriate for spatially detailed or spectrally rich scenes. However, the reconstruction algorithms still face the challenge of being ineffective or inflexible. In this paper, we propose a plug-and-play (PnP) method that uses denoisers as priors for multi-shot CASSI. Specifically, the proposed PnP method is based on the primal-dual algorithm with linesearch (PDAL), which makes it flexible and applicable to any multi-shot CASSI mechanism. Furthermore, a new subspace-based nonlocal reweighted low-rank (SNRL) denoiser is presented to utilize the global spectral correlation and nonlocal self-similarity priors of HSI. By integrating the SNRL denoiser into PnP-PDAL, we show that the balloons scene (512 × 512 × 31) in the CAVE dataset can be recovered from two-snapshot compressive measurements with MPSNR above 50 dB. Experimental results demonstrate that our proposed method leads to significant improvements compared to the current state-of-the-art methods.

Affiliations: Department of Computer Science and Engineering and the AI Institute, Shanghai Jiao Tong University, Shanghai, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Wangxuan Institute of Computer Technology (WICT), Peking University, Beijing, China; National Engineering Research Center of Visual Technology, School of Computer Science, Peking University, Beijing, China; Institute of Image Communication and Network Engineering, AI Institute, Shanghai Jiao Tong University, Shanghai, China

Abstract:
Learned image compression methods have achieved satisfactory results in recent years. However, existing methods are typically designed for RGB format, which are not suitable for YUV420 format due to the variance of different formats. In this paper, we propose an information-guided compression framework using cross-component attention mechanism, which can achieve efficient image compression in YUV420 format. Specifically, we design a dual-branch advanced information-preserving module (AIPM) based on the information-guided unit (IGU) and attention mechanism. On the one hand, the dual-branch architecture can prevent changes in original data distribution and avoid information disturbance between different components. The feature attention block (FAB) can preserve the important information. On the other hand, IGU can efficiently utilize the correlations between Y and UV components, which can further preserve the information of UV by the guidance of Y. Furthermore, we design an adaptive cross-channel enhancement module (ACEM) to reconstruct the details by utilizing the relations from different components, which makes use of the reconstructed Y as the textural and structural guidance for UV components. Extensive experiments show that the proposed framework can achieve the state-of-the-art performance in image compression for YUV420 format. More importantly, the proposed framework outperforms Versatile Video Coding (VVC) with 8.37% BD-rate reduction on common test conditions (CTC) sequences on average. In addition, we propose a quantization scheme for context model without model retraining, which can overcome the cross-platform decoding error caused by the floating-point operations in context model and provide a reference approach for the application of neural codec on different platforms.

Abstract:
Compared to unsupervised domain adaptation, semi-supervised domain adaptation (SSDA) aims to significantly improve the classification performance and generalization capability of the model by leveraging the presence of a small amount of labeled data from the target domain. Several SSDA approaches have been developed to enable semantic-aligned feature confusion between labeled (or pseudo labeled) samples across domains; nevertheless, owing to the scarcity of semantic label information of the target domain, they have struggled to fully realize their potential. In this study, we propose a novel SSDA approach named Graph-based Adaptive Betweenness Clustering (G-ABC) for achieving categorical domain alignment, which enables cross-domain semantic alignment by mandating semantic transfer from labeled data of both the source and target domains to unlabeled target samples. In particular, a heterogeneous graph is initially constructed to reflect the pairwise relationships between labeled samples from both domains and unlabeled ones of the target domain. Then, to reduce the noisy connectivity in the graph, connectivity refinement is conducted by introducing two strategies, namely Confidence Uncertainty based Node Removal and Prediction Dissimilarity based Edge Pruning. Once the graph has been refined, Adaptive Betweenness Clustering is introduced to facilitate semantic transfer by using across-domain betweenness clustering and within-domain betweenness clustering, thereby propagating semantic label information from labeled samples across domains to unlabeled target data. Extensive experiments on three standard benchmark datasets, namely DomainNet, Office-Home, and Office-31, indicate that our method outperforms previous state-of-the-art SSDA approaches, demonstrating the superiority of the proposed G-ABC algorithm.

Abstract:
Semi-supervised video object segmentation is the task of segmenting the target in sequential frames given the ground truth mask in the first frame. Modern approaches usually utilize such a mask as pixel-level supervision and typically exploit pixel-to-pixel matching between the reference frame and current frame. However, matching at the pixel level, which overlooks the high-level information beyond local areas, often suffers from confusion caused by similar local appearances. In this paper, we present Prototypical Matching Networks (PMNet) - a novel architecture that integrates prototypes into matching-based video object segmentation frameworks as high-level supervision. Specifically, PMNet first divides the foreground and background areas into several parts according to the similarity to the global prototypes. The part-level prototypes and instance-level prototypes are generated by encapsulating the semantic information of identical parts and identical instances, respectively. To model the correlation between prototypes, the prototype representations are propagated to each other by reasoning on a graph structure. Then, PMNet stores both the pixel-level features and prototypes in the memory bank as the target cues. Three affinities, i.e., pixel-to-pixel affinity, prototype-to-pixel affinity, and prototype-to-prototype affinity, are derived to measure the similarity between the query frame and the features in the memory bank. The features aggregated from the memory bank using these affinities provide powerful discrimination from both the pixel-level and prototype-level perspectives. Extensive experiments conducted on four benchmarks demonstrate results superior to those of state-of-the-art video object segmentation techniques.

Abstract:
The long-tailed distribution is a common phenomenon in the real world. Extracted large scale image datasets inevitably demonstrate the long-tailed property and models trained with imbalanced data can obtain high performance for the over-represented categories, but struggle for the under-represented categories, leading to biased predictions and performance degradation. To address this challenge, we propose a novel de-biasing method named Inverse Image Frequency (IIF). IIF is a multiplicative margin adjustment transformation of the logits in the classification layer of a convolutional neural network. Our method achieves stronger performance than similar works and it is especially useful for downstream tasks such as long-tailed instance segmentation as it produces fewer false positive detections. Our extensive experiments show that IIF surpasses the state of the art on many long-tailed benchmarks such as ImageNet-LT, CIFAR-LT, Places-LT and LVIS, reaching 55.8% top-1 accuracy with ResNet50 on ImageNet-LT and 26.3% segmentation AP with MaskRCNN ResNet50 on LVIS. Code available at https://github.com/kostas1515/iif
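
A multiplicative logit adjustment of this kind can be sketched as follows. The weight `log(N / n_c)`, where `N` is the total number of training images and `n_c` the number in class `c`, is an assumption made here by analogy with inverse document frequency; the abstract does not state the exact form, and the function names are hypothetical.

```python
import math


def iif_weights(class_counts):
    """Assumed weighting: log(total images / images in the class).
    Rare classes receive larger weights."""
    total = sum(class_counts)
    return [math.log(total / n) for n in class_counts]


def adjust_logits(logits, weights):
    """Multiplicative adjustment: scale each class logit by its weight."""
    return [z * w for z, w in zip(logits, weights)]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


if __name__ == "__main__":
    counts = [900, 90, 10]    # head, medium, and tail class
    logits = [2.0, 1.5, 1.0]  # raw scores biased toward the head class
    adjusted = adjust_logits(logits, iif_weights(counts))
    print(argmax(logits), argmax(adjusted))
```

In this toy example the raw logits favor the head class, while the adjusted logits favor the tail class. The example uses positive logits only; scaling negative logits by a positive weight pushes them further down, so a practical implementation would need to handle that case deliberately.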

Abstract:
Despite remarkable success in a variety of computer vision applications, it is well-known that deep learning can fail catastrophically when presented with out-of-distribution data, where there are usually style differences between the training and test images. Toward addressing this challenge, we consider the domain generalization problem, wherein predictors are trained using data drawn from a family of related training (source) domains and then evaluated on a distinct and unseen test domain. Naively training a model on the aggregate set of data (pooled from all source domains) has been shown to perform suboptimally, since the information learned by that model might be domain-specific and generalize imperfectly to test domains. Data augmentation has been shown to be an effective approach to overcome this problem. However, its application has been limited to enforcing invariance to simple transformations like rotation, brightness change, etc. Such perturbations do not necessarily cover plausible real-world variations that preserve the semantics of the input (such as a change in the image style). In this paper, taking advantage of multiple source domains, we propose a novel approach to express and formalize robustness to these kinds of real-world image perturbations. The three key ideas underlying our formulation are (1) leveraging disentangled representations of the images to define different factors of variation, (2) generating perturbed images by changing the factors that compose the representations of the images, and (3) enforcing the learner (classifier) to be invariant to such changes in the images. We use image-to-image translation models to demonstrate the efficacy of this approach. Based on this, we propose a domain-invariant regularization (DIR) loss function that enforces invariant prediction of targets (class labels) across domains, which yields improved generalization performance. We demonstrate the effectiveness of our approach on several widely used datasets for the domain generalization problem, on all of which our results are competitive with the state-of-the-art.

Affiliations: School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; Department of Computer Science and Technology, Yangzhou University, Yangzhou, China; School of Computer and Electronic Information, Nanjing Normal University, Nanjing, China; State Key Laboratory of High Performance Computing and the Institute for Quantum Information, National University of Defense Technology, Changsha, China; College of Computer, Qinghai Normal University, Xining, China

Abstract:
Video hashing learns compact representations by mapping videos into a low-dimensional Hamming space and has achieved promising performance in large-scale video retrieval. It is challenging to effectively exploit temporal and spatial structure in an unsupervised setting. To fill this gap, this paper proposes Contrastive Transformer Hashing (CTH) for effective video retrieval. Specifically, CTH develops a bidirectional transformer autoencoder, based on which a visual reconstruction loss is proposed. CTH captures bidirectional correlations among frames more effectively than conventional unidirectional models. In addition, CTH devises a multi-modality contrastive loss to reveal the intrinsic structure among videos: it constructs inter-modality and intra-modality triplet sets so that the loss exploits inter-modality and intra-modality similarities simultaneously. We perform video retrieval tasks on four benchmark datasets, i.e., UCF101, HMDB51, SVW30, and FCVID, using the learned compact hash representation, and extensive empirical results demonstrate that the proposed CTH outperforms several state-of-the-art video hashing methods.

Abstract:
While adversarial training and its variants have been shown to be the most effective algorithms to defend against adversarial attacks, their extremely slow training process makes them hard to scale to large datasets like ImageNet. The key idea of recent works to accelerate adversarial training is to substitute multi-step attacks (e.g., PGD) with single-step attacks (e.g., FGSM). However, these single-step methods suffer from catastrophic overfitting, where the accuracy against PGD attack suddenly drops to nearly 0% during training, and the network totally loses its robustness. In this work, we study the phenomenon from the perspective of training instances. We show that catastrophic overfitting is instance-dependent, and fitting instances with larger input gradient norm is more likely to cause catastrophic overfitting. Based on our findings, we propose a simple but effective method, Adversarial Training with Adaptive Step size (ATAS). ATAS learns an instance-wise adaptive step size that is inversely proportional to its gradient norm. Our theoretical analysis shows that ATAS converges faster than the commonly adopted non-adaptive counterparts. Empirically, ATAS consistently mitigates catastrophic overfitting and achieves higher robust accuracy on CIFAR10, CIFAR100, and ImageNet when evaluated on various adversarial budgets. Our code is released at https://github.com/HuangZhiChao95/ATAS.
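
The notion of a step size that is inversely proportional to the input gradient norm can be sketched with a single FGSM-style update. This is a simplified illustration under stated assumptions: the constant `gamma`, the use of the current gradient norm alone, and the helper names are hypothetical, and the released ATAS code should be consulted for the actual update rule.

```python
import math


def adaptive_step(grad, gamma=0.1, eps=1e-8):
    """Assumed rule: step size = gamma / ||grad||_2 for this instance."""
    norm = math.sqrt(sum(g * g for g in grad))
    return gamma / (norm + eps)


def sign(v):
    return (v > 0) - (v < 0)


def single_step_perturb(delta, grad, budget, gamma=0.1):
    """One sign-gradient step, then projection onto the L-infinity ball
    of radius `budget` around the clean input."""
    step = adaptive_step(grad, gamma)
    return [max(-budget, min(budget, d + step * sign(g)))
            for d, g in zip(delta, grad)]


if __name__ == "__main__":
    small_grad = [0.3, -0.4]  # norm 0.5: gets a larger step
    large_grad = [3.0, -4.0]  # norm 5.0: gets a smaller step
    print(adaptive_step(small_grad), adaptive_step(large_grad))
    print(single_step_perturb([0.0, 0.0], large_grad, budget=0.05))
```

Instances with a large gradient norm, which the abstract identifies as more likely to trigger catastrophic overfitting, therefore move less per step.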

Abstract:
Out-of-distribution (OOD) detection aims to detect “unknown” data whose labels have not been seen during the in-distribution (ID) training process. Recent progress in representation learning gives rise to distance-based OOD detection that recognizes inputs as ID/OOD according to their relative distances to the training data of ID classes. Previous approaches calculate pairwise distances relying only on global image representations, which can be sub-optimal as the inevitable background clutter and intra-class variation may drive image-level representations from the same ID class far apart in a given representation space. In this work, we overcome this challenge by proposing Multi-scale OOD DEtection (MODE), the first framework leveraging both global visual information and local region details of images to maximally benefit OOD detection. Specifically, we first find that existing models pretrained by off-the-shelf cross-entropy or contrastive losses are unable to capture valuable local representations for MODE, due to the scale discrepancy between the ID training and OOD detection processes. To mitigate this issue and encourage locally discriminative representations in ID training, we propose Attention-based Local PropAgation (ALPA), a trainable objective that exploits a cross-attention mechanism to align and highlight the local regions of the target objects for pairwise examples. During test-time OOD detection, a Cross-Scale Decision (CSD) function is further devised on the most discriminative multi-scale representations to distinguish ID/OOD data more faithfully. We demonstrate the effectiveness and flexibility of MODE on several benchmarks – on average, MODE outperforms the previous state-of-the-art by up to 19.24% in FPR and 2.77% in AUROC. Code is available at https://github.com/JimZAI/MODE-OOD.

Abstract:
Pseudo supervision is regarded as the core idea in semi-supervised learning for semantic segmentation, and there is always a tradeoff between utilizing only the high-quality pseudo labels and leveraging all the pseudo labels. To address this, we propose a novel learning approach, called Conservative-Progressive Collaborative Learning (CPCL), in which two predictive networks are trained in parallel, and the pseudo supervision is implemented based on both the agreement and disagreement of the two predictions. One network seeks common ground via intersection supervision and is supervised by the high-quality labels to ensure more reliable supervision, while the other network reserves differences via union supervision and is supervised by all the pseudo labels to keep exploring with curiosity. Thus, the collaboration of conservative evolution and progressive exploration can be achieved. To reduce the influence of suspicious pseudo labels, the loss is dynamically re-weighted according to the prediction confidence. Extensive experiments demonstrate that CPCL achieves state-of-the-art performance for semi-supervised semantic segmentation.
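
The intersection and union forms of pseudo supervision can be sketched per pixel as follows. This is a rough illustration under assumptions: resolving disagreements by taking the more confident prediction, using the confidence directly as the loss weight, and all function names are choices made here, not details confirmed by the abstract.

```python
IGNORE = -1  # label for pixels excluded from the loss


def intersection_labels(pred_a, pred_b):
    """Conservative supervision: keep a pseudo label only where both
    networks agree; other pixels are ignored."""
    return [a if a == b else IGNORE for a, b in zip(pred_a, pred_b)]


def union_labels(pred_a, conf_a, pred_b, conf_b):
    """Progressive supervision: keep a label for every pixel. On
    disagreement, take the more confident prediction (an assumption),
    and return its confidence as a per-pixel loss weight."""
    labels, weights = [], []
    for a, ca, b, cb in zip(pred_a, conf_a, pred_b, conf_b):
        if ca >= cb:
            labels.append(a)
            weights.append(ca)
        else:
            labels.append(b)
            weights.append(cb)
    return labels, weights


if __name__ == "__main__":
    pred_a, conf_a = [0, 1, 2, 1], [0.9, 0.6, 0.8, 0.4]
    pred_b, conf_b = [0, 2, 2, 0], [0.8, 0.7, 0.9, 0.5]
    print(intersection_labels(pred_a, pred_b))
    print(union_labels(pred_a, conf_a, pred_b, conf_b))
```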

Abstract:
Existing supervised quantization methods usually learn the quantizers from pair-wise, triplet, or anchor-based losses, which only capture their relationship locally without aligning them globally. This may cause an inadequate use of the entire space and a severe intersection among different semantics, leading to inferior retrieval performance. Furthermore, to enable quantizers to learn in an end-to-end way, current practices usually relax the non-differentiable quantization operation by substituting it with softmax, which unfortunately is biased, leading to an unsatisfactory, suboptimal solution. To address the above issues, we present Spherical Centralized Quantization (SCQ), which contains a Priori Knowledge based Feature Alignment (PKFA) module for the global alignment of feature vectors, and an Annealing Regulation Semantic Quantization (ARSQ) module for low-biased optimization. Specifically, the PKFA module first applies Semantic Center Allocation (SCA) to obtain semantic centers based on prior knowledge, and then adopts Centralized Feature Alignment (CFA) to gather feature vectors based on corresponding semantic centers. The SCA and CFA globally optimize the inter-class separability and intra-class compactness, respectively. After that, the ARSQ module performs a partial-soft relaxation to tackle biases, and applies an Annealing Regulation Quantization loss to further address the local optimal solution. Experimental results show that our SCQ outperforms state-of-the-art algorithms by a large margin (2.1%, 3.6%, 5.5% mAP respectively) on CIFAR-10, NUS-WIDE, and ImageNet with a code length of 8 bits. Codes are publicly available: https://github.com/zzb111/Spherical-Centralized-Quantization.

Abstract:
Domain generalization aims to learn, from multiple source domains, knowledge that is invariant across different distributions while being semantically meaningful for downstream tasks, so as to improve the model’s generalization ability on unseen target domains. The fundamental objective is to understand the underlying “invariance” behind these observational distributions, and such invariance has been shown to have a close connection to causality. While many existing approaches make use of the property that causal features are invariant across domains, we consider the invariance of the average causal effect of the features to the labels. This invariance regularizes our training approach in which interventions are performed on features to enforce stability of the causal prediction by the classifier across domains. Our work thus sheds some light on the domain generalization problem by introducing invariance of the mechanisms into the learning process. Experiments on several benchmark datasets demonstrate the performance of the proposed method against SOTAs. The codes are available at: https://github.com/lithostark/Contrastive-ACE.

Abstract:
Accurate retinal fluid segmentation on Optical Coherence Tomography (OCT) images plays an important role in diagnosing and treating various eye diseases. State-of-the-art deep models have shown promising performance on OCT image segmentation given pixel-wise annotated training data. However, the learned model will achieve poor performance on OCT images that are obtained from different devices (domains) due to the domain shift issue. This problem largely limits the real-world application of OCT image segmentation since the types of devices usually differ from hospital to hospital. In this paper, we study the task of cross-domain OCT fluid segmentation, where we are given a labeled dataset of the source device (domain) and an unlabeled dataset of the target device (domain). The goal is to learn a model that can perform well on the target domain. To solve this problem, we propose a novel Structure-guided Cross-Attention Network (SCAN), which leverages the retinal layer structure to facilitate domain alignment. Our SCAN is inspired by the fact that the retinal layer structure is robust to domains and can reflect regions that are important to fluid segmentation. In light of this, we build our SCAN in a multi-task manner by jointly learning the retinal structure prediction and fluid segmentation. To exploit the mutual benefit between layer structure and fluid segmentation, we further introduce a cross-attention module to measure the correlation between the layer-specific feature and the fluid-specific feature, encouraging the model to concentrate on highly relevant regions during domain alignment. Moreover, an adaptation difficulty map is evaluated based on the retinal structure predictions from different domains, which forces the model to focus on hard regions during structure-aware adversarial learning. Extensive experiments on the three domains of the RETOUCH dataset demonstrate the effectiveness of the proposed method and show that our approach produces state-of-the-art performance on cross-domain OCT fluid segmentation.

Abstract:
In this paper, we introduce a variational Bayesian algorithm (VBA) for image blind deconvolution. Our generic VBA framework incorporates smoothness priors on the unknown blur/image and possible affine constraints (e.g., sum to one) on the blur kernel, integrating the VBA within a neural network paradigm following an unrolling methodology. The proposed architecture is trained in a supervised fashion, which allows us to optimally set two key hyperparameters of the VBA model and leads to further improvements in terms of resulting visual quality. Various experiments involving grayscale/color images and diverse kernel shapes are performed. The numerical examples illustrate the high performance of our approach when compared to state-of-the-art techniques based on optimization, Bayesian estimation, or deep learning.

Abstract:
Occluded person re-identification (ReID) is a challenging task due to increased background noise and incomplete foreground information. Although existing human parsing-based ReID methods can tackle this problem with semantic alignment at the finest pixel level, their performance is heavily affected by the human parsing model. Most supervised methods propose to train an extra human parsing model aside from the ReID model with cross-domain human parts annotation, suffering from expensive annotation cost and domain gap; unsupervised methods integrate a feature clustering-based human parsing process into the ReID model, but the lack of supervision signals leads to less satisfactory segmentation results. In this paper, we argue that the pre-existing information in the ReID training dataset can be directly used as supervision signals to train the human parsing model without any extra annotation. By integrating a weakly supervised human co-parsing network into the ReID network, we propose a novel framework that exploits shared information across different images of the same pedestrian, called the Human Co-parsing Guided Alignment (HCGA) framework. Specifically, the human co-parsing network is weakly supervised by three consistency criteria, namely global semantics, local space, and background. By feeding the semantic information and deep features from the person ReID network into the guided alignment module, features of the foreground and human parts can then be obtained for effective occluded person ReID. Experimental results on two occluded and two holistic datasets demonstrate the superiority of our method. Notably, on Occluded-DukeMTMC, it achieves 70.2% Rank-1 accuracy and 57.5% mAP.

Abstract:
Image-text matching is a challenging task due to the modality gap. Many recent methods focus on modeling entity relationships to learn a common embedding space of image and text. However, these methods suffer from distractions of entity relationships such as irrelevant visual regions in an image and noisy textual words in a text. In this paper, we propose an adaptive latent graph representation learning method to reduce the distractions of entity relationships for image-text matching. Specifically, we use an improved graph variational autoencoder to separate the distracting factors and latent factor of relationships and jointly learn latent textual graph representations, latent visual graph representations, and a visual-textual graph embedding space. We also introduce an adaptive cross-attention mechanism to perform feature attending on the latent graph representations across images and texts, thus further narrowing the modality gap to boost the matching performance. Extensive experiments on two public datasets, Flickr30K and COCO, show the effectiveness of our method.

Abstract:
Generalizable person Re-Identification (ReID) aims to learn ready-to-use cross-domain representations for direct cross-data evaluation, which has attracted growing attention in the recent computer vision (CV) community. In this work, we construct a structural causal model (SCM) among identity labels, identity-specific factors (clothing/shoes color etc.), and domain-specific factors (background, viewpoints etc.). According to the causal analysis, we propose a novel Domain Invariant Representation Learning for generalizable person Re-Identification (DIR-ReID) framework. Specifically, we propose to disentangle the identity-specific and domain-specific factors into two independent feature spaces, based on which an effective backdoor adjustment approximate implementation is proposed for serving as a causal intervention towards the SCM. Extensive experiments have been conducted, showing that DIR-ReID outperforms state-of-the-art (SOTA) methods on large-scale domain generalization (DG) ReID benchmarks.

Abstract:
Unlike the success of neural architecture search (NAS) in high-level vision tasks, it remains challenging to find computationally efficient and memory-efficient solutions to low-level vision problems such as image restoration through NAS. One of the fundamental barriers to differential NAS-based image restoration is the optimization gap between the super-network and the sub-architectures, causing instability during the searching process. In this paper, we present a novel approach to fill this gap in the image denoising application by connecting model-guided design (MoD) with NAS (MoD-NAS). Specifically, we propose to construct a new search space under a model-guided framework and develop more stable and efficient differential search strategies. MoD-NAS employs a highly reusable width search strategy and a densely connected search block to automatically select the operations of each layer as well as network width and depth via gradient descent. During the search process, the proposed MoD-NAS remains stable because of the smoother search space designed under the model-guided framework. Experimental results on several popular datasets show that our MoD-NAS method achieves comparable or even better PSNR performance than current state-of-the-art methods with fewer parameters, fewer FLOPs, and less testing time. The code associated with this paper is available at: https://see.xidian.edu.cn/faculty/wsdong/Projects/Mod-NAS.htm.

Abstract:
Video panoptic segmentation is an important but challenging task in computer vision. It not only performs panoptic segmentation of each frame, but also associates the same instance across adjacent frames. Due to the lack of temporal coherence modeling, most existing approaches often generate identity switches during instance association, and they cannot handle ambiguous segmentation boundaries caused by motion blur. To address these difficult issues, we introduce a simple yet effective Instance Motion Tendency Network (IMTNet) for video panoptic segmentation. It learns a global motion tendency map for instance association, and a hierarchical classifier for motion boundary refinement. Specifically, a Global Motion Tendency Module (GMTM) is designed to learn robust motion features from optical flows, which can directly associate each instance in the previous frame to the corresponding instance in the current frame. In addition, we propose a Motion Boundary Refinement Module (MBRM) to learn a hierarchical classifier to handle the boundary pixels of moving targets, which can effectively revise the inaccurate segmentation predictions. Experimental results on both Cityscapes and Cityscapes-VPS datasets show that our IMTNet outperforms most state-of-the-art approaches.

Abstract:
Nucleus segmentation is a challenging task due to the crowded distribution and blurry boundaries of nuclei. Recent approaches represent nuclei by means of polygons to differentiate between touching and overlapping nuclei and have accordingly achieved promising performance. Each polygon is represented by a set of centroid-to-boundary distances, which are in turn predicted by features of the centroid pixel for a single nucleus. However, using the centroid pixel alone does not provide sufficient contextual information for robust prediction and thus degrades the segmentation accuracy. To handle this problem, we propose a Context-aware Polygon Proposal Network (CPP-Net) for nucleus segmentation. First, we sample a point set rather than one single pixel within each cell for distance prediction. This strategy substantially enhances contextual information and thereby improves the robustness of the prediction. Second, we propose a Confidence-based Weighting Module, which adaptively fuses the predictions from the sampled point set. Third, we introduce a novel Shape-Aware Perceptual (SAP) loss that constrains the shape of the predicted polygons. Here, the SAP loss is based on an additional network that is pre-trained by means of mapping the centroid probability map and the pixel-to-boundary distance maps to a different nucleus representation. Extensive experiments justify the effectiveness of each component in the proposed CPP-Net. Finally, CPP-Net is found to achieve state-of-the-art performance on three publicly available databases, namely DSB2018, BBBC06, and PanNuke. Code of this paper is available at https://github.com/csccsccsccsc/cpp-net.

Abstract:
Salient Object Detection has boomed in recent years and achieved impressive performance on regular-scale targets. However, existing methods encounter performance bottlenecks in processing objects with scale variation, especially extremely large- or small-scale objects with asymmetric segmentation requirements, since they are inefficient in obtaining more comprehensive receptive fields. With this issue in mind, this paper proposes a framework named BBRF for Boosting Broader Receptive Fields, which includes a Bilateral Extreme Stripping (BES) encoder, a Dynamic Complementary Attention Module (DCAM) and a Switch-Path Decoder (SPD) with a new boosting loss under the guidance of a Loop Compensation Strategy (LCS). Specifically, we rethink the characteristics of the bilateral networks, and construct a BES encoder that separates semantics and details in an extreme way so as to obtain broader receptive fields and the ability to perceive extremely large- or small-scale objects. Then, the bilateral features generated by the proposed BES encoder can be dynamically filtered by the newly proposed DCAM. This module interactively provides spatial-wise and channel-wise dynamic attention weights for the semantic and detail branches of our BES encoder. Furthermore, we propose a Loop Compensation Strategy to boost the scale-specific features of multiple decision paths in SPD. These decision paths form a feature loop chain, which creates mutually compensating features under the supervision of the boosting loss. Experiments on five benchmark datasets demonstrate that the proposed BBRF copes well with scale variation and can reduce the Mean Absolute Error by over 20% compared with the state-of-the-art methods.

Abstract:
Few-shot object detection (FSOD) aims to adapt generic detectors to novel categories with only a few annotations, which is an important and realistic task. Although generic object detection has been widely studied over the past years, FSOD is underexplored. In this paper, we propose a novel Category Knowledge-guided Parameter Calibration (CKPC) framework to solve the FSOD task. We first propagate the category relation information to explore the representative category knowledge. Then, we explore the RoI-RoI and RoI-Category relations to capture the local-global context information to enhance the RoI (Region of Interest) features. Next, we project the knowledge representations of foreground categories into a parameter space by a linear transformation to generate the parameters of the category-level classifier. For the background, we learn a proxy category by summarizing the global characteristics of all foreground categories to help ensure the discrepancy between the foreground and background, which is then projected into the parameter space by the same linear transformation. Finally, we leverage the parameters of the category-level classifier to explicitly calibrate the instance-level classifier learned on the enhanced RoI features for both the foreground and background categories to improve the detection performance. We conduct extensive experiments on two popular FSOD benchmarks (i.e., Pascal VOC and MS COCO), and the experimental results show that the proposed framework can outperform state-of-the-art methods.

Abstract:
Regression-based multi-person pose estimation receives increasing attention because of its promising potential in achieving real-time inference. However, the challenges in long-range 2D offset regression have restricted the regression accuracy, leading to a considerable performance gap compared with heatmap-based methods. This paper tackles the challenge of long-range regression by simplifying the 2D offset regression to a classification task. We present a simple yet effective method, named PolarPose, to perform 2D regression in polar coordinates. By transforming the 2D offset regression in Cartesian coordinates to quantized orientation classification and 1D length estimation in polar coordinates, PolarPose effectively simplifies the regression task, making the framework easier to optimize. Moreover, to further boost the keypoint localization accuracy in PolarPose, we propose a multi-center regression to relieve the quantization error during orientation quantization. The resulting PolarPose framework is able to regress the keypoint offsets in a more reliable way, and achieves more accurate keypoint localization. Tested with the single-model and single-scale setting, PolarPose achieves an AP of 70.2% on the COCO test-dev dataset, outperforming the state-of-the-art regression-based methods. PolarPose also achieves promising efficiency, e.g., 71.5% AP at 21.5 FPS, 68.5% AP at 24.2 FPS, and 65.5% AP at 27.2 FPS on the COCO val2017 dataset, faster than the current state of the art.
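
The change of coordinates described in the abstract above can be sketched as follows. This is an illustrative sketch only, not the PolarPose implementation: the function names, the choice of 36 orientation bins, and decoding from the bin centre are assumptions, and the multi-center regression that reduces quantization error is not modelled.

```python
import math


def encode_offset(dx, dy, num_bins=36):
    """Map a Cartesian offset to (orientation bin index, length).

    num_bins is an assumed quantization granularity.
    """
    length = math.hypot(dx, dy)
    angle = math.atan2(dy, dx) % (2 * math.pi)  # wrap into [0, 2*pi)
    bin_width = 2 * math.pi / num_bins
    # The final modulo guards against floating-point wrap-around at 2*pi.
    bin_index = int(angle // bin_width) % num_bins
    return bin_index, length


def decode_offset(bin_index, length, num_bins=36):
    """Recover an approximate Cartesian offset from the bin centre."""
    bin_width = 2 * math.pi / num_bins
    angle = (bin_index + 0.5) * bin_width
    return length * math.cos(angle), length * math.sin(angle)


bin_index, length = encode_offset(3.0, 4.0)
print(bin_index, length)
print(decode_offset(bin_index, length))
```

Decoding from the bin centre leaves an angular error of at most half a bin width, so the position error grows with the offset length. That residual is the kind of quantization error the abstract says the multi-center regression is meant to relieve.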

Abstract:
Multi-modal clustering (MMC) aims to explore complementary information from diverse modalities to improve clustering performance. This article studies challenging problems in MMC methods based on deep neural networks. On one hand, most existing methods lack a unified objective to simultaneously learn the inter- and intra-modality consistency, resulting in a limited representation learning capacity. On the other hand, most existing processes are modeled for a finite sample set and cannot handle out-of-sample data. To handle the above two challenges, we propose a novel Graph Embedding Contrastive Multi-modal Clustering network (GECMC), which treats the representation learning and multi-modal clustering as two sides of one coin rather than two separate problems. In brief, we specifically design a contrastive loss that benefits from pseudo-labels to explore consistency across modalities. Thus, GECMC provides an effective way to maximize the similarities of intra-cluster representations while minimizing the similarities of inter-cluster representations at both inter- and intra-modality levels. As a result, the clustering and representation learning interact and jointly evolve in a co-training framework. After that, we build a clustering layer parameterized with cluster centroids, showing that GECMC can learn the clustering labels with given samples and handle out-of-sample data. GECMC yields results superior to those of 14 competitive methods on four challenging datasets. Codes and datasets are available: https://github.com/xdweixia/GECMC.

Abstract:
Real-world face super-resolution (SR) is a highly ill-posed image restoration task. The fully-cycled Cycle-GAN architecture is widely employed to achieve promising performance on face SR, but is prone to produce artifacts in challenging real-world cases, since joint participation in the same degradation branch will impact final performance due to the huge domain gap between real-world LR images and the synthetic LR ones obtained by generators. To better exploit the powerful generative capability of GAN for real-world face SR, in this paper, we establish two independent degradation branches in the forward and backward cycle-consistent reconstruction processes, respectively, while the two processes share the same restoration branch. Our Semi-Cycled Generative Adversarial Networks (SCGAN) is able to alleviate the adverse effects of the domain gap between the real-world LR face images and the synthetic LR ones, and to achieve accurate and robust face SR performance by the shared restoration branch regularized by both the forward and backward cycle-consistent learning processes. Experiments on two synthetic and two real-world datasets demonstrate that our SCGAN outperforms the state-of-the-art methods on recovering the face structures/details and quantitative metrics for real-world face SR. The code will be publicly released at https://github.com/HaoHou-98/SCGAN.

Abstract:
Plenoptic images and videos bearing rich information demand a tremendous amount of data storage and high transmission cost. While there has been much study on plenoptic image coding, investigations into plenoptic video coding have been very limited. We investigate the motion compensation (or so-called temporal prediction) for plenoptic video coding from a slightly different perspective by looking at the problem in the ray-space domain instead of in the conventional pixel domain. Here, we develop a novel motion compensation scheme for lenslet video under two sub-cases of ray-space motion, that is, integer ray-space motion and fractional ray-space motion. The proposed new scheme of light field motion-compensated prediction is designed such that it can be easily integrated into well-known video coding techniques such as HEVC. Experimental results compared to relevant existing methods have shown remarkable compression efficiency, with average gains of 20.03% and 21.76%, respectively, under the “Low Delay B” and “Random Access” configurations of HEVC.

Abstract:
Deep Metric Learning (DML) plays a critical role in various machine learning tasks. However, most existing deep metric learning methods with binary similarity are sensitive to noisy labels, which are widely present in real-world data. Since these noisy labels often cause a severe performance degradation, it is crucial to enhance the robustness and generalization ability of DML. In this paper, we propose an Adaptive Hierarchical Similarity Metric Learning method. It considers two types of noise-insensitive information, i.e., class-wise divergence and sample-wise consistency. Specifically, class-wise divergence can effectively excavate richer similarity information beyond binary in modeling by taking advantage of hyperbolic metric learning, while sample-wise consistency can further improve the generalization ability of the model using contrastive augmentation. More importantly, we design an adaptive strategy to integrate this information in a unified view. It is noteworthy that the new method can be extended to any pair-based metric loss. Extensive experimental results on benchmark datasets demonstrate that our method achieves state-of-the-art performance compared with current deep metric learning approaches.

Abstract:
Accurate correspondence selection between two images is of great importance for numerous feature matching based vision tasks. The initial correspondences established by off-the-shelf feature extraction methods usually contain a large number of outliers, and this often leads to the difficulty in accurately and sufficiently capturing contextual information for the correspondence learning task. In this paper, we propose a Preference-Guided Filtering Network (PGFNet) to address this problem. The proposed PGFNet is able to effectively select correct correspondences and simultaneously recover the accurate camera pose of matching images. Specifically, we first design a novel iterative filtering structure to learn the preference scores of correspondences for guiding the correspondence filtering strategy. This structure explicitly alleviates the negative effects of outliers so that our network is able to capture more reliable contextual information encoded by the inliers for network learning. Then, to enhance the reliability of preference scores, we present a simple yet effective Grouped Residual Attention block as our network backbone, by designing a feature grouping strategy, a feature grouping manner, a hierarchical residual-like manner and two grouped attention operations. We evaluate PGFNet by extensive ablation studies and comparative experiments on the tasks of outlier removal and camera pose estimation. The results demonstrate outstanding performance gains over the existing state-of-the-art methods on different challenging scenes. The code is available at https://github.com/guobaoxiao/PGFNet.

Abstract:
In fringe projection profilometry (FPP) based on temporal phase unwrapping (TPU), reducing the number of projected patterns has become one of the most important research goals in recent years. To remove the 2π ambiguity independently, this paper proposes a TPU method based on unequal phase-shifting code. The wrapped phase is still calculated from N-step conventional phase-shifting patterns with equal phase-shifting amounts to guarantee the measuring accuracy. In particular, a series of different phase-shifting amounts relative to the first phase-shifting pattern are set as codewords and encoded to different periods to generate one coded pattern. When decoding, a fringe order with a large number can be determined from the conventional and coded wrapped phases. In addition, we develop a self-correction method to eliminate the deviation between the edge of the fringe order and the 2π discontinuity. Thus, the proposed method can achieve TPU while needing to project only one additional coded pattern (e.g., 3+1), which can significantly benefit dynamic 3D shape reconstruction. The theoretical and experimental analyses verify that the proposed method is highly robust to the reflectivity of the isolated object while ensuring the measuring speed.

Abstract:
In this paper, we explore the problem of deep multi-view subspace clustering framework from an information-theoretic point of view. We extend the traditional information bottleneck principle to learn common information among different views in a self-supervised manner, and accordingly establish a new framework called Self-supervised Information Bottleneck based Multi-view Subspace Clustering (SIB-MSC). Inheriting the advantages from information bottleneck, SIB-MSC can learn a latent space for each view to capture common information among the latent representations of different views by removing superfluous information from the view itself while retaining sufficient information for the latent representations of other views. Actually, the latent representation of each view provides a kind of self-supervised signal for training the latent representations of other views. Moreover, SIB-MSC attempts to disengage the other latent space for each view to capture the view-specific information by introducing mutual information based regularization terms, so as to further improve the performance of multi-view subspace clustering. Extensive experiments on real-world multi-view data demonstrate that our method achieves superior performance over the related state-of-the-art methods.

Abstract:
Generalized zero-shot video classification aims to train a classifier to classify videos including both seen and unseen classes. Since the unseen videos have no visual information during training, most existing methods rely on generative adversarial networks to synthesize visual features for unseen classes through the class embedding of category names. However, most category names only describe the content of the video, ignoring other relational information. As a rich information carrier, videos include actions, performers, environments, etc., and the semantic descriptions of the videos also express the events from different levels of actions. In order to fully explore the video information, we propose a fine-grained feature generation model based on the video category name and its corresponding description texts for generalized zero-shot video classification. To obtain comprehensive information, we first extract content information from coarse-grained semantic information (category names) and motion information from fine-grained semantic information (description texts) as the base for feature synthesis. Then, we subdivide motion into hierarchical constraints on the fine-grained correlation between event and action from the feature level. In addition, we propose a loss that can avoid the imbalance of positive and negative examples to constrain the consistency of features at each level. In order to prove the validity of our proposed framework, we perform extensive quantitative and qualitative evaluations on two challenging datasets, UCF101 and HMDB51, and obtain a positive gain for the task of generalized zero-shot video classification.

Abstract:
Recently, contrastive learning based on augmentation invariance and instance discrimination has made great achievements, owing to its excellent ability to learn beneficial representations without any manual annotations. However, the natural similarity among instances conflicts with instance discrimination, which treats each instance as a unique individual. In order to explore the natural relationship among instances and integrate it into contrastive learning, we propose a novel approach in this paper, Relationship Alignment (RA), which forces different augmented views of current batch instances to maintain a consistent relationship with other instances. In order to perform RA effectively in existing contrastive learning frameworks, we design an alternating optimization algorithm where the relationship exploration step and alignment step are optimized respectively. In addition, we add an equilibrium constraint for RA to avoid the degenerate solution, and introduce the expansion handler to make it approximately satisfied in practice. In order to better capture the complex relationship among instances, we additionally propose Multi-Dimensional Relationship Alignment (MDRA), which aims to explore the relationship from multiple dimensions. In practice, we decompose the final high-dimensional feature space into a Cartesian product of several low-dimensional subspaces and perform RA in each subspace respectively. We validate the effectiveness of our approach on multiple self-supervised learning benchmarks and get consistent improvements compared with current popular contrastive learning methods. On the most commonly used ImageNet linear evaluation protocol, our RA obtains significant improvements over other methods, and our MDRA gets further improvements based on RA to achieve the best performance. The source code of our approach will be released soon.
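
The subspace decomposition mentioned in the abstract above can be illustrated with a small sketch. It is not the authors' code: the helper names, the equal-width split of the feature vector, the cosine similarity, and the temperature-scaled softmax used as the "relationship" are all assumptions chosen for illustration. The alignment step itself (making two augmented views agree on these distributions) is not shown.

```python
import math


def split_subspaces(feature, num_subspaces):
    """Split a feature vector into equal-width contiguous chunks."""
    if len(feature) % num_subspaces != 0:
        raise ValueError("feature length must be divisible by num_subspaces")
    step = len(feature) // num_subspaces
    return [feature[i * step:(i + 1) * step] for i in range(num_subspaces)]


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def relationship(query, anchors, temperature=0.1):
    """Softmax over cosine similarities between a query and anchors."""
    q = _normalize(query)
    logits = [
        sum(a * b for a, b in zip(q, _normalize(anchor))) / temperature
        for anchor in anchors
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_dim_relationships(query, anchors, num_subspaces):
    """Compute one relationship distribution per subspace."""
    q_parts = split_subspaces(query, num_subspaces)
    a_parts = [split_subspaces(a, num_subspaces) for a in anchors]
    return [
        relationship(q_parts[s], [parts[s] for parts in a_parts])
        for s in range(num_subspaces)
    ]


rels = multi_dim_relationships([1, 0, 0, 1], [[1, 0, 1, 0], [0, 1, 0, 1]], 2)
print(rels)
```

In the toy call, the query is closest to the first anchor in the first subspace and to the second anchor in the second, which a single relationship computed over the full vector would not distinguish.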

Abstract:
Pansharpening refers to the fusion of a low spatial-resolution multispectral image with a high spatial-resolution panchromatic image. In this paper, we propose a novel low-rank tensor completion (LRTC)-based framework with some regularizers for multispectral image pansharpening, called LRTCFPan. The tensor completion technique is commonly used for image recovery, but it cannot directly solve the pansharpening or, more generally, the super-resolution problem because of the formulation gap. Different from previous variational methods, we first formulate a pioneering image super-resolution (ISR) degradation model, which equivalently removes the downsampling operator and transforms the tensor completion framework. Under such a framework, the original pansharpening problem is solved by the LRTC-based technique with some deblurring regularizers. From the perspective of the regularizer, we further explore a local-similarity-based dynamic detail mapping (DDM) term to more accurately capture the spatial content of the panchromatic image. Moreover, the low-tubal-rank property of multispectral images is investigated, and the low-tubal-rank prior is introduced for better completion and global characterization. To solve the proposed LRTCFPan model, we develop an alternating direction method of multipliers (ADMM)-based algorithm. Comprehensive experiments on reduced-resolution (i.e., simulated) and full-resolution (i.e., real) data show that the LRTCFPan method significantly outperforms other state-of-the-art pansharpening methods. The code is publicly available at: https://github.com/zhongchengwu/code_LRTCFPan.

Abstract:
Beyond high accuracy, good interpretability is critical to deploying a face forgery detection model for visual content analysis. In this paper, we propose learning patch-channel correspondence to facilitate interpretable face forgery detection. Patch-channel correspondence aims to transform the latent features of a facial image into multi-channel interpretable features where each channel mainly encodes a corresponding facial patch. Towards this end, our approach embeds a feature reorganization layer into a deep neural network and simultaneously optimizes the classification task and the correspondence task via alternate optimization. The correspondence task accepts multiple zero-padded facial patch images and represents them as channel-aware interpretable representations. The task is solved by step-wise learning of channel-wise decorrelation and patch-channel alignment. Channel-wise decorrelation decouples latent features for class-specific discriminative channels to reduce feature complexity and channel correlation, while patch-channel alignment then models the pairwise correspondence between feature channels and facial patches. In this way, the learned model can automatically discover corresponding salient features associated with potential forgery regions during inference, providing discriminative localization of visualized evidence for face forgery detection while maintaining high detection accuracy. Extensive experiments on popular benchmarks clearly demonstrate the effectiveness of the proposed approach in interpreting face forgery detection without sacrificing accuracy. The source code is available at https://github.com/Jae35/IFFD.

Abstract:
Learning hash functions has been widely applied to large-scale image retrieval. Existing methods usually use CNNs to process an entire image at once, which is efficient for single-label images but not for multi-label images. First, these methods cannot fully exploit independent features of different objects in one image, resulting in some small object features with important information being ignored. Second, the methods cannot capture different semantic information from dependency relations among objects. Third, the existing methods ignore the impact of the imbalance between hard and easy training pairs, resulting in suboptimal hash codes. To address these issues, we propose a novel deep hashing method, termed multi-label hashing for dependency relations among multiple objectives (DRMH). We first utilize an object detection network to extract object feature representations to avoid ignoring small object features, then fuse object visual features with position features, and further capture dependency relations among objects using a self-attention mechanism. In addition, we design a weighted pairwise hash loss to solve the imbalance problem between hard and easy training pairs. Extensive experiments are conducted on multi-label datasets and zero-shot datasets, and the proposed DRMH outperforms many state-of-the-art hashing methods with respect to different evaluation metrics.

Abstract:
Camouflaged object detection, which aims to detect/segment the object(s) that blend in with their surroundings, remains challenging for deep models due to the intrinsic similarities between foreground objects and background surroundings. Ideally, an effective model should be capable of finding valuable clues from the given scene and integrating them into a joint learning framework to co-enhance the representation. Inspired by this observation, we propose a novel Mutual Graph Learning (MGL) model by shifting the conventional perspective of mutual learning from regular grids to the graph domain. Specifically, an image is decoupled by MGL into two task-specific feature maps: one for finding the rough location of the target and the other for capturing its accurate boundary details. Then, the mutual benefits can be fully exploited by recurrently reasoning about their high-order relations through graphs. It should be noted that our method is different from most mutual learning models, which model all between-task interactions with a shared function. To increase information interactions, MGL is built with typed functions for dealing with different complementary relations. To overcome the accuracy loss caused by interpolation to higher resolution and the computational redundancy resulting from recurrent learning, the S-MGL is equipped with a multi-source attention contextual recovery module, called R-MGL_v2, which uses the pixel feature information iteratively. Experiments on challenging datasets, including CHAMELEON, CAMO, COD10K, and NC4K, demonstrate the effectiveness of our MGL with superior performance to existing state-of-the-art methods. The code can be found at https://github.com/fanyang587/MGL.

Affiliations: School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan, China; Information Systems Technology and Design Pillar, Singapore University of Technology and Design, Tampines, Singapore; College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China; College of Computer and Information Technology, China Three Gorges University, Yichang, China; School of Information Science and Engineering, Yunnan University, Kunming, China

Abstract:
Most facial landmark detection methods predict landmarks by mapping the input facial appearance features to landmark heatmaps and have achieved promising results. However, when the face image suffers from large poses, heavy occlusions, and complicated illumination, they cannot learn discriminative feature representations and effective facial shape constraints, nor can they accurately predict the value of each element in the landmark heatmap, which limits their detection accuracy. To address this problem, we propose a novel Reference Heatmap Transformer (RHT) by introducing reference heatmap information for more precise facial landmark detection. The proposed RHT consists of a Soft Transformation Module (STM) and a Hard Transformation Module (HTM), which cooperate with each other to encourage the accurate transformation of the reference heatmap information and facial shape constraints. Then, a Multi-Scale Feature Fusion Module (MSFFM) is proposed to fuse the transformed heatmap features and the semantic features learned from the original face images to enhance feature representations for producing more accurate target heatmaps. To the best of our knowledge, this is the first study to explore how to enhance facial landmark detection by transforming the reference heatmap information. The experimental results on challenging benchmark datasets demonstrate that our proposed method outperforms the state-of-the-art methods in the literature.

Abstract:
Source-free unsupervised domain adaptation (SFUDA) aims to learn a target domain model using unlabeled target data and the knowledge of a well-trained source domain model. Most previous SFUDA works focus on inferring semantics of target data based on the source knowledge. Without measuring the transferability of the source knowledge, these methods insufficiently exploit the source knowledge, and fail to identify the reliability of the inferred target semantics. However, existing transferability measurements require either source data or target labels, which are infeasible in SFUDA. To this end, firstly, we propose a novel Uncertainty-induced Transferability Representation (UTR), which leverages uncertainty as the tool to analyse the channel-wise transferability of the source encoder in the absence of the source data and target labels. The domain-level UTR unravels how transferable the encoder channels are to the target domain and the instance-level UTR characterizes the reliability of the inferred target semantics. Secondly, based on the UTR, we propose a novel Calibrated Adaption Framework (CAF) for SFUDA, including i) the source knowledge calibration module that guides the target model to learn the transferable source knowledge and discard the non-transferable one, and ii) the target semantics calibration module that calibrates the unreliable semantics. With the help of the calibrated source knowledge and the target semantics, the model adapts to the target domain safely and ultimately better. We verified the effectiveness of our method using experimental results and demonstrated that the proposed method achieves state-of-the-art performances on the three SFUDA benchmarks. Code is available at https://github.com/SPIresearch/UTR.

Abstract:
Infrared image segmentation is a challenging task due to the interference of complex backgrounds and the appearance inhomogeneity of foreground objects. A critical defect of fuzzy clustering for infrared image segmentation is that the method treats image pixels or fragments in isolation. In this paper, we propose to adopt self-representation from sparse subspace clustering in fuzzy clustering, aiming to introduce global correlation information into fuzzy clustering. Meanwhile, to apply sparse subspace clustering to non-linear samples from an infrared image, we leverage membership from fuzzy clustering to improve conventional sparse subspace clustering. The contributions of this paper are fourfold. First, by introducing self-representation coefficients modeled in sparse subspace clustering based on high-dimensional features, fuzzy clustering is capable of utilizing global information to resist complex backgrounds as well as intensity inhomogeneity of objects, so as to improve clustering accuracy. Second, fuzzy membership is exploited in the sparse subspace clustering framework. Thereby, the bottleneck of conventional sparse subspace clustering methods, namely that they can barely be applied to non-linear samples, can be surmounted. Third, as we integrate fuzzy clustering and subspace clustering in a unified framework, features from two different aspects are employed, contributing to precise clustering results. Finally, we further incorporate neighbor information into clustering, thus effectively solving the uneven intensity problem in infrared image segmentation. Experiments examine the feasibility of the proposed methods on various infrared images. Segmentation results demonstrate the effectiveness and efficiency of the proposed methods and their superiority over other fuzzy clustering methods and sparse subspace clustering methods.

Abstract:
Visual cryptography scheme (VCS) serves as an effective tool in image security. Size-invariant VCS (SI-VCS) can solve the pixel expansion problem in traditional VCS. On the other hand, it is anticipated that the contrast of the recovered image in SI-VCS should be as high as possible. The investigation of contrast optimization for SI-VCS is carried out in this article. We develop an approach to optimize the contrast by stacking $t$ ($k \le t \le n$) shadows in $(k, n)$-SI-VCS. Generally, a contrast-maximizing problem is linked with a $(k, n)$-SI-VCS, where the contrast by $t$ shadows is considered as an objective function. An ideal contrast by $t$ shadows can be produced by addressing this problem using linear programming. However, there exist $(n-k+1)$ different contrasts in a $(k, n)$ scheme. An optimization-based design is further introduced to provide multiple optimal contrasts. These $(n-k+1)$ different contrasts are regarded as objective functions, and the design is transformed into a multi-contrast-maximizing problem. The ideal point method and lexicographic method are adopted to address this problem. Additionally, if the Boolean XOR operation is used for secret recovery, a technique is also provided to offer multiple maximum contrasts. The effectiveness of the proposed schemes is verified by extensive experiments. Comparisons show that the proposed schemes provide a significant improvement in contrast.
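
To make the notion of contrast concrete, the following minimal sketch enumerates a textbook probabilistic (2, 2) size-invariant scheme. It is not the optimized construction of the abstract above. The share collections `C_WHITE` and `C_BLACK`, the helper names, and the contrast definition (probability that a black secret pixel is recovered as black minus the same probability for a white secret pixel) are assumptions chosen for illustration only.

```python
# Illustrative (2, 2) probabilistic size-invariant VCS; 1 = black, 0 = white.
# Each tuple is one equally likely assignment of a secret pixel to two shadows.
C_WHITE = [(0, 0), (1, 1)]  # white secret pixel: both shadows get the same bit
C_BLACK = [(0, 1), (1, 0)]  # black secret pixel: shadows get complementary bits


def stack_or(bits):
    """Physical stacking of transparencies: black wins (Boolean OR)."""
    out = 0
    for bit in bits:
        out |= bit
    return out


def stack_xor(bits):
    """XOR-based recovery, which needs a computing device."""
    out = 0
    for bit in bits:
        out ^= bit
    return out


def prob_black(collection, combine):
    """Probability that the recovered pixel is black for a uniform draw."""
    results = [combine(bits) for bits in collection]
    return sum(results) / len(results)


def contrast(combine):
    """Assumed contrast: P(black | black secret) - P(black | white secret)."""
    return prob_black(C_BLACK, combine) - prob_black(C_WHITE, combine)


or_contrast = contrast(stack_or)
xor_contrast = contrast(stack_xor)
print(or_contrast, xor_contrast)
```

In this toy scheme, OR stacking recovers white pixels correctly only half of the time, while XOR recovery is exact, which is why the two recovery operations are treated as separate optimization targets.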

Abstract:
Deep unfolding network (DUN) that unfolds the optimization algorithm into a deep neural network has achieved great success in compressive sensing (CS) due to its good interpretability and high performance. Each stage in DUN corresponds to one iteration in optimization. At the test time, all the sampling images generally need to be processed by all stages, which comes at a price of computation burden and is also unnecessary for the images whose contents are easier to restore. In this paper, we focus on CS reconstruction and propose a novel Dynamic Path-Controllable Deep Unfolding Network (DPC-DUN). DPC-DUN with our designed path-controllable selector can dynamically select a rapid and appropriate route for each image and is slimmable by regulating different performance-complexity tradeoffs. Extensive experiments show that our DPC-DUN is highly flexible and can provide excellent performance and dynamic adjustment to get a suitable tradeoff, thus addressing the main requirements to become appealing in practice. Codes are available at https://github.com/songjiechong/DPC-DUN.
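
The statement that each stage corresponds to one iteration of an optimization algorithm can be illustrated with plain ISTA, a classical solver for sparse recovery. This is only a sketch under stated assumptions: it uses a fixed soft-threshold in place of learned modules, it contains no path-controllable selector, and the names `ista_stage`, `unfolded_ista`, and all parameter values are hypothetical rather than taken from DPC-DUN.

```python
import numpy as np


def soft_threshold(v, theta):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)


def ista_stage(x, y, phi, step, lam):
    """One 'stage': gradient step on the data term, then shrinkage."""
    grad = phi.T @ (phi @ x - y)
    return soft_threshold(x - step * grad, step * lam)


def unfolded_ista(y, phi, num_stages, lam=0.05):
    """Run a fixed number of stages; a DUN would learn per-stage modules."""
    step = 1.0 / np.linalg.norm(phi, 2) ** 2  # 1 / Lipschitz constant
    x = np.zeros(phi.shape[1])
    for _ in range(num_stages):
        x = ista_stage(x, y, phi, step, lam)
    return x


def objective(x, y, phi, lam=0.05):
    """LASSO objective minimized by ISTA."""
    return 0.5 * np.sum((phi @ x - y) ** 2) + lam * np.sum(np.abs(x))


rng = np.random.default_rng(0)
phi = rng.standard_normal((30, 100)) / np.sqrt(30)  # synthetic sampling matrix
x_true = np.zeros(100)
x_true[[3, 40, 77]] = [1.0, -2.0, 1.5]
y = phi @ x_true

x_few = unfolded_ista(y, phi, num_stages=5)
x_many = unfolded_ista(y, phi, num_stages=200)
print(objective(x_few, y, phi), objective(x_many, y, phi))
```

Running fewer stages costs less but leaves a higher objective value, which is the performance-complexity tradeoff that a per-image choice of stages would aim to balance.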

Abstract:
Conventional few-shot classification (FSC) aims to recognize samples from novel classes given limited labeled data. Recently, domain generalization FSC (DG-FSC) has been proposed with the goal of recognizing novel class samples from unseen domains. DG-FSC poses considerable challenges to many models due to the domain shift between base classes (used in training) and novel classes (encountered in evaluation). In this work, we make two novel contributions to tackle DG-FSC. Our first contribution is to propose Born-Again Network (BAN) episodic training and comprehensively investigate its effectiveness for DG-FSC. As a specific form of knowledge distillation, BAN has been shown to achieve improved generalization in conventional supervised classification with a closed-set setup. This improved generalization motivates us to study BAN for DG-FSC, and we show that BAN is promising for addressing the domain shift encountered in DG-FSC. Building on the encouraging findings, our second (major) contribution is to propose Few-Shot BAN (FS-BAN), a novel BAN approach for DG-FSC. Our proposed FS-BAN includes novel multi-task learning objectives: Mutual Regularization, Mismatched Teacher, and Meta-Control Temperature. Each of these is specifically designed to overcome central and unique challenges in DG-FSC, namely overfitting and domain discrepancy. We analyze different design choices of these techniques. We conduct comprehensive quantitative and qualitative analysis and evaluation over six datasets and three baseline models. The results suggest that our proposed FS-BAN consistently improves the generalization performance of baseline models and achieves state-of-the-art accuracy for DG-FSC. Project Page: yunqing-me.github.io/Born-Again-FS/.

Abstract:
Camouflaged object detection (COD) aims to discover objects that blend in with the background due to similar colors, textures, etc. Existing deep learning methods do not systematically illustrate the key tasks in COD, which seriously hinders the improvement of its performance. In this paper, we introduce the concept of focus areas, which represent regions containing discernible colors or textures, and develop a two-stage focus scanning network for camouflaged object detection. Specifically, a novel encoder-decoder module is first designed to determine a region where the focus areas may appear. In this process, a multi-layer Swin transformer is deployed to encode global context information between the object and the background, and a novel cross-connection decoder is proposed to fuse cross-layer textures or semantics. Then, we utilize multi-scale dilated convolution to obtain discriminative features at different scales in focus areas. Meanwhile, a dynamic difficulty aware loss is designed to guide the network to pay more attention to structural details. Extensive experimental results on the benchmarks, including CAMO, CHAMELEON, COD10K, and NC4K, illustrate that the proposed method performs favorably against other state-of-the-art methods.

Abstract:
Multitemporal hyperspectral unmixing (MTHU) is a fundamental tool in the analysis of hyperspectral image sequences. It reveals the dynamical evolution of the materials (endmembers) and of their proportions (abundances) in a given scene. However, adequately accounting for the spatial and temporal variability of the endmembers in MTHU is challenging, and has not been fully addressed so far in unsupervised frameworks. In this work, we propose an unsupervised MTHU algorithm based on variational recurrent neural networks. First, a stochastic model is proposed to represent both the dynamical evolution of the endmembers and their abundances, as well as the mixing process. Moreover, a new model based on a low-dimensional parametrization is used to represent spatial and temporal endmember variability, significantly reducing the number of variables to be estimated. We propose to formulate MTHU as a Bayesian inference problem. However, this problem does not have an analytical solution due to the nonlinearity and non-Gaussianity of the model. Thus, we propose a solution based on deep variational inference, in which the posterior distribution of the estimated abundances and endmembers is represented by using a combination of recurrent neural networks and a physically motivated model. The parameters of the model are learned using stochastic backpropagation. Experimental results show that the proposed method outperforms state-of-the-art MTHU algorithms.

Affiliations: Shaanxi Key Laboratory of Clothing Intelligence, the School of Computer Science, and the School of Electronics and Information, Xi’an Polytechnic University, Xi’an, China; School of Electronics and Information, Xi’an Polytechnic University, Xi’an, China; Video and Image Processing System Laboratory, School of Electronic Engineering, Xidian University, Xi’an, China; Chongqing Key Laboratory of Image Cognition, Chongqing University of Posts and Telecommunications, Chongqing, China

Abstract:
Low-light images suffer from several complicated degradation factors such as poor brightness, low contrast, color degradation, and noise. Most previous deep learning-based approaches, however, only learn a single-channel mapping between the input low-light images and the expected normal-light images, which is insufficient to deal with low-light images captured in uncertain imaging environments. Moreover, an overly deep network architecture is not conducive to recovering low-light images due to the extremely low pixel values. To surmount the aforementioned issues, in this paper we propose a novel multi-branch and progressive network (MBPNet) for low-light image enhancement. To be more specific, the proposed MBPNet is comprised of four different branches which build the mapping relationship at different scales. A subsequent fusion is performed on the outputs obtained from the four branches to produce the final enhanced image. Furthermore, to better handle the difficulty of delivering structural information of low-light images with low pixel values, a progressive enhancement strategy is applied in the proposed method, where four convolutional long short-term memory networks (LSTM) are embedded in the four branches and a recurrent network architecture is developed to iteratively perform the enhancement process. In addition, a joint loss function consisting of the pixel loss, the multi-scale perceptual loss, the adversarial loss, the gradient loss, and the color loss is framed to optimize the model parameters. To evaluate the effectiveness of the proposed MBPNet, three widely used benchmark databases are used for both quantitative and qualitative assessments. The experimental results confirm that the proposed MBPNet clearly outperforms other state-of-the-art approaches in terms of quantitative and qualitative results. The code will be available at https://github.com/kbzhang0505/MBPNet.

Abstract:
In real-world scenarios, collected and annotated data often exhibit the characteristics of multiple classes and long-tailed distribution. Additionally, label noise is inevitable in large-scale annotations and hinders the applications of learning-based models. Although many deep learning based methods have been proposed for handling long-tailed multi-label recognition or label noise respectively, learning with noisy labels in long-tailed multi-label visual data has not been well-studied because of the complexity of long-tailed distribution entangled with multi-label correlation. To tackle such a critical yet thorny problem, this paper focuses on reducing noise based on some inherent properties of multi-label classification and long-tailed learning under noisy cases. In detail, we propose a Stitch-Up augmentation to synthesize a cleaner sample, which directly reduces multi-label noise by stitching up multiple noisy training samples. Equipped with Stitch-Up, a Heterogeneous Co-Learning framework is further designed to leverage the inconsistency between long-tailed and balanced distributions, yielding cleaner labels for more robust representation learning with noisy long-tailed data. To validate our method, we build two challenging benchmarks, named VOC-MLT-Noise and COCO-MLT-Noise, respectively. Extensive experiments are conducted to demonstrate the effectiveness of our proposed method. Compared to a variety of baselines, our method achieves superior results.

Abstract:
Salient object detection (SOD) is an important task in computer vision that aims to identify visually conspicuous regions in images. RGB-Thermal SOD combines two spectra to achieve better segmentation results. However, most existing methods for RGB-T SOD use boundary maps to learn sharp boundaries, which leads to sub-optimal performance as they ignore the interactions between isolated boundary pixels and other confident pixels. To address this issue, we propose a novel position-aware relation learning network (PRLNet) for RGB-T SOD. PRLNet explores the distance and direction relationships between pixels by designing an auxiliary task and optimizing the feature structure to strengthen intra-class compactness and inter-class separation. Our method consists of two main components: a signed distance map auxiliary module (SDMAM) and a feature refinement approach with direction field (FRDF). SDMAM improves the encoder feature representation by considering the distance relationship between foreground-background pixels and boundaries, which increases the inter-class separation between foreground and background features. FRDF rectifies the features of boundary neighborhoods by exploiting the features inside salient objects. It utilizes the direction relationship of object pixels to enhance the intra-class compactness of salient features. In addition, we construct a transformer-based decoder to decode the multispectral feature representation. Experimental results on three public RGB-T SOD datasets demonstrate that our proposed method not only outperforms the state-of-the-art methods, but also can be integrated with different backbone networks in a plug-and-play manner. An ablation study and visualizations further support the validity and interpretability of our method.

Abstract:
Visual object navigation is an essential task of embodied AI, in which the agent navigates to the goal object on the user’s demand. Previous methods often focus on single-object navigation. However, in real life, human demands are generally continuous and multiple, requiring the agent to carry out multiple tasks in sequence. These demands can be addressed by repeatedly performing previous single-task methods. However, when multiple tasks are divided into several independent tasks without global optimization across them, the agent’s trajectories may overlap, reducing the efficiency of navigation. In this paper, we propose an efficient reinforcement learning framework with a hybrid policy for multi-object navigation, aiming to eliminate ineffective actions as far as possible. First, the visual observations are embedded to detect the semantic entities (such as objects). The detected objects are then memorized and projected into semantic maps, which can also be regarded as a long-term memory of the observed environment. Then a hybrid policy consisting of exploration and long-term planning strategies is proposed to predict the potential target position. In particular, when the target is directly oriented, the policy function makes a long-term plan for the target based on the semantic map, which is implemented by a sequence of motion actions. Alternatively, when the target is not oriented, the policy function estimates the target’s potential position by exploring the most likely objects (positions) that are closely related to the target. The relation between different objects is obtained from prior knowledge, which is used to predict the potential target position by integrating it with the memorized semantic map. A path to the potential target is then planned by the policy function. We evaluate our proposed method on two large-scale 3D realistic environment datasets, Gibson and Matterport3D, and the experimental results demonstrate the effectiveness and generalization of the proposed method.

Abstract:
Single-image deraining aims to restore an image that is degraded by rain streaks, where the long-standing bottleneck lies in how to disentangle the rain streaks from the given rainy image. Despite the progress made by substantial existing works, several crucial questions (e.g., how to distinguish rain streaks from the clean image, how to disentangle rain streaks from low-frequency pixels, and how to further prevent blurry edges) have not been well investigated. In this paper, we attempt to solve all of them under one roof. Our observation is that rain streaks are bright stripes with higher pixel values that are evenly distributed in each color channel of the rainy image, while the disentanglement of the high-frequency rain streaks is equivalent to decreasing the standard deviation of the pixel distribution of the rainy image. To this end, we propose a self-supervised rain streaks learning network to characterize the similar pixel distribution of the rain streaks from a macroscopic viewpoint over various low-frequency pixels of gray-scale rainy images, coupled with a supervised rain streaks learning network to explore the specific pixel distribution of the rain streaks from a microscopic viewpoint between each pair of rainy and clean images. Building on this, a self-attentive adversarial restoration network is introduced to prevent further blurry edges. These networks compose an end-to-end Macroscopic-and-Microscopic Rain Streaks Disentanglement Network, named $\text{M}^2$RSD-Net, to learn rain streaks, which are then removed for single-image deraining. The experimental results validate its advantages on deraining benchmarks against state-of-the-art methods. The code is available at: https://github.com/xinjiangaohfut/MMRSD-Net

Abstract:
Learning pyramidal feature representations is important for many dense prediction tasks (e.g., object detection, semantic segmentation) that demand multi-scale visual understanding. Feature Pyramid Network (FPN) is a well-known architecture for multi-scale feature learning; however, intrinsic weaknesses in feature extraction and fusion impede the production of informative features. This work addresses the weaknesses of FPN through a novel tripartite feature enhanced pyramid network (TFPN), with three distinct and effective designs. First, we develop a feature reference module with lateral connections to adaptively extract bottom-up features with richer details for feature pyramid construction. Second, we design a feature calibration module between adjacent layers that calibrates the upsampled features to be spatially aligned, allowing for feature fusion with accurate correspondences. Third, we introduce a feature feedback module in FPN, which creates a communication channel from the feature pyramid back to the bottom-up backbone and doubles the encoding capacity, enabling the entire architecture to generate incrementally more powerful representations. The TFPN is extensively evaluated over four popular dense prediction tasks, i.e., object detection, instance segmentation, panoptic segmentation, and semantic segmentation. The results demonstrate that TFPN consistently and significantly outperforms the vanilla FPN. Our code is available at https://github.com/jamesliang819.

Abstract:
Video quality assessment (VQA) has received remarkable attention recently. Most of the popular VQA models employ recurrent neural networks (RNNs) to capture the temporal quality variation of videos. However, each long-term video sequence is commonly labeled with a single quality score, with which RNNs might not be able to learn long-term quality variation well. What is the real role of RNNs in learning the visual quality of videos? Do they learn spatio-temporal representations as expected, or do they just aggregate spatial features redundantly? In this work, we conduct a comprehensive study by training a family of VQA models with carefully designed frame sampling strategies and spatio-temporal fusion methods. Our extensive experiments on four publicly available in-the-wild video quality datasets lead to two main findings. First, the plausible spatio-temporal modeling module (i.e., RNNs) does not facilitate quality-aware spatio-temporal feature learning. Second, sparsely sampled video frames can achieve performance competitive with using all video frames as the input. In other words, spatial features play a vital role in capturing video quality variation for VQA. To the best of our knowledge, this is the first work to explore the issue of spatio-temporal modeling in VQA.
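
As a rough illustration of what sparse frame sampling followed by simple temporal pooling can look like, the sketch below picks evenly spaced frame indices and averages per-frame scores. The function names, the uniform sampling rule, the average pooling, and the synthetic scores are all assumptions made for this example; the frame sampling strategies and fusion methods studied in the abstract above may differ.

```python
import numpy as np


def sparse_frame_indices(num_frames, num_samples):
    """Evenly spaced frame indices covering the whole video (assumed rule)."""
    num_samples = min(num_samples, num_frames)
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)


def pooled_quality(frame_scores, num_samples):
    """Average per-frame quality scores over the sampled frames only."""
    scores = np.asarray(frame_scores, dtype=float)
    idx = sparse_frame_indices(len(scores), num_samples)
    return float(scores[idx].mean())


# Synthetic per-frame scores for a 240-frame clip (not real model outputs).
frame_scores = np.linspace(3.0, 4.0, 240)
dense_score = pooled_quality(frame_scores, 240)  # uses every frame
sparse_score = pooled_quality(frame_scores, 8)   # uses 8 frames
print(dense_score, sparse_score)
```

When per-frame quality varies smoothly, as in this synthetic clip, a handful of sampled frames yields nearly the same pooled score as the full sequence.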

Abstract:
Approximate message passing-based compressive sensing reconstruction has received increasing attention, and its performance depends heavily on the ability of the denoising operator. However, most methods only employ an off-the-shelf denoising model as the denoising operator of the iterative solver, which imposes an unfavorable limit on the reconstruction performance of compressive sensing. To solve the aforementioned issue, we propose a novel versatile denoising-based approximate message passing model, abbreviated as VD-AMP, for compressive sensing (CS) recovery. To be specific, we design a double encoder-decoder denoising network (DEDNet), which delivers impressive performance in Gaussian denoising. Moreover, a fine-grained noise level division (FNLD) solution is proposed to release the potential of the well-designed DEDNet so as to improve the reconstruction performance. However, strengthening the denoiser alone fails to remove the distortion artifacts of reconstructed images at low sampling rates. To alleviate this defect, we propose anti-aliasing sampling (AS), which first maps the input image to a smoothing sub-space using the proposed DEDNet before vanilla sampling, reducing aliasing between high-frequency and low-frequency information in the measurements. Extensive experiments on benchmark datasets demonstrate that the proposed VD-AMP outperforms state-of-the-art CS reconstruction models by a large margin, e.g., gains of up to 2 dB in PSNR.

Abstract:
Convolutional Neural Networks (CNNs) dominate image processing but suffer from local inductive bias, which is addressed by the transformer framework with its inherent ability to capture global context through self-attention mechanisms. However, how to inherit and integrate their advantages to improve compressed sensing is still an open issue. This paper proposes CSformer, a hybrid framework to explore the representation capacity of local and global features. The proposed approach is designed for end-to-end compressive image sensing and is composed of adaptive sampling and recovery. In the sampling module, images are measured block-by-block by the learned sampling matrix. In the reconstruction stage, the measurements are projected into an initialization stem, a CNN stem, and a transformer stem. The initialization stem mimics the traditional reconstruction of compressive sensing but generates the initial reconstruction in a learnable and efficient manner. The CNN stem and transformer stem are concurrent, simultaneously calculating fine-grained and long-range features and efficiently aggregating them. Furthermore, we explore a progressive strategy and a window-based transformer block to reduce the parameters and computational complexity. The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing, which achieves superior performance compared to state-of-the-art methods on different datasets. Our code is available at: https://github.com/Lineves7/CSformer.
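
Block-by-block measurement with a sampling matrix can be sketched as follows. This is a generic block-based compressive sampling example, not the CSformer sampling module: the function name `block_sample`, the 8x8 block size, the 25% sampling ratio, and the random Gaussian matrix standing in for a learned one are all assumptions for illustration.

```python
import numpy as np


def block_sample(image, phi, block=8):
    """Measure each non-overlapping block: y_i = phi @ vec(block_i)."""
    h, w = image.shape
    if h % block or w % block:
        raise ValueError("image size must be a multiple of the block size")
    # Split into blocks, then flatten each block row-major.
    blocks = (
        image.reshape(h // block, block, w // block, block)
        .transpose(0, 2, 1, 3)
        .reshape(-1, block * block)
    )
    return blocks @ phi.T  # shape: (num_blocks, num_measurements)


rng = np.random.default_rng(0)
image = rng.random((32, 32))           # synthetic grayscale image
ratio = 0.25                           # assumed sampling ratio
num_meas = int(ratio * 8 * 8)          # 16 measurements per 64-pixel block
# Random stand-in for a learned sampling matrix.
phi = rng.standard_normal((num_meas, 64)) / np.sqrt(num_meas)

measurements = block_sample(image, phi)
print(measurements.shape)
```

Each row of `measurements` holds the compressed measurements of one block, so the same small matrix is reused for every block regardless of the image size.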

Abstract:
Point cloud semantic segmentation (PCSS), which aims to label a set of points stored in irregular and unordered structures, is an important yet challenging task. It is vital for this task to learn a good representation for each 3D data point, one that encodes rich context knowledge and hierarchical structural information. However, although great success has been achieved by existing PCSS methods, they are limited in making full use of important context information and rich hierarchical features for representation learning. In this paper, we propose to build ‘hyperpoint’ representations for 3D data points via a nested network architecture, which is able to explicitly exploit multi-scale, pyramidally hierarchical features and construct powerful representations for PCSS. In particular, we introduce a PCSS nested architecture search (PCSS-NAS) algorithm to automatically design the model’s side-output branches at different levels as well as its skip-layer structures, enabling the resulting model to best deal with the scale-space problem. Our searched architecture, named Auto-NestedNet, is evaluated on four well-known benchmarks: S3DIS, ScanNet, Semantic3D and Paris-Lille-3D. Experimental results show that the proposed Auto-NestedNet achieves state-of-the-art performance. Our source code is available at https://github.com/fanyang587/NestedNet.

Abstract:
Multi-label image classification is a fundamental but challenging task in computer vision. To tackle the problem, the label-related semantic information is often exploited, but the background context and spatial semantic information of related objects are not fully utilized. To address these issues, a multi-branch deep neural network is proposed in this paper. The first branch is designed to extract the discriminative information from regions of interest to detect target objects. In the second branch, a spatial context-aware approach is proposed to better capture the contextual information of an object in its surroundings by using an adaptive patch expansion mechanism. It helps the detection of small objects that are easily lost without the support of context information. The third one, the object-attentional branch, exploits the spatial semantic relations between the target object and its related objects, to better detect partially occluded, small or dim objects with the support of more easily detectable objects. To better encode such relations, an attention mechanism jointly considering the spatial and semantic relations between objects is developed. Two widely used benchmark datasets for multi-label classification, MS COCO and PASCAL VOC, are used to evaluate the proposed framework. The experimental results demonstrate that the proposed method outperforms the state-of-the-art methods for multi-label image classification.

Abstract:
Current video semantic segmentation tasks involve two main challenges: how to take full advantage of multi-frame context information, and how to improve computational efficiency. To tackle the two challenges simultaneously, we present a novel Multi-Granularity Context Network (MGCNet) by aggregating context information at multiple granularities in a more effective and efficient way. Our method first converts image features into semantic prototypes, and then conducts a non-local operation to aggregate the per-frame and short-term contexts jointly. An additional long-term context module is introduced to capture the video-level semantic information during training. By aggregating both local and global semantic information, a strong feature representation is obtained. The proposed pixel-to-prototype non-local operation requires less computational cost than traditional non-local ones, and is video-friendly since it reuses the semantic prototypes of previous frames. Moreover, we propose an uncertainty-aware and structural knowledge distillation strategy to boost the performance of our method. Experiments on Cityscapes and CamVid datasets with multiple backbones demonstrate that the proposed MGCNet outperforms other state-of-the-art methods with high speed and low latency.
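
To make the cost argument concrete, the following NumPy sketch shows one plausible form of a pixel-to-prototype non-local operation. It is an illustration under stated assumptions, not MGCNet itself: the prototypes are assumed to be assignment-weighted means of pixel features, the coarse class scores `class_logits` are a hypothetical input, and all function names are invented for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def pixel_to_prototype_attention(features, class_logits):
    """features: (N, C) pixel features; class_logits: (N, K) coarse class scores."""
    n, c = features.shape
    assign = softmax(class_logits, axis=1)                  # (N, K) soft assignment
    # Assumed prototype definition: assignment-weighted mean of pixel features.
    weights = assign.sum(axis=0)[:, None] + 1e-8
    prototypes = (assign.T @ features) / weights            # (K, C)
    # Each pixel attends to K prototypes instead of all N pixels.
    attn = softmax(features @ prototypes.T / np.sqrt(c), axis=1)   # (N, K)
    return attn @ prototypes, attn, prototypes

rng = np.random.default_rng(0)
feats = rng.standard_normal((1024, 16))
logits = rng.standard_normal((1024, 5))
out, attn, protos = pixel_to_prototype_attention(feats, logits)
```

The affinity matrix here has N x K entries rather than the N x N of a standard non-local block, and the K prototypes are compact enough to be cached and reused across frames.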

Abstract:
In contrast to image compression, the key to video compression is to efficiently exploit the temporal context to reduce inter-frame redundancy. Existing learned video compression methods generally rely on short-term temporal correlations or image-oriented codecs, which prevents further improvement in coding performance. This paper proposes a novel temporal context-based video compression network (TCVC-Net) for improving the performance of learned video compression. Specifically, a global temporal reference aggregation (GTRA) module is proposed to obtain an accurate temporal reference for motion-compensated prediction by aggregating long-term temporal context. Furthermore, in order to efficiently compress the motion vector and residue, a temporal conditional codec (TCC) is proposed to preserve structural and detailed information by exploiting the multi-frequency components in temporal context. Experimental results show that the proposed TCVC-Net outperforms public state-of-the-art methods in terms of both PSNR and MS-SSIM metrics.

Abstract:
Early activity prediction/recognition aims to recognize action categories before they are fully conveyed. Compared to full-length action sequences, partial video sequences provide insufficient discriminative information, which makes predicting the class labels for some similar activities challenging, especially when only very few frames can be observed. To address this challenge, in this paper, we propose a novel meta negative network, namely, Magi-Net, that utilizes a contrastive learning scheme to alleviate the insufficiency of discriminative information. In our Magi-Net model, the positive samples are generated by augmenting an input anchor conditioned on all observation ratios, while the negative samples are selected from a trainable negative look-up memory (LUM) table, which stores the training samples and the corresponding misleading categories. Furthermore, a meta negative sample optimization strategy (MetaSOS) is proposed to boost the training of Magi-Net by encouraging the model to learn from the most informative negative samples via a meta learning scheme. Extensive experiments are conducted on several public skeleton-based activity datasets, and the results show the efficacy of the proposed Magi-Net model.

Abstract:
Low-rank tensor completion aims to recover the missing entries of multi-way data, and it has become popular and vital in many fields such as signal processing and computer vision. It varies with different tensor decomposition frameworks. Compared with matrix SVD, the recently emerging transform t-SVD can better characterize the low-rank structure of order-3 data. However, it suffers from rotation sensitivity and a dimensional limitation (i.e., it is only effective for order-3 tensors). To alleviate these deficiencies, we develop a novel multiplex transformed tensor decomposition (MTTD) framework, which can characterize the global low-rank structure along all modes for any order-N tensor. Based on MTTD, we propose a related multi-dimensional square model for low-rank tensor completion. Besides, a total variation term is also introduced to utilize the local piecewise smoothness of the tensor data. The classic alternating direction method of multipliers is used to solve the convex optimization problems. For performance testing, we choose three linear invertible transforms including FFT, DCT, and a group of unitary transform matrices for our proposed methods. The simulated and real-data experiments demonstrate the superior recovery accuracy and computational efficiency of our method compared with state-of-the-art ones.

Abstract:
Matching landmark patches from a real-time image captured by an on-vehicle camera with landmark patches in an image database plays an important role in various computer perception tasks for autonomous driving. Current methods focus on local matching for regions of interest and do not take into account spatial neighborhood relationships among the image patches, which typically correspond to objects in the environment. In this paper, we construct a spatial graph with the graph vertices corresponding to patches and edges capturing the spatial neighborhood information. We propose a joint feature and metric learning model with graph-based learning. We provide a theoretical basis for the graph-based loss by showing that the information distance between the distributions conditioned on matched and unmatched pairs is maximized under our framework. We evaluate our model using several street-scene datasets and demonstrate that our approach achieves state-of-the-art matching results.

Abstract:
Interpolation-friendly RGBW color filter arrays (CFAs) and the popular sequential demosaicking embody the idea of computational photography, where the CFA and the demosaicking method are co-designed. Owing to these advantages, interpolation-friendly RGBW CFAs have been extensively used in commercial color cameras. However, most associated demosaicking methods rely on strict assumptions or are limited to a few specific CFAs with a given camera. In this paper, we propose a universal demosaicking method for interpolation-friendly RGBW CFAs, which enables the comparison of different CFAs. Our new method belongs to sequential demosaicking, i.e., the W channel is interpolated first and then the RGB channels are reconstructed with guidance from the interpolated W channel. Specifically, it first interpolates the W channel using only available W pixels followed by an aliasing reduction technique to remove aliasing artifacts. Then it employs an image decomposition model to build relations between the W channel and each of the RGB channels with known RGB values, which can be easily generalized to the full-size demosaicked image. We apply the linearized alternating direction method (LADM) to solve it with a convergence guarantee. Our demosaicking method can be applied to all interpolation-friendly RGBW CFAs with varying color cameras and lighting conditions. Extensive experiments confirm the universal property and advantage of our proposed method with both simulated and real raw images.

Abstract:
Multispectral imaging (MSI) collects a datacube of spatio-spectral information of a scene. Many acquisition methods for spectral imaging use scanning, preventing its widespread usage for dynamic scenes. On the other hand, the conventional color filter array (CFA) method often used to sample color images has also been extended to snapshot MSI using a Multispectral Filter Array (MSFA), which is a mosaic of selective spectral filters placed over the Focal Plane Array (FPA). However, even state-of-the-art MSFA coding patterns produce artifacts and distortions in the reconstructed spectral images, which might be due to the nonoptimal distribution of the spectral filters. To reduce the appearance of artifacts and provide tools for the optimal design of MSFAs, this paper proposes a novel mathematical framework to design MSFAs using a Sphere Packing (SP) approach. By assuming that each sampled filter can be represented by a sphere within the discrete datacube, SP organizes the position of the equal-size and disjoint spheres’ centers in a cubic container. Our method is denoted Multispectral Filter Array by Optimal Sphere Packing (MSFA-OSP), which seeks filter positions that maximize the minimum distance between the spheres’ centers. Simulation results show an image quality improvement of up to 2 dB and a remarkable boost in spectral similarity when using our proposed MSFA design approach for a variety of reconstruction algorithms. Moreover, MSFA-OSP notably reduces the appearance of false colors and zipper effect artifacts, often seen when using state-of-the-art demosaicking algorithms. Experiments using synthetic and real data prove that the proposed MSFA-OSP outperforms state-of-the-art MSFAs in terms of spatial and spectral fidelity. The code that reproduces the figures of this article is available at https://github.com/nelson10/DemosaickingMultispectral3DSpherePacking.git.
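
The max-min criterion mentioned above can be illustrated by scoring a candidate filter pattern with the smallest pairwise distance between its sphere centers. The sketch below is only an assumed reading of that criterion: it treats each filter as a center at (row, column, band index) with unit scaling on every axis, ignores periodic tiling of the pattern, and uses hypothetical names throughout.

```python
import numpy as np
from itertools import combinations

def min_center_distance(pattern):
    """Smallest Euclidean distance between (row, col, band) centers of a pattern."""
    pattern = np.asarray(pattern)
    centers = [(r, c, int(pattern[r, c]))
               for r in range(pattern.shape[0])
               for c in range(pattern.shape[1])]
    return min(float(np.linalg.norm(np.subtract(a, b)))
               for a, b in combinations(centers, 2))

# Two toy 2x2 patterns: entries are band indices (illustrative only).
spread = [[0, 1],
          [2, 3]]
clustered = [[0, 0],
             [1, 1]]

d_spread = min_center_distance(spread)
d_clustered = min_center_distance(clustered)
```

Under this score the `spread` pattern is preferred, because neighboring pixels never sample the same band; a design procedure would search over patterns to maximize this value.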

Abstract:
With the development of deep learning technology, the performance of facial expression recognition (FER) has been significantly improved. The current main challenge comes from the confusion of facial expressions caused by the highly nonlinear changes of facial expressions. However, the existing FER methods based on Convolutional Neural Networks (CNN) often ignore the underlying relationship between expressions, which is crucial to improving recognition performance for confusable expressions. Methods based on Graph Convolutional Networks (GCN) can capture the relationship between vertices, but the aggregation degree of the subgraphs they generate is low. They tend to include unconfident neighbors, which increases the learning difficulty of the network. To solve the above problems, this paper proposes a method to recognize facial expressions on high aggregation subgraphs (HASs) by combining the advantages of CNNs in extracting features and GCNs in modeling complex graph patterns. Specifically, we formulate FER as a vertex prediction problem. Considering the importance of high-order neighbors and the need for higher efficiency, we utilize vertex confidence to find high-order neighbors. Then we construct the HASs based on the top embedding features of these high-order neighbors. Finally, we utilize the GCN to perform reasoning and infer the class of vertices for HASs without a large number of overlapping subgraphs. Our method captures the underlying relationship between expressions on the HASs and improves the accuracy and efficiency of FER. Experimental results on both the in-the-lab datasets and the in-the-wild datasets show that our method achieves higher recognition accuracy than several state-of-the-art methods. This highlights the benefit of the underlying relationship between expressions for FER.

Abstract:
Visual Commonsense Reasoning (VCR), deemed a challenging extension of Visual Question Answering (VQA), pursues higher-level visual comprehension. VCR includes two complementary processes: question answering over a given image and rationale inference for explaining the answer. Over the years, a variety of VCR methods have advanced performance on the benchmark dataset. Despite the significance of these methods, they often treat the two processes separately and hence decompose VCR into two unrelated VQA instances. As a result, the pivotal connection between question answering and rationale inference is broken, rendering existing efforts less faithful to visual reasoning. To study this issue, we perform in-depth empirical explorations in terms of both language shortcuts and generalization capability. Based on our findings, we then propose a plug-and-play knowledge distillation enhanced framework to couple the question answering and rationale inference processes. The key contribution lies in the introduction of a new branch, which serves as a relay to bridge the two processes. Given that our framework is model-agnostic, we apply it to the existing popular baselines and validate its effectiveness on the benchmark dataset. As demonstrated in the experimental results, when equipped with our method, these baselines all achieve consistent and significant performance improvements, verifying the viability of coupling the two processes.

Abstract:
Perception-based image analysis technologies can be used to help visually impaired people take better quality pictures by providing automated guidance, thereby empowering them to interact more confidently on social media. The photographs taken by visually impaired users often suffer from one or both of two kinds of quality issues: technical quality (distortions), and semantic quality, such as framing and aesthetic composition. Here we develop tools to help them minimize occurrences of common technical distortions, such as blur, poor exposure, and noise. We do not address the complementary problems of semantic quality, leaving that aspect for future work. The problem of assessing, and providing actionable feedback on the technical quality of pictures captured by visually impaired users is hard enough, owing to the severe, commingled distortions that often occur. To advance progress on the problem of analyzing and measuring the technical quality of visually impaired user-generated content (VI-UGC), we built a very large and unique subjective image quality and distortion dataset. This new perceptual resource, which we call the LIVE-Meta VI-UGC Database, contains 40K real-world distorted VI-UGC images and 40K patches, on which we recorded 2.7M human perceptual quality judgments and 2.7M distortion labels. Using this psychometric resource we also created an automatic limited vision picture quality and distortion predictor that learns local-to-global spatial quality relationships, achieving state-of-the-art prediction performance on VI-UGC pictures, significantly outperforming existing picture quality models on this unique class of distorted picture data. We also created a prototype feedback system that helps to guide users to mitigate quality issues and take better quality pictures, by creating a multi-task learning framework. The dataset and models can be accessed at: https://github.com/mandal-cv/visimpaired.

Abstract:
Image classification for real-world applications often involves complicated data distributions such as fine-grained and long-tailed. To address the two challenging issues simultaneously, we propose a new regularization technique that yields an adversarial loss to strengthen the model learning. Specifically, for each training batch, we construct an adaptive batch prediction (ABP) matrix and establish its corresponding adaptive batch confusion norm (ABC-Norm). The ABP matrix is a composition of two parts: an adaptive component that class-wise encodes the imbalanced data distribution, and a second component that batch-wise assesses the softmax predictions. The ABC-Norm leads to a norm-based regularization loss, which can be theoretically shown to be an upper bound for an objective function closely related to rank minimization. By coupling with the conventional cross-entropy loss, the ABC-Norm regularization could introduce adaptive classification confusion and thus trigger adversarial learning to improve the effectiveness of model learning. Unlike most state-of-the-art techniques, which solve either fine-grained or long-tailed problems, our method is characterized by its simple and efficient design and, most distinctively, provides a unified solution. In the experiments, we compare ABC-Norm with relevant techniques and demonstrate its efficacy on several benchmark datasets, including (CUB-LT, iNaturalist2018); (CUB, CAR, AIR); and (ImageNet-LT), which respectively correspond to the real-world, fine-grained, and long-tailed scenarios.

Abstract:
Videos contain motions of various speeds. For example, the motions of one’s head and mouth differ in terms of speed: the head is relatively stable, while the mouth moves rapidly as one speaks. Despite this diversity, previous video GANs generate video based on a single unified motion representation without considering the aspect of speed. In this paper, we propose a frequency-based motion representation for video GANs to realize the concept of speed in the video generation process. In detail, we represent motions as continuous sinusoidal signals of various frequencies by introducing a coordinate-based motion generator. We show that, in this case, frequency is highly related to the speed of motion. Based on this observation, we present frequency-aware weight modulation that enables manipulation of motions within a specific range of speed, which could not be achieved with the previous techniques. Extensive experiments validate that the proposed method outperforms state-of-the-art video GANs in terms of generation quality owing to its capability to model motions of various speeds. Furthermore, we also show that our temporally continuous representation enables the synthesis of intermediate and future frames of generated videos.

Abstract:
Unsupervised domain adaptation has limitations when encountering label discrepancy between the source and target domains. While open-set domain adaptation approaches can address situations where the target domain has additional categories, these methods can only detect them but not further classify them. In this paper, we focus on a more challenging setting dubbed Domain Adaptive Zero-Shot Learning (DAZSL), which uses semantic embeddings of class tags as the bridge between seen and unseen classes to learn the classifier for recognizing all categories in the target domain when only the supervision of seen categories in the source domain is available. The main challenge of DAZSL is to perform knowledge transfer across categories and domain styles simultaneously. To this end, we propose a novel end-to-end learning mechanism dubbed Three-way Semantic Consistent Embedding (TSCE) to embed the source domain, target domain, and semantic space into a shared space. Specifically, TSCE learns domain-irrelevant categorical prototypes from the semantic embedding of class tags and uses them as the pivots of the shared space. The source domain features are aligned with the prototypes via their supervised information. On the other hand, the mutual information maximization mechanism is introduced to push the target domain features and prototypes towards each other. In this way, our approach can align domain differences between source and target images, as well as promote knowledge transfer towards unseen classes. Moreover, as there is no supervision in the target domain, the shared space may suffer from the catastrophic forgetting problem. Hence, we further propose a ranking-based embedding alignment mechanism to maintain the consistency between the semantic space and the shared space. Experimental results on both I2AwA and I2WebV clearly validate the effectiveness of our method. Code is available at https://github.com/tiggers23/TSCE-Domain-Adaptive-Zero-Shot-Learning.

Abstract:
Recently, most video-based person re-identification (Re-ID) methods adopt complex models or multi-scale information to explore more discriminative spatio-temporal clues, thus achieving better retrieval accuracy. However, we observe that these approaches incur significantly higher computation costs while yielding only limited performance improvements. Therefore, the overarching goal at this stage is to solve video Re-ID with a better trade-off between accuracy and efficiency, thereby promoting its application in real scenarios. The frequency transform provides the advantages of simplified representation, identification of hidden information, and noise filtering in signal processing. Motivated by this, we treat the complex spatio-temporal feature as a signal and convert it to the frequency domain. By directly analyzing frequency clues, complex feature extraction procedures can be avoided. Specifically, this paper proposes a novel paradigm by categorizing video features into low/high and spatial/temporal frequency information. Then, with the help of the 3D DCT, we theoretically establish the transform equivalence relationship between the spatio-temporal domain and the frequency domain. Finally, this paper proposes a simple and intuitive Frequency Information Disentanglement Network (FIDN) for video Re-ID. By extracting and applying both low- and high-frequency spatio-temporal features in a disentangled way, FIDN achieves a comprehensive and discriminative video representation. Extensive experiments indicate that FIDN reaches state-of-the-art performance with the addition of only one convolution layer to the baseline.
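
As a rough illustration of separating low- and high-frequency spatio-temporal content with a 3D DCT, the NumPy sketch below transforms a (T, H, W) volume, masks the coefficients, and inverts each part. It is not the FIDN architecture: the box-shaped low-frequency mask, the `cutoff` parameter, and the function names are assumptions made for this example.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    mat[0, :] /= np.sqrt(2.0)
    return mat

def dct3(x, inverse=False):
    """Separable 3D DCT (or its inverse) applied along every axis of x."""
    for axis, n in enumerate(x.shape):
        m = dct_matrix(n)
        if inverse:
            m = m.T                      # orthonormal, so the inverse is the transpose
        x = np.moveaxis(np.tensordot(m, x, axes=(1, axis)), 0, axis)
    return x

def split_frequencies(x, cutoff):
    """Split x into low- and high-frequency parts using a box mask (assumed)."""
    coeffs = dct3(x)
    t, h, w = np.meshgrid(*[np.arange(n) for n in x.shape], indexing="ij")
    low_mask = (t < cutoff[0]) & (h < cutoff[1]) & (w < cutoff[2])
    low = dct3(coeffs * low_mask, inverse=True)
    high = dct3(coeffs * ~low_mask, inverse=True)
    return low, high

rng = np.random.default_rng(0)
volume = rng.standard_normal((4, 8, 8))           # (T, H, W) feature volume
low, high = split_frequencies(volume, cutoff=(2, 4, 4))
```

Because the transform is linear and orthonormal, the two parts sum back to the original volume, so the split loses no information.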

Abstract:
It has long been recognized that the standard convolution is not rotation equivariant and thus not appropriate for downside fisheye images, which are rotationally symmetric. This paper introduces Rotational Convolution, a novel convolution that rotates the convolution kernel according to the characteristics of downside fisheye images. With the four rotation states of the convolution kernel, Rotational Convolution can be implemented on discrete signals. Rotational Convolution markedly improves the performance of different networks in semantic segmentation and object detection, at only a slight cost in inference speed. Finally, we demonstrate our method’s numerical accuracy, computational efficiency, and effectiveness on the public segmentation dataset THEODORE and our self-built detection dataset SEU-fisheye. Our code is available at: https://github.com/wx19941204/Rotational-Convolution-for-downside-fisheye-images.
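
One way to picture four rotation states on a discrete grid is to pick, for each pixel, one of the four 90-degree rotations of a kernel according to the pixel's angular position around the image center. The NumPy sketch below does exactly that and nothing more; the sector boundaries, the rotation direction, and the function names are assumptions for illustration and are not taken from the authors' code.

```python
import numpy as np

def correlate2d_same(image, kernel):
    """Zero-padded 2D cross-correlation with 'same' output size (odd kernels)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out

def rotational_filter(image, kernel):
    """Apply np.rot90(kernel, k) where k depends on each pixel's angular sector."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    angle = np.arctan2(ys - (h - 1) / 2.0, xs - (w - 1) / 2.0)
    # Quantise the angle into four 90-degree sectors (assumed boundaries).
    state = np.floor((angle + np.pi / 4) / (np.pi / 2)).astype(int) % 4
    out = np.zeros((h, w))
    for k in range(4):
        response = correlate2d_same(image, np.rot90(kernel, k))
        out[state == k] = response[state == k]
    return out, state

rng = np.random.default_rng(0)
image = rng.random((9, 9))
kernel = np.arange(9, dtype=float).reshape(3, 3)
filtered, state = rotational_filter(image, kernel)
```

For a kernel that is unchanged by 90-degree rotation (for example, an all-ones kernel), this reduces to an ordinary convolution, which gives a quick sanity check.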

Abstract:
In this paper, we propose a scribble-based video colorization network with temporal aggregation called SVCNet. It can colorize monochrome videos based on different user-given color scribbles. It addresses three common issues in the scribble-based video colorization area: colorization vividness, temporal consistency, and color bleeding. To improve the colorization quality and strengthen the temporal consistency, we adopt two sequential sub-networks in SVCNet for precise colorization and temporal smoothing, respectively. The first stage includes a pyramid feature encoder to incorporate color scribbles with a grayscale frame, and a semantic feature encoder to extract semantics. The second stage finetunes the output from the first stage by aggregating the information of neighboring colorized frames (as short-range connections) and the first colorized frame (as a long-range connection). To alleviate the color bleeding artifacts, we learn video colorization and segmentation simultaneously. Furthermore, we perform the majority of operations at a fixed small image resolution and use a Super-resolution Module at the tail of SVCNet to recover the original size. This allows SVCNet to handle different image resolutions at inference time. Finally, we evaluate the proposed SVCNet on DAVIS and Videvo benchmarks. The experimental results demonstrate that SVCNet produces both higher-quality and more temporally consistent videos than other well-known video colorization approaches. The codes and models can be found at https://github.com/zhaoyuzhi/SVCNet.

Abstract:
Semi-supervised dense prediction tasks, such as semantic segmentation, can be greatly improved through the use of contrastive learning. However, this approach presents two key challenges: selecting informative negative samples from a highly redundant pool and implementing effective data augmentation. To address these challenges, we present an adversarial contrastive learning method specifically for semi-supervised semantic segmentation. Direct learning of adversarial negatives is adopted to retain discriminative information from the past, leading to higher learning efficiency. Our approach also leverages an advanced data augmentation strategy called AdverseMix, which combines information from under-performing classes to generate more diverse and challenging samples. Additionally, we use auxiliary labels and classifiers to prevent over-adversarial negatives from affecting the learning process. Our experiments on the Pascal VOC and Cityscapes datasets demonstrate that our method outperforms the state-of-the-art by a significant margin, even when using a small fraction of labeled data.

Abstract:
Sketch is a well-researched topic in the vision community by now. Sketch semantic segmentation, in particular, serves as a fundamental step towards finer-level sketch interpretation. Recent works use various means of extracting discriminative features from sketches and have achieved considerable improvements in segmentation accuracy. Common approaches for this include attending to the sketch-image as a whole, its stroke-level representation, or the sequence information embedded in it. However, they mostly focus on only a part of such multi-facet information. In this paper, we demonstrate for the first time that there is complementary information to be explored across all three of these facets of sketch data, and that segmentation performance benefits from exploring such sketch-specific information. Specifically, we propose the Sketch-Segformer, a transformer-based framework for sketch semantic segmentation that inherently treats sketches as stroke sequences rather than pixel maps. In particular, Sketch-Segformer introduces two types of self-attention modules having similar structures that work with different receptive fields (i.e., whole sketch or individual stroke). The order embedding is then further synergized with spatial embeddings learned from the entire sketch as well as localized stroke-level information. Extensive experiments show that our sketch-specific design is not only able to obtain state-of-the-art performance on traditional figurative sketches (such as SPG, SketchSeg-150K datasets), but also performs well on creative sketches that do not conform to conventional object semantics (CreativeSketch dataset) thanks to our use of multi-facet sketch information. Ablation studies, visualizations, and invariance tests further justify our design choice and the effectiveness of Sketch-Segformer. Codes are available at https://github.com/PRIS-CV/Sketch-SF.

Abstract:
In this paper, we introduce a new algorithm based on archetypal analysis for blind hyperspectral unmixing, assuming linear mixing of endmembers. Archetypal analysis is a natural formulation for this task. This method does not require the presence of pure pixels (i.e., pixels containing a single material) but instead represents endmembers as convex combinations of a few pixels present in the original hyperspectral image. Our approach leverages an entropic gradient descent strategy, which (i) provides better solutions for hyperspectral unmixing than traditional archetypal analysis algorithms, and (ii) leads to efficient GPU implementations. Since running a single instance of our algorithm is fast, we also propose an ensembling mechanism along with an appropriate model selection procedure that make our method robust to hyper-parameter choices while keeping the computational complexity reasonable. By using six standard real datasets, we show that our approach outperforms state-of-the-art matrix factorization and recent deep learning methods. We also provide an open-source PyTorch implementation: https://github.com/inria-thoth/EDAA.
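
An entropic (exponentiated-gradient) descent step on simplex-constrained variables can be sketched in a few lines of NumPy. The example below is a simplified illustration and not the EDAA implementation: it only estimates abundances for fixed, known endmembers `Z`, whereas the method described above also learns the endmembers as convex combinations of pixels; the step size, iteration count, and toy data are assumptions.

```python
import numpy as np

def entropic_descent_abundances(X, Z, step=0.5, iters=300):
    """Minimise 0.5 * ||X - Z A||_F^2 with each column of A on the simplex."""
    p, n = Z.shape[1], X.shape[1]
    A = np.full((p, n), 1.0 / p)            # uniform start inside the simplex
    for _ in range(iters):
        grad = Z.T @ (Z @ A - X)
        # Multiplicative update in log space, then renormalise each column.
        logits = np.log(np.maximum(A, 1e-300)) - step * grad
        logits -= logits.max(axis=0, keepdims=True)   # numerical stability
        A = np.exp(logits)
        A /= A.sum(axis=0, keepdims=True)
    return A

# Toy data: 3 spectral bands, 2 endmembers, 3 pixels (all values illustrative).
Z = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
A_true = np.array([[0.8, 0.3, 0.5],
                   [0.2, 0.7, 0.5]])
X = Z @ A_true                               # linear mixing model
A_est = entropic_descent_abundances(X, Z)
```

The multiplicative update keeps every abundance non-negative and every column summing to one without an explicit projection step, and it consists only of dense matrix products and element-wise operations.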

Abstract:
Existing low-light video enhancement methods are dominated by Convolutional Neural Networks (CNNs) that are trained in a supervised manner. Due to the difficulty of collecting paired dynamic low/normal-light videos in real-world scenes, they are usually trained on synthetic, static, and uniform motion videos, which undermines their generalization to real-world scenes. Additionally, these methods typically suffer from temporal inconsistency (e.g., flickering artifacts and motion blurs) when handling large-scale motions, since the local perception property of CNNs prevents them from modeling long-range dependencies in both spatial and temporal domains. To address these problems, we propose what is, to the best of our knowledge, the first unsupervised method for low-light video enhancement, named LightenFormer, which models long-range intra- and inter-frame dependencies with a spatial-temporal co-attention transformer to enhance brightness while maintaining temporal consistency. Specifically, an effective but lightweight S-curve Estimation Network (SCENet) is first proposed to estimate pixel-wise S-shaped non-linear curves (S-curves) to adaptively adjust the dynamic range of an input video. Next, to model the temporal consistency of the video, we present a Spatial-Temporal Refinement Network (STRNet) to refine the enhanced video. The core module of STRNet is a novel Spatial-Temporal Co-attention Transformer (STCAT), which exploits multi-scale self- and cross-attention interactions to capture long-range correlations in both spatial and temporal domains among frames for implicit motion estimation. To achieve unsupervised training, we further propose two non-reference loss functions based on the invertibility of the S-curve and the noise independence among frames. Extensive experiments on the SDSD and LLIV-Phone datasets demonstrate that our LightenFormer outperforms state-of-the-art methods.

Abstract:
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames. State-of-the-art approaches usually adopt a two-step solution, which includes 1) generating locally-warped pixels by calculating the optical flow based on pre-defined motion patterns (e.g., uniform motion, symmetric motion), and 2) blending the warped pixels to form a full frame through deep neural synthesis networks. However, for various complicated motions (e.g., non-uniform motion, turning around), such improper assumptions about pre-defined motion patterns introduce inconsistent warping from the two consecutive frames. As a result, the warped features for new frames are usually not aligned, yielding distortion and blur, especially when large and complex motions occur. To solve this issue, in this paper we propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI). In particular, we formulate the warped features with inconsistent motions as query tokens, and formulate relevant regions in a motion trajectory from two original consecutive frames into keys and values. Self-attention is learned on relevant tokens along the trajectory to blend the pristine features into intermediate frames through end-to-end training. Experimental results demonstrate that our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks. Both code and pre-trained models will be released at https://github.com/ChengxuLiu/TTVFI.

Abstract:
Graph convolutional networks have been widely applied in skeleton-based gait recognition. A key challenge in this task is to distinguish the individual walking styles of different subjects across various views. Existing state-of-the-art methods employ uniform convolutions to extract features from diverse sequences and ignore the effects of viewpoint changes. To overcome these limitations, we propose a condition-adaptive graph (CAG) convolution network that can dynamically adapt to the specific attributes of each skeleton sequence and the corresponding view angle. In contrast to using fixed weights for all joints and sequences, we introduce a joint-specific filter learning (JSFL) module in the CAG method, which produces sequence-adaptive filters at the joint level. The adaptive filters capture fine-grained patterns that are unique to each joint, enabling the extraction of diverse spatial-temporal information about body parts. Additionally, we design a view-adaptive topology learning (VATL) module that generates adaptive graph topologies. These graph topologies are used to correlate the joints adaptively according to the specific view conditions. Thus, CAG can simultaneously adjust to various walking styles and viewpoints. Experiments on the two most widely used datasets (i.e., CASIA-B and OU-MVLP) show that CAG surpasses all previous skeleton-based methods. Moreover, the recognition performance can be enhanced by simply combining CAG with appearance-based methods, demonstrating the ability of CAG to provide useful complementary information.

Abstract:
Although adversarial examples pose a serious threat to deep neural networks, most transferable adversarial attacks are ineffective against black-box defense models. This may lead to the mistaken belief that adversarial examples are not truly threatening. In this paper, we propose a novel transferable attack that can defeat a wide range of black-box defenses and highlight their security limitations. We identify two intrinsic reasons why current attacks may fail, namely data-dependency and network-overfitting. They provide a different perspective on improving the transferability of attacks. To mitigate the data-dependency effect, we propose the Data Erosion method. It involves finding special augmentation data that behave similarly in both vanilla models and defenses, helping attackers fool robustified models with a higher chance of success. In addition, we introduce the Network Erosion method to overcome the network-overfitting dilemma. The idea is conceptually simple: it extends a single surrogate model to an ensemble structure with high diversity, resulting in more transferable adversarial examples. The two proposed methods can be integrated to further enhance transferability; the combination is referred to as Erosion Attack (EA). We evaluate the proposed EA against different defenses, and the empirical results demonstrate the superiority of EA over existing transferable attacks and reveal the underlying threat to current robust models. The source code is publicly available at https://github.com/mesunhlf/EA.

Abstract:
Spectral super-resolution, which aims to generate hyperspectral images from RGB images, has recently attracted research attention. However, most of the existing spectral super-resolution algorithms work in a supervised manner, requiring pairwise data for training, which is difficult to obtain. In this paper, we propose an Unmixing Guided Unsupervised Network (UnGUN), which does not require pairwise imagery to achieve unsupervised spectral super-resolution. In addition, UnGUN utilizes arbitrary other hyperspectral imagery as the guidance image to guide the reconstruction of spectral information. UnGUN mainly includes three branches: two unmixing branches and a reconstruction branch. The hyperspectral unmixing branch and the RGB unmixing branch decompose the guidance and RGB images into corresponding endmembers and abundances, respectively, from which the spectral and spatial priors are extracted. Meanwhile, the reconstruction branch integrates the above spectral-spatial priors to generate a coarse hyperspectral image and then refines it. Besides, we design a discriminator to ensure that the distribution of the generated image is close to that of the guidance hyperspectral imagery, so that the reconstructed image follows the characteristics of a real hyperspectral image. The major contribution is that we develop an unsupervised framework based on spectral unmixing, which realizes spectral super-resolution without paired hyperspectral-RGB images. Experiments demonstrate the superiority of UnGUN when compared with some SOTA methods.

Abstract:
Weakly supervised person search involves training a model with only bounding box annotations, without human-annotated identities. Clustering algorithms are commonly used to assign pseudo-labels to facilitate this task. However, inaccurate pseudo-labels and imbalanced identity distributions can result in severe label and sample noise. In this work, we propose a novel Collaborative Contrastive Refining (CCR) weakly-supervised framework for person search that jointly refines pseudo-labels and the sample-learning process with different contrastive strategies. Specifically, we adopt a hybrid contrastive strategy that leverages both visual and context clues to refine pseudo-labels, and leverage the sample-mining and noise-contrastive strategy to reduce the negative impact of imbalanced distributions by distinguishing positive samples and noise samples. Our method brings two main advantages: 1) it facilitates better clustering results for refining pseudo-labels by exploring the hybrid similarity; 2) it is better at distinguishing query samples and noise samples for refining the sample-learning process. Extensive experiments demonstrate the superiority of our approach over the state-of-the-art weakly supervised methods by a large margin (more than 3% mAP on CUHK-SYSU). Moreover, by leveraging more diverse unlabeled data, our method achieves comparable or even better performance than the state-of-the-art supervised methods.

Abstract:
For the long-term person re-identification (ReID) task, pedestrians are likely to change clothes, which poses a key challenge in overcoming drastic appearance variations caused by these cloth changes. However, analyzing how cloth changes influence identity-invariant representation learning is difficult. In this context, varying cloth-changed samples are not adaptively utilized, and their effects on the resulting features are overshadowed. To address these limitations, this paper aims to estimate the effect of cloth-changing patterns at both the image and feature levels, presenting a Dual-Level Adaptive Weighting (DLAW) solution. Specifically, at the image level, we propose an adaptive mining strategy to locate the cloth-changed regions for each identity. This strategy highlights the informative areas that have undergone changes, enhancing robustness against cloth variations. At the feature level, we estimate the degree of cloth-changing by modeling the correlation of part-level features and re-weighting identity-invariant feature components. This further eliminates the effects of cloth variations at the semantic body part level. Extensive experiments demonstrate that our method achieves promising performance on several cloth-changing datasets. Code and models are available at https://github.com/fountaindream/DLAW.

Abstract:
In blurry images, the degree of blur may vary drastically due to different factors, such as varying speeds of shaking cameras and moving objects, as well as defects of the camera lens. However, current end-to-end models fail to explicitly take into account such diversity of blurs. This unawareness compromises the specialization at each blur level, yielding sub-optimal deblurred images as well as redundant post-processing. Therefore, how to specialize one model simultaneously at different blur levels, while still ensuring coverage and generalization, becomes an emerging challenge. In this work, we propose Ada-Deblur, a super-network that can be applied to a “broad spectrum” of blur levels with no re-training on novel blurs. To balance individual blur-level specialization against wide-range blur-level coverage, the key idea is to dynamically adapt the network architectures from a single well-trained super-network structure, targeting flexible image processing with different deblurring capacities at test time. Extensive experiments demonstrate that our work outperforms strong baselines, achieving better reconstruction accuracy while incurring minimal computational overhead. Besides, we show that our method is effective for both synthetic and realistic blurs compared to these baselines. The performance gap between our model and the state-of-the-art becomes more prominent when testing with unseen and strong blur levels. Specifically, our model demonstrates surprising deblurring performance on these images with PSNR improvements of around 1 dB. Our code is publicly available at https://github.com/wuqiuche/Ada-Deblur.

Abstract:
Depth data with a predominance of discriminative power in location is advantageous for accurate salient object detection (SOD). Existing RGBD SOD methods have focused on how to properly use depth information for complementary fusion with RGB data, having achieved great success. In this work, we attempt a far more ambitious use of the depth information by injecting the depth maps into the encoder in a single-stream model. Specifically, we propose a depth injection framework (DIF) equipped with an Injection Scheme (IS) and a Depth Injection Module (DIM). The proposed IS enhances the semantic representation of the RGB features in the encoder by directly injecting depth maps into the high-level encoder blocks, while helping our model maintain computational convenience. Our proposed DIM acts as a bridge between the depth maps and the hierarchical RGB features of the encoder and helps the information of two modalities complement and guide each other, contributing to a great fusion effect. Experimental results demonstrate that our proposed method can achieve state-of-the-art performance on six RGBD datasets. Moreover, our method can achieve excellent performance on RGBT SOD and our DIM can be easily applied to single-stream SOD models and the transformer architecture, proving a powerful generalization ability.

Abstract:
Visual Question Answering (VQA) is fundamentally compositional in nature, and many questions are simply answered by decomposing them into modular sub-problems. The recently proposed Neural Module Networks (NMNs) apply this strategy to question answering, but they rely heavily on off-the-shelf layout parsers or additional expert policies for network architecture design instead of learning from the data. These strategies result in unsatisfactory adaptability to the semantically complicated variance of the inputs, thereby hindering the representational capacity and generalizability of the model. To tackle this problem, we propose a Semantic-aware modUlar caPsulE Routing framework, termed SUPER, to better capture the instance-specific vision-semantic characteristics and refine the discriminative representations for prediction. In particular, five powerful specialized modules as well as dynamic routers are tailored in each layer of the SUPER network, and compact routing spaces are constructed such that a variety of customizable routes can be sufficiently exploited and the vision-semantic representations can be explicitly calibrated. We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme, as well as its parameter-efficiency advantage, over five benchmark datasets. It is worth emphasizing that this work does not aim to pursue state-of-the-art results in VQA. Instead, we expect our model to provide a novel perspective on architecture learning and representation calibration for VQA.

Abstract:
The goal of few-shot image recognition is to classify different categories with only one or a few training samples. Previous works on few-shot learning mainly focus on simple images, such as object or character images. Those works usually use a convolutional neural network (CNN) to learn the global image representations from training tasks, which are then adapted to novel tasks. However, there are many more abstract and complex images in the real world, such as scene images, consisting of many object entities with flexible spatial relations among them. In such cases, global features can hardly obtain satisfactory generalization ability due to the large diversity of object relations in the scenes, which may hinder the adaptability to novel scenes. This paper proposes a composite object relation modeling method for few-shot scene recognition, capturing the spatial structural characteristics of scene images to enhance adaptability to novel scenes, considering that objects commonly co-occur in different scenes. In different few-shot scene recognition tasks, the objects in the same images usually play different roles. Thus, we propose a task-aware region selection module (TRSM) to further select the detected regions in different few-shot tasks. In addition to detecting object regions, we mainly focus on exploiting the relations between objects, which are more consistent with the scenes and can be used to separate different scenes. Objects and relations are used to construct a graph in each image, which is then modeled with a graph convolutional neural network. The graph modeling is jointly optimized with few-shot recognition, where the loss of few-shot learning is also capable of adjusting graph-based representations. Typically, the proposed graph-based representations can be plugged into different types of few-shot architectures, such as metric-based and meta-learning methods. Experimental results of few-shot scene recognition show the effectiveness of the proposed method.

Abstract:
With the rapid development of generative adversarial networks, face photo-sketch synthesis has achieved promising performance and is playing an increasingly important role in law enforcement as well as entertainment. However, most of the existing methods only work under the condition of no interference, and lack generalization ability in wild scenes. The fidelity of the images generated by the existing methods is insufficient, and the ability to manipulate them according to a text description is unavailable. Directly applying existing text-based image manipulation methods to the face photo-sketch scenario may lead to severe distortions due to the cross-domain challenges. Therefore, we propose a novel cross-domain face photo-sketch synthesis framework named HiFiSketch, a network that learns to adjust the weights of generators for high-fidelity synthesis and manipulation. It can translate images between the photo domain and the sketch domain, and at the same time modify the results according to the text input. We further propose a cross-domain loss function, which can effectively preserve facial details during face photo-sketch synthesis. Extensive experiments on four public face sketch datasets show the superiority of our method compared to existing methods. We further present text-based face photo-sketch manipulation and sequential face photo-sketch manipulation for the first time to demonstrate the effectiveness of our method on high-fidelity face photo-sketch synthesis and manipulation.

Abstract:
Remarkable achievements have been obtained with binary neural networks (BNN) in real-time and energy-efficient single-image super-resolution (SISR) methods. However, existing approaches often adopt the Sign function to quantize image features while ignoring the influence of image spatial frequency. We argue that we can minimize the quantization error by considering different spatial frequency components. To achieve this, we propose a frequency-aware binarized network (FABNet) for single image super-resolution. First, we leverage the wavelet transformation to decompose the features into low-frequency and high-frequency components and then employ a “divide-and-conquer” strategy to separately process them with well-designed binary network structures. Additionally, we introduce a dynamic binarization process that incorporates learned-threshold binarization during forward propagation and dynamic approximation during backward propagation, effectively addressing the diverse spatial frequency information. Compared to existing methods, our approach is effective in reducing quantization error and recovering image textures. Extensive experiments conducted on four benchmark datasets demonstrate that the proposed method surpasses state-of-the-art approaches in terms of PSNR and visual quality with significantly reduced computational costs. Our code is available at https://github.com/xrjiang527/FABNet-PyTorch.
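
The frequency split followed by per-band binarization can be sketched on a 1D toy signal. This is only an illustration under stated assumptions: a one-level Haar transform stands in for the wavelet transformation, the thresholds are hand-picked constants rather than learned values, and the names `haar_1d` and `binarize` are hypothetical. It does not reproduce FABNet's network structures or its backward-pass approximation.

```python
def haar_1d(signal):
    """One-level Haar transform: pairwise averages (low band) and half-differences (high band)."""
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    return low, high

def binarize(values, threshold=0.0):
    """Map each value to +1 or -1; threshold=0.0 reduces to the plain Sign function."""
    return [1 if v >= threshold else -1 for v in values]

# Toy feature row (assumed values, even length).
features = [1.0, 0.5, 0.25, 0.75, -0.5, -0.25, 0.5, 0.5]
low, high = haar_1d(features)

# Each band gets its own threshold; in a trained network these would be learned.
low_bits = binarize(low, threshold=0.6)
high_bits = binarize(high, threshold=-0.2)
```

Using separate thresholds per band lets the low-frequency and high-frequency components, whose value ranges differ, be quantized around different operating points instead of a single zero crossing.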

Abstract:
Action Quality Assessment (AQA), which is applied to evaluate the quality of specific actions, e.g., sports activities, plays an important role in video analysis. However, it is still challenging because there are many small action discrepancies with similar backgrounds, while current approaches mostly adopt holistic video representations, so fine-grained intra-class variations cannot be captured. To address the aforementioned challenge, we propose a Fine-grained Spatio-temporal Parsing Network (FSPN), which is composed of an intra-sequence action parsing module and a spatiotemporal multiscale transformer module, to learn fine-grained spatiotemporal sub-action representations for more reliable AQA. The intra-sequence action parsing module performs semantic sub-action parsing by mining sub-actions at fine-grained levels. It enables a correct description of the subtle differences between action sequences. The spatiotemporal multiscale transformer module learns motion-oriented action features and obtains their long-range dependencies among sub-actions at different scales. Furthermore, we design a group contrastive loss to train the model and learn more discriminative feature representations for sub-actions without explicit supervision. We exhaustively evaluate our proposed approach on the FineDiving, AQA-7, and MTL-AQA datasets. Extensive experimental results demonstrate the effectiveness and feasibility of our proposed approach, which outperforms the state-of-the-art methods by a significant margin.

Abstract:
Transformer-based and interaction point-based methods have demonstrated promising performance and potential in human-object interaction (HOI) detection. However, due to differences in structure and properties, direct integration of these two types of models is not feasible. Recent Transformer-based methods divide the decoder into two branches: an instance decoder for human-object pair detection and a classification decoder for interaction recognition. While the attention mechanism within the Transformer enhances the connection between localization and classification, this paper focuses on further improving HOI detection performance by increasing the intrinsic correlation between instance and action features. To address these challenges, this paper proposes a novel Transformer-based HOI detection framework. In the proposed method, the decoder contains three parts: a learnable query generator, an instance decoder, and an interaction classifier. The learnable query generator aims to build an effective query to guide the instance decoder and interaction classifier to learn more accurate instance and interaction features. These features are then applied to update the query generator for the next layer. In particular, inspired by interaction point-based HOI and object detection methods, this paper introduces prior bounding boxes, keypoint detection, and spatial relation features to build the novel learnable query generator. Finally, the proposed method is verified on the HICO-DET and V-COCO datasets. The experimental results show that the proposed method outperforms the state-of-the-art methods.

Abstract:
Self-supervised space-time correspondence learning utilizing unlabeled videos holds great potential in computer vision. Most existing methods rely on contrastive learning with negative-sample mining or on adapting reconstruction from the image domain, which requires dense affinity across multiple frames or optical flow constraints. Moreover, video correspondence prediction models need to uncover more inherent properties of the video, such as structural information. In this work, we propose HiGraph+, a sophisticated space-time correspondence framework based on learnable graph kernels. By treating videos as a spatial-temporal graph, the learning objective of HiGraph+ is formulated in a self-supervised manner, predicting the unobserved hidden graph via graph kernel methods. First, we learn the structural consistency of sub-graphs in graph-level correspondence learning. Furthermore, we introduce a spatio-temporal hidden graph loss through contrastive learning that facilitates learning temporal coherence across frames of sub-graphs and spatial diversity within the same frame. Therefore, we can predict long-term correspondences and drive the hidden graph to acquire distinct local structural representations. Then, we learn a refined representation across frames at the node level via a dense graph kernel. The structural and temporal consistency of the graph forms the self-supervision of model training. HiGraph+ achieves excellent performance and demonstrates robustness in benchmark tests involving object, semantic part, keypoint, and instance labeling propagation tasks. Our algorithm implementations have been made publicly available at https://github.com/zyqin19/HiGraph.

Abstract:
Recently, distributed learning approaches have been studied for using data from multiple sources without sharing them, but they are not usually suitable in applications where each client carries out different tasks. Meanwhile, Transformer has been widely explored in computer vision area due to its capability to learn the common representation through global attention. By leveraging the advantages of Transformer, here we present a new distributed learning framework for multiple image processing tasks, allowing clients to learn distinct tasks with their local data. This arises from a disentangled representation of local and non-local features using a task-specific head/tail and a task-agnostic Vision Transformer. Each client learns a translation from its own task to a common representation using the task-specific networks, while the Transformer body on the server learns global attention between the features embedded in the representation. To enable decomposition between the task-specific and common representations, we propose an alternating training strategy between clients and server. Experimental results on distributed learning for various tasks show that our method synergistically improves the performance of each client with its own data.

Abstract:
We formulate a physics-informed compressed sensing (PICS) method for the reconstruction of velocity fields from noisy and sparse phase-contrast magnetic resonance signals. The method solves an inverse Navier-Stokes boundary value problem, which permits us to jointly reconstruct and segment the velocity field, and at the same time infer hidden quantities such as the hydrodynamic pressure and the wall shear stress. Using a Bayesian framework, we regularize the problem by introducing a priori information about the unknown parameters in the form of Gaussian random fields. This prior information is updated using the Navier-Stokes problem, an energy-based segmentation functional, and by requiring that the reconstruction is consistent with the k-space signals. We create an algorithm that solves this inverse problem, and test it for noisy and sparse k-space signals of the flow through a converging nozzle. We find that the method is capable of reconstructing and segmenting the velocity fields from sparsely-sampled (15% k-space coverage), low (~10) signal-to-noise ratio (SNR) signals, and that the reconstructed velocity field compares well with that derived from fully-sampled (100% k-space coverage) high (>40) SNR signals of the same flow.

Abstract:
Conventional stereoscopic displays suffer from vergence-accommodation conflict and cause visual fatigue. Integral-imaging-based displays resolve the problem by directly projecting the sub-aperture views of a light field into the eyes using a microlens array or a similar structure. However, such displays have an inherent trade-off between angular and spatial resolutions. In this paper, we propose a novel coded time-division multiplexing technique that projects encoded sub-aperture views to the eyes of a viewer with correct cues for the vergence-accommodation reflex. Given sparse light field sub-aperture views, our pipeline can provide a perception of high-resolution refocused images with minimal aliasing by jointly optimizing the sub-aperture views for display and the coded aperture pattern. This is achieved via deep learning in an end-to-end fashion by simulating light transport and image formation with Fourier optics. To our knowledge, this work is among the first to optimize the light field display pipeline with deep learning. We verify our idea with objective image quality metrics (PSNR, SSIM, and LPIPS) and perform an extensive study on various customizable design variables in our display pipeline. Experimental results show that light fields displayed using the proposed technique indeed have higher quality than those of baseline display designs.

Abstract:
Learning-based infrared small object detection methods currently rely heavily on the classification backbone network. This tends to result in tiny object loss and feature distinguishability limitations as the network depth increases. Furthermore, small objects in infrared images frequently appear either bright or dark, posing severe demands for obtaining precise object contrast information. For this reason, in this paper we propose a simple and effective “U-Net in U-Net” framework, UIU-Net for short, to detect small objects in infrared images. As the name suggests, UIU-Net embeds a tiny U-Net into a larger U-Net backbone, enabling the multi-level and multi-scale representation learning of objects. Moreover, UIU-Net can be trained from scratch, and the learned features can enhance global and local contrast information effectively. More specifically, the UIU-Net model is divided into two modules: the resolution-maintenance deep supervision (RM-DS) module and the interactive-cross attention (IC-A) module. RM-DS integrates Residual U-blocks into a deep supervision network to generate deep multi-scale resolution-maintenance features while learning global context information. Further, IC-A encodes the local context information between the low-level details and high-level semantic features. Extensive experiments conducted on two infrared single-frame image datasets, i.e., SIRST and Synthetic datasets, show the effectiveness and superiority of the proposed UIU-Net in comparison with several state-of-the-art infrared small object detection methods. The proposed UIU-Net also shows strong generalization performance on video sequence infrared small object datasets, e.g., the ATR ground/air video sequence dataset. The codes of this work are available openly at https://github.com/danfenghong/IEEE

Abstract:
The ability to capture joint connections in complicated motion is essential for skeleton-based action recognition. However, earlier approaches may not be able to fully explore this connection in either the spatial or temporal dimension due to fixed or single-level topological structures and insufficient temporal modeling. In this paper, we propose a novel multilevel spatial-temporal excited graph network (ML-STGNet) to address the above problems. In the spatial configuration, we decouple the learning of the human skeleton into general and individual graphs by designing a multilevel graph convolution (ML-GCN) network and a spatial data-driven excitation (SDE) module, respectively. ML-GCN leverages joint-level, part-level, and body-level graphs to comprehensively model the hierarchical relations of a human body. Based on this, SDE is further introduced to handle the diverse joint relations of different samples in a data-dependent way. This decoupling approach not only increases the flexibility of the model for graph construction but also enables the generality to adapt to various data samples. In the temporal configuration, we apply the concept of temporal difference to the human skeleton and design an efficient temporal motion excitation (TME) module to highlight the motion-sensitive features. Furthermore, a simplified multiscale temporal convolution (MS-TCN) network is introduced to enrich the expression ability of temporal features. Extensive experiments on the four popular datasets NTU-RGB+D, NTU-RGB+D 120, Kinetics Skeleton 400, and Toyota Smarthome demonstrate that ML-STGNet gains considerable improvements over the existing state of the art.
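
The temporal-difference idea mentioned in this abstract can be sketched as frame-to-frame displacement of joint coordinates. The sketch below is an assumption-laden illustration, not the TME module: the function name `temporal_difference`, the toy skeleton, and the per-joint "motion energy" summary are all hypothetical.

```python
def temporal_difference(sequence):
    """Frame-to-frame displacement of every joint coordinate.

    sequence[t][j] is the coordinate list of joint j at frame t;
    the result has one fewer frame than the input.
    """
    return [
        [[c1 - c0 for c0, c1 in zip(joint0, joint1)]
         for joint0, joint1 in zip(frame0, frame1)]
        for frame0, frame1 in zip(sequence, sequence[1:])
    ]

# Toy skeleton: 3 frames, 2 joints, (x, y) coordinates.
# Joint 0 is static; joint 1 moves along x.
skeleton = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [1.5, 1.0]],
    [[0.0, 0.0], [2.5, 1.0]],
]
motion = temporal_difference(skeleton)

# Per-joint motion energy: total absolute displacement over time.
num_joints = len(skeleton[0])
energy = [sum(abs(c) for frame in motion for c in frame[j]) for j in range(num_joints)]
```

A static joint yields zero energy while a moving joint yields a positive value, which is the kind of signal a motion-excitation step could use to emphasize motion-sensitive features.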

Abstract:
This article studies group-wise point set registration and makes the following contributions: “FuzzyGReg”, which is a new fuzzy cluster-based method to register multiple point sets jointly, and “FuzzyQA”, which is the associated quality assessment to check registration accuracy automatically. Given a group of point sets, FuzzyGReg creates a model of fuzzy clusters and equally treats all the point sets as the elements of the fuzzy clusters. Then, the group-wise registration is turned into a fuzzy clustering problem. To resolve this problem, FuzzyGReg applies a fuzzy clustering algorithm to identify the parameters of the fuzzy clusters while jointly transforming all the point sets to achieve an alignment. Next, based on the identified fuzzy clusters, FuzzyQA calculates the spatial properties of the transformed point sets and then checks the alignment accuracy by comparing the similarity degrees of the spatial properties of the point sets. When a local misalignment is detected, a local re-alignment is performed to improve accuracy. The proposed method is cost-efficient and easy to implement. In addition, it provides reliable quality assessments in the absence of ground truth and user intervention. In the experiments, different point sets are used to test the proposed method and make comparisons with state-of-the-art registration techniques. The experimental results demonstrate the effectiveness of our method. The code is available at https://gitsvn-nt.oru.se/qianfang.liao/FuzzyGRegWithQA
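
To make the notion of fuzzy cluster membership concrete, the following minimal sketch computes memberships with the textbook fuzzy c-means formula. The abstract does not state which fuzzy clustering algorithm FuzzyGReg uses, so this formula, the fuzzifier `m=2.0`, the function name `fuzzy_memberships`, and the toy points are all assumptions.

```python
import math

def fuzzy_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means memberships: u_ij proportional to d_ij^(-2/(m-1))."""
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # A point lying exactly on a center belongs fully to that center.
            row = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(row)
        else:
            row = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(row)
        memberships.append([v / total for v in row])
    return memberships

# Toy 2D points and two cluster centers (assumed values).
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fuzzy_memberships(points, centers)
```

Unlike hard clustering, the middle point at (1, 0) is shared between both clusters, with most of its membership going to the nearer center.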

Abstract:
Deep-learning-based local feature extraction algorithms that combine detection and description have made significant progress in visible image matching. However, the end-to-end training of such frameworks is notoriously unstable due to the lack of strong supervision of detection and the inappropriate coupling between detection and description. The problem is magnified in cross-modal scenarios, in which most methods rely heavily on pre-training. In this paper, we recouple independent constraints of detection and description of multimodal feature learning with a mutual weighting strategy, in which the detected probabilities of robust features are forced to peak and repeat, while features with high detection scores are emphasized during optimization. Different from previous works, those weights are detached from back propagation so that the detected probability of indistinct features would not be directly suppressed and the training would be more stable. Moreover, we propose the Super Detector, a detector that possesses a large receptive field and is equipped with learnable non-maximum suppression layers, to fulfill the harsh terms of detection. Finally, we build a benchmark that contains cross visible, infrared, near-infrared and synthetic aperture radar image pairs for evaluating the performance of features in feature matching and image registration tasks. Extensive experiments demonstrate that features trained with the recoupled detection and description, named ReDFeat, surpass previous state-of-the-art methods in the benchmark, while the model can be readily trained from scratch. The code is released at https://github.com/ACuOoOoO/ReDFeat.

Abstract:
Detecting moiré patterns in digital photographs is meaningful as it provides priors towards image quality evaluation and demoiréing tasks. In this paper, we present a simple yet efficient framework to extract moiré edge maps from images with moiré patterns. The framework includes a strategy for training triplet (natural image, moiré layer, and their synthetic mixture) generation, and a Moiré Pattern Detection Neural Network (MoireDet) for moiré edge map estimation. This strategy ensures consistent pixel-level alignments during training, accommodating characteristics of a diverse set of camera-captured screen images and real-world moiré patterns from natural images. The design of three encoders in MoireDet exploits both high-level contextual and low-level structural features of various moiré patterns. Through comprehensive experiments, we demonstrate the advantages of MoireDet: better identification precision of moiré images on two datasets, and a marked improvement over state-of-the-art demoiréing methods.

Abstract:
Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed cross-modal view-mixed transformer (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components.

Abstract:
Person re-identification (Re-ID) has become a hot research topic due to its widespread applications. Conducting person Re-ID in video sequences is a practical requirement, in which the crucial challenge is how to pursue a robust video representation based on spatial and temporal features. However, most of the previous methods only consider how to integrate part-level features in the spatio-temporal range, while how to model and generate the part-correlations is little exploited. In this paper, we propose a skeleton-based dynamic hypergraph framework, namely Skeletal Temporal Dynamic Hypergraph Neural Network (ST-DHGNN) for person Re-ID, which resorts to modeling the high-order correlations among various body parts based on a time series of skeletal information. Specifically, multi-shape and multi-scale patches are heuristically cropped from feature maps, constituting spatial representations in different frames. A joint-centered hypergraph and a bone-centered hypergraph are constructed in parallel from multiple body parts (i.e., head, trunk, and legs) with spatio-temporal multi-granularity in the entire video sequence, in which the graph vertices represent regional features and the hyperedges denote relationships. Dynamic hypergraph propagation containing the re-planning module and the hyperedge elimination module is proposed to better integrate features among vertices. Feature aggregation and attention mechanisms are also adopted to obtain a better video representation for person Re-ID. Experiments show that the proposed method performs significantly better than the state-of-the-art on three video-based person Re-ID datasets, including iLIDS-VID, PRID-2011, and MARS.

Abstract:
Human-Object Interaction (HOI) detection recognizes how persons interact with objects, which is advantageous in autonomous systems such as self-driving vehicles and collaborative robots. However, current HOI detectors are often plagued by model inefficiency and unreliability when making a prediction, which consequently limits their potential for real-world scenarios. In this paper, we address these challenges by proposing ERNet, an end-to-end trainable convolutional-transformer network for HOI detection. The proposed model employs an efficient multi-scale deformable attention to effectively capture vital HOI features. We also put forward a novel detection attention module to adaptively generate semantically rich instance and interaction tokens. These tokens undergo pre-emptive detections to produce initial region and vector proposals that also serve as queries, which enhance the feature refinement process in the transformer decoders. Several impactful enhancements are also applied to improve the HOI representation learning. Additionally, we utilize a predictive uncertainty estimation framework in the instance and interaction classification heads to quantify the uncertainty behind each prediction. By doing so, we can accurately and reliably predict HOIs even under challenging scenarios. Experiment results on the HICO-Det, V-COCO, and HOI-A datasets demonstrate that the proposed model achieves state-of-the-art performance in detection accuracy and training efficiency. Codes are publicly available at https://github.com/Monash-CyPhi-AI-Research-Lab/ernet.

Abstract:
The scene classification of remote sensing (RS) images plays an essential role in the RS community, aiming to assign semantics to different RS scenes. With the increase of spatial resolution of RS images, high-resolution RS (HRRS) image scene classification becomes a challenging task because the contents within HRRS images are diverse in type, various in scale, and massive in volume. Recently, deep convolution neural networks (DCNNs) have provided promising results for HRRS scene classification. Most of them regard HRRS scene classification tasks as single-label problems. In this way, the semantics represented by the manual annotation decide the final classification results directly. Although this is feasible, the various semantics hidden in HRRS images are ignored, thus resulting in inaccurate decisions. To overcome this limitation, we propose a semantic-aware graph network (SAGN) for HRRS images. SAGN consists of a dense feature pyramid network (DFPN), an adaptive semantic analysis module (ASAM), a dynamic graph feature update module, and a scene decision module (SDM). Their functions are to extract the multi-scale information, mine the various semantics, exploit the unstructured relations between diverse semantics, and make the decision for HRRS scenes, respectively. Instead of transforming single-label problems into multi-label issues, our SAGN elaborates the proper methods to make full use of diverse semantics hidden in HRRS images to accomplish scene classification tasks. Extensive experiments are conducted on three popular HRRS scene data sets. Experimental results show the effectiveness of the proposed SAGN. Our source codes are available at https://github.com/TangXu-Group/SAGN.

Abstract:
High spatial resolution (HSR) remote sensing images contain complex foreground-background relationships, which makes the remote sensing land cover segmentation a special semantic segmentation task. The main challenges come from the large-scale variation, complex background samples and imbalanced foreground-background distribution. These issues make recent context modeling methods sub-optimal due to the lack of foreground saliency modeling. To handle these problems, we propose a Remote Sensing Segmentation framework (RSSFormer), including Adaptive TransFormer Fusion Module, Detail-aware Attention Layer and Foreground Saliency Guided Loss. Specifically, from the perspective of relation-based foreground saliency modeling, our Adaptive Transformer Fusion Module can adaptively suppress background noise and enhance object saliency when fusing multi-scale features. Then our Detail-aware Attention Layer extracts the detail and foreground-related information via the interplay of spatial attention and channel attention, which further enhances the foreground saliency. From the perspective of optimization-based foreground saliency modeling, our Foreground Saliency Guided Loss can guide the network to focus on hard samples with low foreground saliency responses to achieve balanced optimization. Experimental results on LoveDA datasets, Vaihingen datasets, Potsdam datasets and iSAID datasets validate that our method outperforms existing general semantic segmentation methods and remote sensing segmentation methods, and achieves a good compromise between computational overhead and accuracy. Our code is available at https://github.com/Rongtao-Xu/RepresentationLearning/tree/main/RSSFormer-TIP2023.

Abstract:
Although multispectral and hyperspectral imaging acquisitions are applied in numerous fields, the existing spectral imaging systems suffer from either low temporal or spatial resolution. In this study, a new multispectral imaging system, the camera array based multispectral super resolution imaging system (CAMSRIS), is proposed that can simultaneously achieve multispectral imaging with high temporal and spatial resolutions. The proposed registration algorithm is used to align pairs of different peripheral and central view images. A novel, super-resolution, spectral-clustering-based image reconstruction algorithm was developed for the proposed CAMSRIS to improve the spatial resolution of the acquired images and preserve the exact spectral information without introducing false information. The reconstructed results showed that the spatial and spectral quality and operational efficiency of the proposed system are better than those of a multispectral filter array (MSFA) based on different multispectral datasets. The PSNR of the multispectral super-resolution images obtained by the proposed method was higher by 2.03 and 1.93 dB than those of GAP-TV and DeSCI, respectively, and the execution time was significantly shortened by approximately 54.55 s and 9820.19 s when the CAMSI dataset was used. The feasibility of the proposed system was verified in practical applications based on different scenes captured by the self-built system.
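
The PSNR figures quoted above can be read against the standard definition of the metric. The following is a minimal sketch of that definition only, not the authors' evaluation code; the function name `psnr`, the flat-list image representation, and the default peak value of 1.0 are assumptions made for illustration.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat pixel lists.

    `peak` is the maximum possible pixel value (assumed 1.0 for normalized images).
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


if __name__ == "__main__":
    print(psnr([0.0, 0.5, 1.0], [0.1, 0.4, 0.9]))
```

Under this definition, a gain of 2.03 dB corresponds to a mean squared error that is lower by a factor of about 1.6.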

Abstract:
Weakly supervised temporal action localization (WS-TAL) aims to identify the time intervals corresponding to actions of interest in untrimmed videos with video-level weak supervision. For most existing WS-TAL methods, two commonly encountered challenges are under-localization and over-localization, which inevitably bring about severe performance deterioration. To address the issues, this paper proposes a transformer-structured stochastic process modeling framework, namely StochasticFormer, to fully investigate finer-grained interactions among the intermediate predictions to achieve further refined localization. StochasticFormer is built on a standard attention-based pipeline to derive preliminary frame/snippet-level predictions. Then, the pseudo localization module generates variable-length pseudo action instances with the corresponding pseudo labels. Using the pseudo “action instance - action category” pairs as fine-grained pseudo supervision, the stochastic modeler aims to learn the underlying interaction among the intermediate predictions with an encoder-decoder network. The encoder consists of the deterministic and latent paths to capture the local and global information, which are subsequently integrated by the decoder to obtain reliable predictions. The framework is optimized with three carefully designed losses, i.e., the video-level classification loss, the frame-level semantic coherence loss, and the ELBO loss. Extensive experiments on two benchmarks, i.e., THUMOS14 and ActivityNet1.2, have shown the efficacy of StochasticFormer compared with the state-of-the-art methods.

Abstract:
Multimodal remote sensing (RS) image segmentation aims to comprehensively utilize multiple RS modalities to assign pixel-level semantics to the studied scenes, which can provide a new perspective for global city understanding. Multimodal segmentation inevitably encounters the challenge of modeling intra- and inter-modal relationships, i.e., object diversity and modal gaps. However, the previous methods are usually designed for a single RS modality, limited by the noisy collection environment and poor discrimination information. Neuropsychology and neuroanatomy confirm that the human brain performs the guiding perception and integrative cognition of multimodal semantics through intuitive reasoning. Therefore, establishing a semantic understanding framework inspired by intuition to realize multimodal RS segmentation becomes the main motivation of this work. Driven by the superiority of hypergraphs in modeling high-order relationships, we propose an intuition-inspired hypergraph network (I^2HN) for multimodal RS segmentation. Specifically, we present a hypergraph parser to imitate guiding perception to learn intra-modal object-wise relationships. It parses the input modality into irregular hypergraphs to mine semantic clues and generate robust mono-modal representations. In addition, we also design a hypergraph matcher to dynamically update the hypergraph structure from the explicit correspondence of visual concepts, similar to integrative cognition, to improve cross-modal compatibility when fusing multimodal features. Extensive experiments on two multimodal RS datasets show that the proposed I^2HN outperforms the state-of-the-art models, achieving F1/mIoU accuracy 91.4%/82.9% on the ISPRS Vaihingen dataset, and 92.1%/84.2% on the MSAW dataset.

Abstract:
Currently, cross-scene hyperspectral image (HSI) classification has drawn increasing attention. It is necessary to train a model only on the source domain (SD) and directly transfer the model to the target domain (TD) when the TD needs to be processed in real time and cannot be reused for training. Based on the idea of domain generalization, a Single-source Domain Expansion Network (SDEnet) is developed to ensure the reliability and effectiveness of domain extension. The method uses generative adversarial learning to train in SD and test in TD. A generator including a semantic encoder and a morph encoder is designed to generate the extended domain (ED) based on an encoder-randomization-decoder architecture, where spatial randomization and spectral randomization are specifically used to generate variable spatial and spectral information, and the morphological knowledge is implicitly applied as domain invariant information during domain expansion. Furthermore, the supervised contrastive learning is employed in the discriminator to learn class-wise domain invariant representation, which drives intra-class samples of SD and ED. Meanwhile, adversarial training is designed to optimize the generator to drive intra-class samples of SD and ED to be separated. Extensive experiments on two public HSI datasets and one additional multispectral image (MSI) dataset demonstrate the superiority of the proposed method when compared with state-of-the-art techniques. The codes will be available from the website: https://github.com/YuxiangZhang-BIT/IEEE_TIP_SDEnet.

Abstract:
Faithful measurement of perceptual quality is of significant importance to various multimedia applications. By fully utilizing reference images, full-reference image quality assessment (FR-IQA) methods usually achieve better prediction performance. On the other hand, no-reference image quality assessment (NR-IQA), also known as blind image quality assessment (BIQA), does not consider the reference image, which makes it a challenging but important task. Previous NR-IQA methods have focused on spatial measures at the expense of information in the available frequency bands. In this paper, we present a multiscale deep blind image quality assessment method (BIQA, M.D.) with spatial optimal-scale filtering analysis. Motivated by the multi-channel behavior of the human visual system and contrast sensitivity function, we decompose an image into a number of spatial frequency bands through multiscale filtering and extract features to map an image to its subjective quality score by applying a convolutional neural network. Experimental results show that BIQA, M.D. compares well with existing NR-IQA methods and generalizes well across datasets.

Abstract:
Point cloud registration is a popular topic that has been widely used in 3D model reconstruction, location, and retrieval. In this paper, we propose a new registration method, KSS-ICP, to address the rigid registration task in Kendall shape space (KSS) with Iterative Closest Point (ICP). The KSS is a quotient space that removes influences of translations, scales, and rotations for shape feature-based analysis. Such influences can be summarized as similarity transformations, which do not change the shape feature. The point cloud representation in KSS is invariant to similarity transformations. We utilize this property to design the KSS-ICP for point cloud registration. To tackle the difficulty of obtaining the KSS representation in general, the proposed KSS-ICP formulates a practical solution that does not require complex feature analysis, data training, or optimization. With a simple implementation, KSS-ICP achieves more accurate registration from point clouds. It is robust to similarity transformation, non-uniform density, noise, and defective parts. Experiments show that KSS-ICP has better performance than the state-of-the-art. Code (vvvwo/KSS-ICP) and executable files (vvvwo/KSS-ICP/tree/master/EXE) are made public.
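
The invariance to translation and scale described above can be illustrated with the usual pre-shape normalization from Kendall's shape theory. The sketch below is not the KSS-ICP implementation; it only centers a point set and scales it to unit norm, and it leaves rotation untouched (in the abstract's description, that part is handled together with ICP). The function name `to_preshape` and the list-of-lists point representation are assumptions.

```python
import math


def to_preshape(points):
    """Center a point set at its centroid and scale it to unit Frobenius norm.

    This removes translation and uniform scale; rotation is NOT removed here.
    """
    n = len(points)
    dim = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - centroid[d] for d in range(dim)] for p in points]
    norm = math.sqrt(sum(c * c for row in centered for c in row))
    if norm == 0:
        raise ValueError("all points coincide; the shape is undefined")
    return [[c / norm for c in row] for row in centered]


if __name__ == "__main__":
    cloud = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
    # The same cloud, scaled by 3 and shifted: it maps to the same pre-shape.
    moved = [[3 * x + 5, 3 * y - 2, 3 * z + 1] for x, y, z in cloud]
    print(to_preshape(cloud))
    print(to_preshape(moved))
```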

Abstract:
Single-frame infrared small target (SIRST) detection aims at separating small targets from clutter backgrounds. With the advances of deep learning, CNN-based methods have yielded promising results in generic object detection due to their powerful modeling capability. However, existing CNN-based methods cannot be directly applied to infrared small targets since pooling layers in their networks could lead to the loss of targets in deep layers. To handle this problem, we propose a dense nested attention network (DNA-Net) in this paper. Specifically, we design a dense nested interactive module (DNIM) to achieve progressive interaction among high-level and low-level features. With the repetitive interaction in DNIM, the information of infrared small targets in deep layers can be maintained. Based on DNIM, we further propose a cascaded channel and spatial attention module (CSAM) to adaptively enhance multi-level features. With our DNA-Net, contextual information of small targets can be well incorporated and fully exploited by repetitive fusion and enhancement. Moreover, we develop an infrared small target dataset (namely, NUDT-SIRST) and propose a set of evaluation metrics to conduct comprehensive performance evaluation. Experiments on both public and our self-developed datasets demonstrate the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of probability of detection (P_d), false-alarm rate (F_a), and intersection over union (IoU).
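
Two of the metrics named above can be sketched at the pixel level. This is an illustrative sketch, not the paper's evaluation protocol: IoU follows its standard definition, while the false-alarm rate is assumed here to be the fraction of image pixels wrongly predicted as target. The probability of detection is a target-level quantity that needs connected-component matching and is left out. The name `pixel_metrics` and the flat binary-mask representation are assumptions.

```python
def pixel_metrics(pred, truth):
    """Return (IoU, false-alarm rate) for two equally sized flat binary masks."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("masks must be non-empty and of equal length")
    inter = sum(1 for p, t in zip(pred, truth) if p and t)
    union = sum(1 for p, t in zip(pred, truth) if p or t)
    false_pos = sum(1 for p, t in zip(pred, truth) if p and not t)
    # Two empty masks agree perfectly, so define IoU as 1.0 in that case.
    iou = inter / union if union else 1.0
    # Assumed definition: falsely predicted pixels over all pixels.
    false_alarm = false_pos / len(pred)
    return iou, false_alarm


if __name__ == "__main__":
    print(pixel_metrics([1, 1, 0, 0], [0, 1, 1, 0]))
```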

Abstract:
With the increasing demand of compressing and streaming 3D point clouds under constrained bandwidth, it has become ever more important to accurately and efficiently determine the quality of compressed point clouds, so as to assess and optimize the quality-of-experience (QoE) of end users. Here we make one of the first attempts at developing a bitstream-based no-reference (NR) model for perceptual quality assessment of point clouds without resorting to full decoding of the compressed data stream. Specifically, we first establish a relationship between texture complexity and the bitrate and texture quantization parameters based on an empirical rate-distortion model. We then construct a texture distortion assessment model upon texture complexity and quantization parameters. By combining this texture distortion model with a geometric distortion model derived from Trisoup geometry encoding parameters, we obtain an overall bitstream-based NR point cloud quality model named streamPCQ. Experimental results show that the proposed streamPCQ model demonstrates highly competitive performance when compared with existing classic full-reference (FR) and reduced-reference (RR) point cloud quality assessment methods with a fraction of computational cost.

Abstract:
We address the one-class classification (OCC) problem and advocate a one-class MKL (multiple kernel learning) approach for this purpose. To this aim, based on the Fisher null-space OCC principle, we present a multiple kernel learning algorithm where an $\ell_p$-norm regularisation ($p \geq 1$) is considered for kernel weight learning. We cast the proposed one-class MKL problem as a min-max saddle point Lagrangian optimisation task and propose an efficient approach to optimise it. An extension of the proposed approach is also considered where several related one-class MKL tasks are learned concurrently by constraining them to share common weights for kernels. An extensive evaluation of the proposed MKL approach on a range of data sets from different application domains confirms its merits against the baseline and several other algorithms.
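
The role of the ℓ_p-norm on kernel weights can be illustrated with the common MKL convention of combining base kernel matrices with non-negative weights of unit ℓ_p norm. The sketch below shows only that normalization and the weighted combination; it is not the paper's saddle-point optimisation, and the function names `normalize_lp` and `combine_kernels` are hypothetical.

```python
def normalize_lp(weights, p):
    """Scale non-negative kernel weights so that their l_p norm equals 1 (p >= 1)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights are assumed non-negative")
    norm = sum(w ** p for w in weights) ** (1.0 / p)
    if norm == 0:
        raise ValueError("at least one weight must be positive")
    return [w / norm for w in weights]


def combine_kernels(kernels, weights):
    """Weighted sum of equally sized square kernel matrices (lists of lists)."""
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    k_identity = [[1.0, 0.0], [0.0, 1.0]]
    k_ones = [[1.0, 1.0], [1.0, 1.0]]
    beta = normalize_lp([3.0, 4.0], p=2)
    print(beta)
    print(combine_kernels([k_identity, k_ones], beta))
```

With p = 1 the normalization tends to favour sparse weight vectors, while larger p spreads weight more evenly across kernels.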

Abstract:
With the increasing spectral dimension of hyperspectral images (HSI), how to correctly choose bands based on band correlation and information has become more significant, but also more complicated. Band selection is a combinatorial optimization problem, and intelligent optimization algorithms have been shown to be crucial in solving combinatorial optimization problems. However, most of them only use a single objective as the selection index, while neglecting the overall features of hyperspectral images, which may lead to inaccuracy in object detection. To tackle this, we propose a band selection method based on a multi-objective cuckoo search algorithm (MOCS) when constructing a multi-objective unsupervised band selection model based on the amount of information and correlation of the bands (MOCS-BS). Specifically, an adaptive strategy based on population crowding degree is first proposed to assist Lévy flight in overcoming the influence of the parameter constancy. Then, an information-sharing strategy based on grouping and crossover is designed to balance the search ability between global exploration and local exploitation, which can overcome the shortcomings caused by the lack of information interaction between individuals. Finally, the HSI classification experiments are performed by Random Forest and KNN classifiers based on the subset of bands selected by the proposed MOCS-BS method. The proposed method is compared with state-of-the-art algorithms including neighborhood grouping normalized matched filter (NGNMF) and multi-objective artificial bee colony with band selection (MABC-BS) on four HSI datasets. The experimental results demonstrate that MOCS-BS is more effective and robust than other methods.

Affiliations: IVIP Lab, Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing, China; School of Communication and Information Engineering, Shanghai University, Shanghai, China; School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China; Department of Mathematics, Center for Mathematical Artificial Intelligence, The Chinese University of Hong Kong, Hong Kong, China; Research Center for Industries of the Future and the School of Engineering, Westlake University, Hangzhou, China

Abstract:
Recently, face super-resolution methods steered by deep convolution neural networks (CNNs) have achieved great progress in restoring degraded facial details by joint training with facial priors. However, these methods have some obvious limitations. On the one hand, multi-task joint learning requires additional marking on the dataset, and the introduced prior network will significantly increase the computational cost of the model. On the other hand, the limited receptive field of CNN will reduce the fidelity and naturalness of the reconstructed facial images, resulting in suboptimal reconstructed images. In this work, we propose an efficient CNN-Transformer Cooperation Network (CTCNet) for face super-resolution tasks, which uses the multi-scale connected encoder-decoder architecture as the backbone. Specifically, we first devise a novel Local-Global Feature Cooperation Module (LGCM), which is composed of a Facial Structure Attention Unit (FSAU) and a Transformer block, to promote the consistency of local facial detail and global facial structure restoration simultaneously. Then, we design an efficient Feature Refinement Module (FRM) to enhance the encoded features. Finally, to further improve the restoration of fine facial details, we present a Multi-scale Feature Fusion Unit (MFFU) to adaptively fuse the features from different stages in the encoder procedure. Extensive evaluations on various datasets demonstrate that the proposed CTCNet can outperform other state-of-the-art methods significantly. Source code will be available at https://github.com/IVIPLab/CTCNet.

Abstract:
As a branch of transfer learning, domain adaptation leverages useful knowledge from a source domain to a target domain for solving target tasks. Most of the existing domain adaptation methods focus on how to diminish the conditional distribution shift and learn invariant features between different domains. However, two important factors are overlooked by most existing methods: 1) the transferred features should be not only domain invariant but also discriminative and correlated, and 2) negative transfer should be avoided as much as possible for the target tasks. To fully consider these factors in domain adaptation, we propose a guided discrimination and correlation subspace learning (GDCSL) method for cross-domain image classification. GDCSL considers the domain-invariant, category-discriminative, and correlation learning of data. Specifically, GDCSL introduces the discriminative information associated with the source and target data by minimizing the intraclass scatter and maximizing the interclass distance. By designing a new correlation term, GDCSL extracts the most correlated features from the source and target domains for image classification. The global structure of the data can be preserved in GDCSL because the target samples are represented by the source samples. To avoid negative transfer issues, we use a sample reweighting method to detect target samples with different confidence levels. A semi-supervised extension of GDCSL (Semi-GDCSL) is also proposed, and a novel label selection scheme is introduced to ensure the correctness of the target pseudo-labels. Comprehensive and extensive experiments are conducted on several cross-domain data benchmarks. The experimental results verify the effectiveness of the proposed methods over state-of-the-art domain adaptation methods.

Abstract:
In this paper, an Adaptive Fusion Transformer (AFT) is proposed for unsupervised pixel-level fusion of visible and infrared images. Unlike existing convolutional networks, AFT adopts a transformer to model the relationship of multi-modality images and explore cross-modal interactions. The encoder of AFT uses a Multi-Head Self-attention (MSA) module and Feed Forward (FF) network for feature extraction. Then, a Multi-head Self-Fusion (MSF) module is designed for the adaptive perceptual fusion of the features. By sequentially stacking the MSF, MSA, and FF, a fusion decoder is constructed to gradually locate complementary features for recovering informative images. In addition, a structure-preserving loss is defined to enhance the visual quality of fused images. Extensive experiments are conducted on several datasets to compare our proposed AFT method with 21 popular approaches. The results show that AFT has state-of-the-art performance in both quantitative metrics and visual perception.

Abstract:
RGB-D saliency detection aims to fuse multi-modal cues to accurately localize salient regions. Existing works often adopt attention modules for feature modeling, with few methods explicitly leveraging fine-grained details to merge with semantic cues. Thus, despite the auxiliary depth information, it is still challenging for existing models to distinguish objects with similar appearances but at distinct camera distances. In this paper, from a new perspective, we propose a novel Hierarchical Depth Awareness network (HiDAnet) for RGB-D saliency detection. Our motivation comes from the observation that the multi-granularity properties of geometric priors correlate well with the neural network hierarchies. To realize multi-modal and multi-level fusion, we first use a granularity-based attention scheme to strengthen the discriminatory power of RGB and depth features separately. Then we introduce a unified cross dual-attention module for multi-modal and multi-level fusion in a coarse-to-fine manner. The encoded multi-modal features are gradually aggregated into a shared decoder. Further, we exploit a multi-scale loss to take full advantage of the hierarchical information. Extensive experiments on challenging benchmark datasets demonstrate that our HiDAnet performs favorably over the state-of-the-art methods by large margins. The source code can be found in https://github.com/Zongwei97/HIDANet/.

Abstract:
Semi-supervised learning has been well established in the area of image classification but remains to be explored in video-based action recognition. FixMatch is a state-of-the-art semi-supervised method for image classification, but it does not work well when transferred directly to the video domain since it only utilizes the single RGB modality, which contains insufficient motion information. Moreover, it only leverages highly-confident pseudo-labels to explore consistency between strongly-augmented and weakly-augmented samples, resulting in limited supervised signals, long training time, and insufficient feature discriminability. To address the above issues, we propose neighbor-guided consistent and contrastive learning (NCCL), which takes both RGB and temporal gradient (TG) as input and is based on the teacher-student framework. Due to the limitation of labelled samples, we first incorporate neighbors information as a self-supervised signal to explore the consistent property, which compensates for the lack of supervised signals and the shortcoming of long training time of FixMatch. To learn more discriminative feature representations, we further propose a novel neighbor-guided category-level contrastive learning term to minimize the intra-class distance and enlarge the inter-class distance. We conduct extensive experiments on four datasets to validate the effectiveness. Compared with the state-of-the-art methods, our proposed NCCL achieves superior performance with much lower computational cost.

Abstract:
Recently, clustering-based methods have been the dominant solution for unsupervised person re-identification (ReID). Memory-based contrastive learning is widely used for its effectiveness in unsupervised representation learning. However, we find that the inaccurate cluster proxies and the momentum updating strategy do harm to the contrastive learning system. In this paper, we propose a real-time memory updating strategy (RTMem) to update the cluster centroid with a randomly sampled instance feature in the current mini-batch without momentum. Compared to the method that calculates the mean feature vectors as the cluster centroid and updates it with momentum, RTMem enables the features to be up-to-date for each cluster. Based on RTMem, we propose two contrastive losses, i.e., sample-to-instance and sample-to-cluster, to align the relationships between samples to each cluster and to all outliers not belonging to any other clusters. On the one hand, sample-to-instance loss explores the sample relationships of the whole dataset to enhance the capability of the density-based clustering algorithm, which relies on similarity measurement for the instance-level images. On the other hand, with pseudo-labels generated by the density-based clustering algorithm, sample-to-cluster loss enforces the sample to be close to its cluster proxy while being far from other proxies. With the simple RTMem contrastive learning strategy, the performance of the corresponding baseline is improved by 9.3% on the Market-1501 dataset. Our method consistently outperforms state-of-the-art unsupervised learning person ReID methods on three benchmark datasets. Code is made available at: https://github.com/PRIS-CV/RTMem.
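
The memory update described above can be sketched as follows. This is a simplified illustration based only on the abstract, not the released RTMem code: for every cluster present in the mini-batch, the stored centroid is overwritten by one randomly chosen instance feature, with no momentum term. The function name `rtmem_update`, the dictionary-based memory, and the plain-list feature vectors are assumptions.

```python
import random


def rtmem_update(memory, batch_features, batch_labels, rng=random):
    """Overwrite each cluster's memory slot with one random instance feature.

    memory: dict mapping cluster label -> feature vector (list of floats).
    Clusters absent from the mini-batch keep their stored feature.
    """
    by_cluster = {}
    for feature, label in zip(batch_features, batch_labels):
        by_cluster.setdefault(label, []).append(feature)
    for label, features in by_cluster.items():
        # No momentum: the old centroid is replaced, not blended.
        memory[label] = list(rng.choice(features))
    return memory


if __name__ == "__main__":
    memory = {0: [0.0, 0.0], 1: [9.0, 9.0]}
    features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    labels = [0, 0, 2]
    print(rtmem_update(memory, features, labels, rng=random.Random(0)))
```

A momentum-based memory would instead blend the old centroid with the batch mean, so its entries lag behind the current encoder; the replacement rule above is what keeps the stored features up to date.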

Abstract:
Zero-shot video object segmentation (ZS-VOS) aims to segment foreground objects in a video sequence without prior knowledge of these objects. However, existing ZS-VOS methods often struggle to distinguish between foreground and background or to keep track of the foreground in complex scenarios. The common practice of introducing motion information, such as optical flow, can lead to overreliance on optical flow estimation. To address these challenges, we propose an encoder-decoder-based hierarchical co-attention propagation network (HCPN) capable of tracking and segmenting objects. Specifically, our model is built upon multiple collaborative evolutions of the parallel co-attention module (PCM) and the cross co-attention module (CCM). PCM captures common foreground regions among adjacent appearance and motion features, while CCM further exploits and fuses cross-modal motion features returned by PCM. Our method is progressively trained to achieve hierarchical spatio-temporal feature propagation across the entire video. Experimental results demonstrate that our HCPN outperforms all previous methods on public benchmarks, showcasing its effectiveness for ZS-VOS. Code and pre-trained model can be found at https://github.com/NUST-Machine-Intelligence-Laboratory/HCPN.

Abstract:
Learning radiance fields has shown remarkable results for novel view synthesis. The learning procedure usually takes a long time, which motivates the latest methods to speed it up by learning without neural networks or by using more efficient data structures. However, these specially designed approaches do not work for most radiance-field-based methods. To resolve this issue, we introduce a general strategy to speed up the learning procedure for almost all radiance-field-based methods. Our key idea is to reduce the redundancy by shooting far fewer rays in the multi-view volume rendering procedure, which is the basis for almost all radiance-field-based methods. We find that shooting rays at pixels with dramatic color change not only significantly reduces the training burden but also barely affects the accuracy of the learned radiance fields. In addition, we adaptively subdivide each view into a quadtree according to the average rendering error in each node of the tree, which lets us dynamically shoot more rays in more complex regions with larger rendering error. We evaluate our method with different radiance-field-based methods on widely used benchmarks. Experimental results show that our method achieves comparable accuracy to the state-of-the-art with much faster training.
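
The error-driven quadtree subdivision mentioned above can be sketched on a toy per-pixel error map. The splitting rule (split while the mean error of a node exceeds a threshold), the threshold itself, and all names are assumptions for illustration; the abstract does not specify the exact criterion.

```python
from typing import List, Tuple

Node = Tuple[int, int, int, int]  # (row, col, height, width)


def mean_error(err: List[List[float]], node: Node) -> float:
    """Average rendering error over the pixels covered by a node."""
    r, c, h, w = node
    total = sum(err[i][j] for i in range(r, r + h) for j in range(c, c + w))
    return total / (h * w)


def subdivide(err: List[List[float]], node: Node,
              threshold: float, min_size: int = 1) -> List[Node]:
    """Recursively split nodes whose mean error exceeds the threshold.

    Returns the leaf nodes; regions with larger error end up with more,
    smaller leaves, so sampling rays per leaf puts more rays there."""
    r, c, h, w = node
    if mean_error(err, node) <= threshold or h <= min_size or w <= min_size:
        return [node]
    h2, w2 = h // 2, w // 2
    children = [
        (r, c, h2, w2),
        (r, c + w2, h2, w - w2),
        (r + h2, c, h - h2, w2),
        (r + h2, c + w2, h - h2, w - w2),
    ]
    leaves: List[Node] = []
    for child in children:
        leaves.extend(subdivide(err, child, threshold, min_size))
    return leaves


if __name__ == "__main__":
    # 4x4 error map with high error only in the top-left 2x2 block.
    error_map = [[1.0 if i < 2 and j < 2 else 0.0 for j in range(4)] for i in range(4)]
    for leaf in subdivide(error_map, (0, 0, 4, 4), threshold=0.2):
        print(leaf)
```

In this toy case the high-error quadrant is split down to single pixels while the three low-error quadrants stay whole.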

Abstract:
Multi-view action recognition aims to identify action categories from given clues. Existing studies ignore the negative influences of fuzzy views between view and action in disentangling, which commonly leads to mistaken recognition results. To this end, we regard the observed image as the composition of the view and action components, and give full play to the advantages of multiple views via the adaptive cooperative representation among these two components, forming a Dual-Recommendation Disentanglement Network (DRDN) for multi-view action recognition. Specifically, 1) For the action, we leverage a multi-level Specific Information Recommendation (SIR) to enhance the interaction among intricate activities and views. SIR offers a more comprehensive representation of activities, measuring the trade-off between global and local information. 2) For the view, we utilize a Pyramid Dynamic Recommendation (PDR) to learn a complete and detailed global representation by transferring features from different views. It is explicitly restricted to resist the fuzzy noise influence, focusing on positive knowledge from other views. Our DRDN aims for complete action and view representation, where PDR directly guides action to disentangle with view features and SIR considers mutual exclusivity of view and action clues. Extensive experiments indicate that the proposed multi-view action recognition method DRDN achieves state-of-the-art performance over powerful competitors on several standard benchmarks. The code will be available at https://github.com/51cloud/DRDN.

Abstract:
Point cloud shape correspondence aims at accurately mapping one point cloud to another point cloud with various 3D shapes. Since point clouds are usually sparse, disordered, irregular, and with diverse shapes, it is challenging to learn consistent point cloud representations and achieve accurate matching of different point cloud shapes. To address the above issues, we propose a Hierarchical Shape-consistent TRansformer for unsupervised point cloud shape correspondence (HSTR), including a multi-receptive-field point representation encoder and a shape-consistent constrained module in a unified architecture. The proposed HSTR enjoys several merits. In the multi-receptive-field point representation encoder, we set progressively larger receptive fields in different blocks to simultaneously consider the local structure and the long-range context. In the shape-consistent constrained module, we design two novel shape selective whitening losses, which can complement each other to achieve suppression of features sensitive to shape change. Extensive experimental results on four standard benchmarks demonstrate the superiority and generalization ability of our approach over existing methods at a similar model scale, and our method achieves new state-of-the-art results.

Abstract:
Monocular 3D object detection has drawn increasing attention in various human-related applications, such as autonomous vehicles, due to its cost-effectiveness. On the other hand, a monocular image alone inherently contains insufficient information to infer the 3D information. In this paper, we propose a new monocular 3D object detector that can recall the stereoscopic visual information about an object, given a left-view monocular image. Here, we devise a location embedding module to handle each object by being aware of its location. Next, given the object appearance of the left-view monocular image, we devise Monocular-to-Stereoscopic (M2S) memory that can recall the object appearance of the right-view and depth information. For this purpose, we introduce a stereoscopic vision memorizing loss that guides the M2S memory to store the stereoscopic visual information. Furthermore, we propose a binocular vision association loss that guides the M2S memory to associate the left-view and right-view information about the object when estimating the depth. As a result, our monocular 3D object detector with the M2S memory can effectively exploit the recalled stereoscopic visual information in the inference phase. The comprehensive experimental results on two public datasets, KITTI 3D Object Detection Benchmark and Waymo Open Dataset, demonstrate the effectiveness of the proposed method. We argue that our method is a step forward that mimics the human ability to recall stereoscopic visual information even when one eye is closed.

Abstract:
We present optimized modulation and coding for the recently introduced dual modulated QR (DMQR) codes that extend traditional QR codes to carry additional secondary data in the orientation of elliptical dots that replace black modules in the barcode images. By dynamically adjusting the dot size, we realize gains in embedding strength for both the intensity modulation and the orientation modulation that carry the primary and secondary data, respectively. Furthermore, we develop a model for the coding channel for the secondary data that enables soft-decoding via 5G NR (new radio) codes already supported by mobile devices. The performance gains for the proposed optimized designs are characterized via theoretical analysis, simulations, and actual experiments using smartphone devices. The theoretical analysis and simulations inform our design choices for the modulation and coding, and the experiments characterize the overall improvement in performance for the optimized design over the prior unoptimized designs. Importantly, the optimized designs significantly increase usability of DMQR codes with commonly used QR code beautification that cannibalizes a portion of the barcode image area for the insertion of a logo or image. In experiments with a capture distance of 15 inches, the optimized designs increase the decoding success rates between 10% and 32% for the secondary data while also providing gains for primary data decoding at larger capture distances. When used with beautification in typical settings, the secondary message is decoded with a high success rate for the proposed optimized designs, whereas it invariably fails for the prior unoptimized designs.

Abstract:
Impressive advances in acquisition and sharing technologies have made the growth of multimedia collections and their applications almost unlimited. However, the opposite is true for the availability of labeled data, which is needed for supervised training, since such data is often expensive and time-consuming to obtain. While there is a pressing need for the development of effective retrieval and classification methods, the difficulties faced by supervised approaches highlight the relevance of methods capable of operating with few or no labeled data. In this work, we propose a novel manifold learning algorithm named Rank Flow Embedding (RFE) for unsupervised and semi-supervised scenarios. The proposed method is based on ideas recently exploited by manifold learning approaches, which include hypergraphs, Cartesian products, and connected components. The algorithm computes context-sensitive embeddings, which are refined following a rank-based processing flow, while complementary contextual information is incorporated. The generated embeddings can be exploited for more effective unsupervised retrieval or semi-supervised classification based on Graph Convolutional Networks. Experiments were conducted on 10 different collections. Various features were considered, including the ones obtained with recent Convolutional Neural Networks (CNN) and Vision Transformer (ViT) models. Highly effective results demonstrate the effectiveness of the proposed method on different tasks: unsupervised image retrieval, semi-supervised classification, and person Re-ID. The results demonstrate that RFE is competitive with or superior to the state-of-the-art in diverse evaluated scenarios.

Abstract:
In this paper we propose novel extensions to JPEG 2000 for the coding of discontinuous media, which includes piecewise smooth imagery such as depth maps and optical flows. These extensions use breakpoints to model discontinuity boundary geometry and apply a breakpoint dependent Discrete Wavelet Transform (BP-DWT) to the input imagery. The highly scalable and accessible coding features provided by the JPEG 2000 compression framework are preserved by our proposed extensions, with the breakpoint and transform components encoded as independent bit streams that can be progressively decoded. Comparative rate-distortion results are provided along with corresponding visual examples which highlight the advantages of using breakpoint representations with accompanying BP-DWT and embedded bit-plane coding. Recently our proposed extensions have been adopted and are in the process of being published as a new Part 17 to the JPEG 2000 family of coding standards.

Affiliations: Faculty of Information Science and Engineering, Ocean University of China, Qingdao, China; Department of Computer Science, University of Manitoba, Winnipeg, Canada; School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; School of Nursing, The Hong Kong Polytechnic University, Kowloon, Hong Kong; Department of Computing Science, University of Alberta, Edmonton, Canada; School of Information Science and Technology, University of Science and Technology of China, Hefei, China; School of Data Science, The Chinese University of Hong Kong, Shenzhen, China

Abstract:
Transformer-based architectures have started to emerge in single image super resolution (SISR) and have achieved promising performance. However, most existing vision Transformer-based SISR methods still have two shortcomings: (1) they divide images into the same number of patches with a fixed size, which may not be optimal for restoring patches with different levels of texture richness; and (2) their position encodings treat all input tokens equally and hence neglect the dependencies among them. This paper presents HIPA, a novel Transformer architecture that progressively recovers the high resolution image using a hierarchical patch partition. Specifically, we build a cascaded model that processes an input image in multiple stages, where we start with tokens with small patch sizes and gradually merge them to form the full resolution. Such a hierarchical patch mechanism not only explicitly enables feature aggregation at multiple resolutions but also adaptively learns patch-aware features for different image regions, e.g., using a smaller patch for areas with fine details and a larger patch for textureless regions. Meanwhile, a new attention-based position encoding scheme for Transformer is proposed to let the network focus on the tokens that deserve more attention by assigning different weights to different tokens, which, to the best of our knowledge, is the first such scheme. Furthermore, we also propose a multi-receptive field attention module to enlarge the convolution receptive field from different branches. The experimental results on several public datasets demonstrate the superior performance of the proposed HIPA over previous methods quantitatively and qualitatively. We will share our code and models when the paper is accepted.

Abstract:
Facial action unit (AU) detection is challenging due to the difficulty in capturing correlated information from subtle and dynamic AUs. Existing methods often resort to the localization of correlated regions of AUs, in which predefining local AU attentions by correlated facial landmarks often discards essential parts, or learning global attention maps often contains irrelevant areas. Furthermore, existing relational reasoning methods often employ common patterns for all AUs while ignoring the specific way of each AU. To tackle these limitations, we propose a novel adaptive attention and relation (AAR) framework for facial AU detection. Specifically, we propose an adaptive attention regression network to regress the global attention map of each AU under the constraint of attention predefinition and the guidance of AU detection, which is beneficial for capturing both specified dependencies by landmarks in strongly correlated regions and facial globally distributed dependencies in weakly correlated regions. Moreover, considering the diversity and dynamics of AUs, we propose an adaptive spatio-temporal graph convolutional network to simultaneously reason the independent pattern of each AU, the inter-dependencies among AUs, as well as the temporal dependencies. Extensive experiments show that our approach (i) achieves competitive performance on challenging benchmarks including BP4D, DISFA, and GFT in constrained scenarios and Aff-Wild2 in unconstrained scenarios, and (ii) can precisely learn the regional correlation distribution of each AU.

Abstract:
Text-based Visual Question Answering (TextVQA) aims to produce correct answers for given questions about the images with multiple scene texts. In most cases, the texts naturally attach to the surface of the objects. Therefore, spatial reasoning between texts and objects is crucial in TextVQA. However, existing approaches are constrained within 2D spatial information learned from the input images and rely on transformer-based architectures to reason implicitly during the fusion process. Under this setting, these 2D spatial reasoning approaches cannot distinguish the fine-grained spatial relations between visual objects and scene texts on the same image plane, thereby impairing the interpretability and performance of TextVQA models. In this paper, we introduce 3D geometric information into the spatial reasoning process to capture the contextual knowledge of key objects step-by-step. Specifically, (i) we propose a relation prediction module for accurately locating the region of interest of critical objects; (ii) we design a depth-aware attention calibration module for calibrating the OCR tokens’ attention according to critical objects. Extensive experiments show that our method achieves state-of-the-art performance on TextVQA and ST-VQA datasets. More encouragingly, our model surpasses others by clear margins of 5.7% and 12.1% on questions that involve spatial reasoning in TextVQA and ST-VQA valid split. Besides, we also verify the generalizability of our model on the text-based image captioning task.

Abstract:
Person search by language aims to retrieve pedestrian images of interest based on natural language sentences. Although great efforts have been made to address the cross-modal heterogeneity, most of the current solutions capture only salient attributes while ignoring inconspicuous ones, and are therefore weak in distinguishing very similar pedestrians. In this work, we propose the Adaptive Salient Attribute Mask Network (ASAMN) to adaptively mask the salient attributes for cross-modal alignments, and therefore induce the model to simultaneously focus on inconspicuous attributes. Specifically, we consider the uni-modal and cross-modal relations for masking salient attributes in the Uni-modal Salient Attribute Mask (USAM) and Cross-modal Salient Attribute Mask (CSAM) modules, respectively. Then the Attribute Modeling Balance (AMB) module is presented to randomly select a proportion of masked features for cross-modal alignments, ensuring the balance of modeling capacity of both salient attributes and inconspicuous ones. Extensive experiments and analyses have been carried out to validate the effectiveness and generalization capacity of our proposed ASAMN method, and we have obtained the state-of-the-art retrieval performance on the widely-used CUHK-PEDES and ICFG-PEDES benchmarks.

Abstract:
We focus on addressing the problem of shadow removal for an image, and attempt to make a weakly supervised learning model that does not depend on the pixelwise-paired training samples, but only uses the samples with image-level labels that indicate whether an image contains shadow or not. To this end, we propose a deep reciprocal learning model that interactively optimizes the shadow remover and the shadow detector to improve the overall capability of the model. On the one hand, shadow removal is modeled as an optimization problem with a latent variable of the detected shadow mask. On the other hand, a shadow detector can be trained using the prior from the shadow remover. A self-paced learning strategy is employed to avoid fitting to intermediate noisy annotation during the interactive optimization. Furthermore, a color-maintenance loss and a shadow-attention discriminator are both designed to facilitate model optimization. Extensive experiments on the pairwise ISTD dataset, SRD dataset, and unpaired USR dataset demonstrate the superiority of the proposed deep reciprocal model.

Abstract:
Neural video codecs have demonstrated great potential in video transmission and storage applications. Existing neural hybrid video coding approaches rely on optical flow or Gaussian-scale flow for prediction, which cannot support fine-grained adaptation to diverse motion content. Towards more content-adaptive prediction, we propose a novel cross-scale prediction module that achieves more effective motion compensation. Specifically, on the one hand, we produce a reference feature pyramid as prediction sources and then transmit cross-scale flows that leverage the feature scale to control the precision of prediction. On the other hand, for the first time, a weighted prediction mechanism is introduced even if only a single reference frame is available, which can help synthesize a fine prediction result by transmitting cross-scale weight maps. In addition to the cross-scale prediction module, we further propose a multi-stage quantization strategy, which improves the rate-distortion performance with no extra computational penalty during inference. We show the encouraging performance of our efficient neural video codec (ENVC) on several benchmark datasets. In particular, the proposed ENVC can compete with the latest coding standard H.266/VVC in terms of sRGB PSNR on UVG dataset for the low-latency mode. We also analyze in detail the effectiveness of the cross-scale prediction module in handling various video content, and provide a comprehensive ablation study to analyze those important components. Test code is available at https://github.com/USTC-IMCL/ENVC.

Abstract:
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language, and serves as the basis for various visual and language tasks. Most previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words. However, the close relations between coarse- and fine-grained representations for each modality are important for image-text retrieval but almost neglected. As a result, such previous works inevitably suffer from low retrieval accuracy or heavy computational cost. In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework. This framework is consistent with human cognition, as humans simultaneously pay attention to the entire sample and regional elements to understand the semantic content. To this end, a Token-Guided Dual Transformer (TGDT) architecture which consists of two homogeneous branches for image and text modalities, respectively, is proposed for image-text retrieval. The TGDT incorporates both coarse- and fine-grained retrievals into a unified framework and beneficially leverages the advantages of both retrieval approaches. A novel training objective called Consistent Multimodal Contrastive (CMC) loss is proposed accordingly to ensure the intra- and inter-modal semantic consistencies between images and texts in the common embedding space. Equipped with a two-stage inference method based on the mixed global and local cross-modal similarity, the proposed method achieves state-of-the-art retrieval performances with extremely low inference time when compared with representative recent approaches. Code is publicly available: github.com/LCFractal/TGDT.
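
The two-stage inference idea (a cheap coarse ranking followed by re-ranking of a shortlist with a mixed similarity) can be sketched as below. This is only an illustration under assumptions: the dot-product global similarity, the linear mixing weight, the shortlist size, and the `local_sim` callback are all hypothetical and are not taken from the TGDT code.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product used here as the global (coarse) similarity."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_retrieval(query_global: Sequence[float],
                        cand_globals: List[Sequence[float]],
                        local_sim: Callable[[int], float],
                        top_k: int,
                        weight: float = 0.5) -> List[int]:
    """Stage 1: rank all candidates by global similarity and keep the top_k.
    Stage 2: re-rank only the shortlist by a mix of global and local similarity.

    local_sim(i) stands in for the expensive fine-grained similarity of
    candidate i and is evaluated only for shortlisted candidates."""
    coarse = [dot(query_global, g) for g in cand_globals]
    shortlist = sorted(range(len(cand_globals)),
                       key=lambda i: coarse[i], reverse=True)[:top_k]
    mixed = {i: (1.0 - weight) * coarse[i] + weight * local_sim(i)
             for i in shortlist}
    return sorted(shortlist, key=lambda i: mixed[i], reverse=True)


if __name__ == "__main__":
    candidates = [[0.9, 0.0], [0.8, 0.0], [0.1, 0.0]]
    fine_scores = {0: 0.1, 1: 0.9, 2: 1.0}
    print(two_stage_retrieval([1.0, 0.0], candidates, fine_scores.__getitem__, top_k=2))
```

Because the fine-grained similarity is computed for only `top_k` candidates, the cost of the second stage does not grow with the size of the gallery.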

Abstract:
Single-view 3D object reconstruction is a fundamental and challenging computer vision task that aims at recovering 3D shapes from single-view RGB images. Most existing deep learning based reconstruction methods are trained and evaluated on the same categories, and they cannot work well when handling objects from novel categories that are not seen during training. Focusing on this issue, this paper tackles Single-view 3D Mesh Reconstruction, to study the model generalization on unseen categories and encourage models to reconstruct objects literally. Specifically, we propose an end-to-end two-stage network, GenMesh, to break the category boundaries in reconstruction. Firstly, we factorize the complicated image-to-mesh mapping into two simpler mappings, i.e., image-to-point mapping and point-to-mesh mapping, while the latter is mainly a geometric problem and less dependent on object categories. Secondly, we devise a local feature sampling strategy in 2D and 3D feature spaces to capture the local geometry shared across objects to enhance model generalization. Thirdly, apart from the traditional point-to-point supervision, we introduce a multi-view silhouette loss to supervise the surface generation process, which provides additional regularization and further relieves the overfitting problem. The experimental results show that our method significantly outperforms the existing works on the ShapeNet and Pix3D under different scenarios and various metrics, especially for novel objects.

Abstract:
We are concerned with retrieving a query person from multiple videos captured by a non-overlapping camera network. Existing methods often rely on purely visual matching or consider temporal constraints but ignore the spatial information of the camera network. To address this issue, we propose a pedestrian retrieval framework based on cross-camera trajectory generation that integrates both temporal and spatial information. To obtain pedestrian trajectories, we propose a novel cross-camera spatio-temporal model that integrates pedestrians’ walking habits and the path layout between cameras to form a joint probability distribution. Such a cross-camera spatio-temporal model can be specified using sparsely sampled pedestrian data. Based on the spatio-temporal model, cross-camera trajectories can be extracted by the conditional random field model and further optimised by restricted non-negative matrix factorization. Finally, a trajectory re-ranking technique is proposed to improve the pedestrian retrieval results. To verify the effectiveness of our method, we construct the first cross-camera pedestrian trajectory dataset, the Person Trajectory Dataset, in real surveillance scenarios. Extensive experiments verify the effectiveness and robustness of the proposed method.

Abstract:
Graph embedding aims at learning vertex representations in a low-dimensional space by distilling information from a complex-structured graph. Recent efforts in graph embedding have been devoted to generalizing the representations from the trained graph in a source domain to the new graph in a different target domain based on information transfer. However, when the graphs are contaminated by unpredictable and complex noise in practice, this transfer problem is quite challenging because of the need to extract helpful knowledge from the source graph and to reliably transfer knowledge to the target graph. This paper puts forward a two-step correntropy-induced Wasserstein GCN (graph convolutional network, or CW-GCN for short) architecture to facilitate the robustness in cross-graph embedding. In the first step, CW-GCN originally investigates correntropy-induced loss in GCN, which places bounded and smooth losses on the noisy nodes with incorrect edges or attributes. Consequently, helpful information is extracted only from clean nodes in the source graph. In the second step, a novel Wasserstein distance is introduced to measure the difference in marginal distributions between graphs, avoiding the negative influence of noise. Afterwards, CW-GCN maps the target graph to the same embedding space as the source graph by minimizing the Wasserstein distance, and thus the knowledge preserved in the first step is expected to be reliably transferred to assist the target graph analysis tasks. Extensive experiments demonstrate the significant superiority of CW-GCN over state-of-the-art methods in different noisy environments.
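
To make the "bounded and smooth" property concrete, the sketch below uses the commonly cited Gaussian-kernel form of the correntropy-induced loss, 1 - exp(-e^2 / (2 sigma^2)), applied to a scalar residual e. The exact loss used in CW-GCN may differ; the function name and the default kernel width sigma are assumptions.

```python
import math


def correntropy_loss(residual: float, sigma: float = 1.0) -> float:
    """Gaussian-kernel correntropy-induced loss on a scalar residual.

    Behaves like a scaled squared error for small residuals and saturates
    at 1 for large ones, so grossly wrong (noisy) samples have bounded effect."""
    return 1.0 - math.exp(-(residual ** 2) / (2.0 * sigma ** 2))


if __name__ == "__main__":
    for e in (0.0, 0.5, 1.0, 5.0, 100.0):
        # Squared error grows without bound; the correntropy loss does not.
        print(e, e ** 2, correntropy_loss(e))
```

A node whose residual is very large contributes at most 1 to the objective, which is why such a loss limits the influence of nodes with incorrect edges or attributes.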

Abstract:
The openness of application scenarios and the difficulties of data collection make it impossible to prepare all kinds of expressions for training. Hence, detecting expressions absent during training (called alien expressions) is important to enhance the robustness of the recognition system. In this paper, we propose a facial expression recognition (FER) model, named OneExpressNet, to quantify the probability that a test expression sample belongs to the distribution of training data. The proposed model is based on a variational auto-encoder and enjoys several merits. First, different from the conventional one-class classification protocol, OneExpressNet transfers the useful knowledge from the related domain as a constraint condition of the target distribution. By doing so, OneExpressNet will pay more attention to the descriptive region for FER. Second, features from both source and target tasks will aggregate after constructing a skip connection between the encoder and decoder. Finally, to further separate alien expressions from training expressions, an empirical compact variation loss is jointly optimized, so that training expressions will concentrate on the compact manifold of feature space. The experimental results show that our method can achieve state-of-the-art results in one-class facial expression recognition on small-scale lab-controlled datasets including CFEE and KDEF, and large-scale in-the-wild datasets including RAF-DB and ExpW.

Abstract:
Transformer, the model of choice for natural language processing, has drawn scant attention from the medical imaging community. Given the ability to exploit long-term dependencies, transformers are promising to help atypical convolutional neural networks to learn more contextualized visual representations. However, most recently proposed transformer-based segmentation approaches simply treat transformers as auxiliary modules that help encode global context into convolutional representations. To address this issue, we introduce nnFormer (i.e., not-another transFormer), a 3D transformer for volumetric medical image segmentation. nnFormer not only exploits the combination of interleaved convolution and self-attention operations, but also introduces local and global volume-based self-attention mechanisms to learn volume representations. Moreover, nnFormer proposes to use skip attention to replace the traditional concatenation/summation operations in skip connections in U-Net-like architectures. Experiments show that nnFormer significantly outperforms previous transformer-based counterparts by large margins on three public datasets. Compared to nnUNet, the most widely recognized convnet-based 3D medical segmentation model, nnFormer produces significantly lower HD95 and is much more computationally efficient. Furthermore, we show that nnFormer and nnUNet are highly complementary to each other in model ensembling. Codes and models of nnFormer are available at https://git.io/JSf3i.

Abstract:
Multi-view subspace clustering aims to integrate the complementary information contained in different views to facilitate data representation. Currently, low-rank representation (LRR) serves as a benchmark method. However, we observe that these LRR-based methods would suffer from two issues: limited clustering performance and high computational cost since (1) they usually adopt the nuclear norm with biased estimation to explore the low-rank structures; (2) the singular value decomposition of large-scale matrices is inevitably involved. Moreover, LRR may not achieve low-rank properties in both intra-views and inter-views simultaneously. To address the above issues, this paper proposes the Bi-nuclear tensor Schatten-p norm minimization for multi-view subspace clustering (BTMSC). Specifically, BTMSC constructs a third-order tensor from the view dimension to explore the high-order correlation and the subspace structures of multi-view features. The Bi-Nuclear Quasi-Norm (BiN) factorization form of the Schatten-p norm is utilized to factorize the third-order tensor as the product of two small-scale third-order tensors, which not only captures the low-rank property of the third-order tensor but also improves the computational efficiency. Finally, an efficient alternating optimization algorithm is designed to solve the BTMSC model. Extensive experiments with ten datasets of texts and images illustrate the performance superiority of the proposed BTMSC method over state-of-the-art methods.

Abstract:
Exemplar-based colorization is a challenging task, which attempts to add colors to the target grayscale image with the aid of a reference color image, so as to keep the target semantic content while adopting the reference color style. In order to achieve visually plausible chromatic results, it is important to sufficiently exploit the global color style and the semantic color information of the reference color image. However, existing methods are either clumsy in exploiting the semantic color information, or lack a dedicated fusion mechanism to decorate the target grayscale image with the reference semantic color information. Besides, these methods usually use a single-stage encoder-decoder architecture, which results in the loss of spatial details. To remedy these problems, we propose an effective exemplar colorization strategy based on a pyramid dual non-local attention network to exploit the long-range dependency as well as multi-scale correlation. Specifically, two symmetrical branches of pyramid non-local attention blocks are tailored to achieve alignments from the target feature to the reference feature and from the reference feature to the target feature, respectively. The bidirectional non-local fusion strategy is further applied to get a sufficient fusion feature that achieves full semantic consistency between multi-modal information. To train the network, we propose an unsupervised learning scheme, which employs hybrid supervision including the pseudo paired supervision from the reference color images and unpaired supervision from both the target grayscale and reference color images. Extensive experimental results are provided to demonstrate that our method achieves better photo-realistic colorization performance than the state-of-the-art methods.

Abstract:
In human pose estimation methods based on graph convolutional architectures, the human skeleton is usually modeled as an undirected graph whose nodes are body joints and edges are connections between neighboring joints. However, most of these methods tend to focus on learning relationships between body joints of the skeleton using first-order neighbors, ignoring higher-order neighbors and hence limiting their ability to exploit relationships between distant joints. In this paper, we introduce a higher-order regular splitting graph network (RS-Net) for 2D-to-3D human pose estimation using matrix splitting in conjunction with weight and adjacency modulation. The core idea is to capture long-range dependencies between body joints using multi-hop neighborhoods and also to learn different modulation vectors for different body joints as well as a modulation matrix added to the adjacency matrix associated with the skeleton. This learnable modulation matrix helps adjust the graph structure by adding extra graph edges in an effort to learn additional connections between body joints. Instead of using a shared weight matrix for all neighboring body joints, the proposed RS-Net model applies weight unsharing before aggregating the feature vectors associated with the joints in order to capture the different relations between them. Experiments and ablation studies performed on two benchmark datasets demonstrate the effectiveness of our model, achieving superior performance over recent state-of-the-art methods for 3D human pose estimation.
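
The notion of a multi-hop neighborhood mentioned above can be illustrated with a small sketch. This is not the RS-Net layer itself (no matrix splitting, weight modulation, or learned adjacency modulation is shown); it only computes which joints lie within k hops of each joint on a skeleton graph. The function name `k_hop_neighbors` and the five-joint `TOY_SKELETON` are hypothetical and chosen for illustration.

```python
from collections import deque


def k_hop_neighbors(edges, num_joints, k):
    """Return, for each joint, the set of other joints reachable within k hops."""
    # Build an undirected adjacency list from the skeleton edges.
    adj = [set() for _ in range(num_joints)]
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    result = []
    for start in range(num_joints):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if dist[node] == k:
                continue  # do not expand beyond k hops
            for nxt in adj[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        result.append({j for j in dist if j != start})
    return result


# Hypothetical 5-joint chain: hip(0) - spine(1) - neck(2) - shoulder(3) - elbow(4)
TOY_SKELETON = [(0, 1), (1, 2), (2, 3), (3, 4)]
```

With k = 1 the hip only sees the spine; with k = 2 it also sees the neck, which is the kind of longer-range relation a first-order model would miss.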

Abstract:
In recent years, researchers have become more interested in hyperspectral image fusion (HIF) as a potential alternative to expensive high-resolution hyperspectral imaging systems. HIF aims to recover a high-resolution hyperspectral image (HR-HSI) from a low-resolution hyperspectral image (LR-HSI) and a high-spatial-resolution multispectral image (HR-MSI). Traditional model-based methods generally assume that the degeneration in both the spatial and spectral domains is known, and deep learning-based methods assume that paired HR-LR training data exist. However, such assumptions are often invalid in practice. Furthermore, most existing works, either introducing hand-crafted priors or treating HIF as a black-box problem, cannot take full advantage of the physical model. To address those issues, we propose a deep blind HIF method by unfolding model-based maximum a posteriori (MAP) estimation into a network implementation. Our method works with a Laplace distribution (LD) prior that does not need paired training data. Moreover, we have developed an observation module to directly learn degeneration in the spatial domain from LR-HSI data, addressing the challenge of spatially-varying degradation. We also propose to learn the uncertainty (mean and variance) of LD models using a novel Swin-Transformer-based denoiser and to estimate the variance of degraded images from residual errors (rather than treating them as global scalars). All parameters of the MAP estimation algorithm and the observation module can be jointly optimized through end-to-end training. Extensive experiments on both synthetic and real datasets show that the proposed method outperforms existing competing methods in terms of both objective evaluation indexes and visual qualities.

Abstract:
This paper presents a novel approach to multi-view graph learning that combines weight learning and graph learning in an alternating optimization framework. Multi-view graph learning refers to the problem of constructing a unified affinity graph using heterogeneous sources of data representation, which is a popular technique in many learning systems where no prior knowledge of data distribution is available. Our approach is based on a fusion-and-diffusion strategy, in which multiple affinity graphs are fused together via a weight learning scheme based on the unsupervised graph smoothness and utilised as a consensus prior to the diffusion. We propose a novel multi-view diffusion process that learns a manifold-aware affinity graph by propagating affinities on tensor product graphs, leveraging high-order contextual information to enhance pairwise affinities. In contrast to existing multi-view graph learning approaches, our approach is not limited by the quality of initial graphs or the assumption of a latent common subspace among multiple views. Instead, our approach is able to identify the consistency among views and fuse multiple graphs adaptively. We formulate both weight learning and diffusion-based affinity learning in a unified framework and propose an alternating optimization solver that is guaranteed to converge. The proposed approach is applied to image retrieval and clustering tasks on 16 real-world datasets. Extensive experimental results demonstrate that our approach outperforms state-of-the-art methods for both retrieval and clustering on 13 out of 16 datasets.

Abstract:
In this paper, we propose an efficient deep learning pipeline for light field acquisition using a back-to-back dual-fisheye camera. The proposed pipeline generates a light field from a sequence of 360° raw images captured by the dual-fisheye camera. It has three main components: a convolutional network (CNN) that enforces a spatiotemporal consistency constraint on the subviews of the 360° light field, an equirectangular matching cost that aims at increasing the accuracy of disparity estimation, and a light field resampling subnet that produces the 360° light field based on the disparity information. Ablation tests are conducted to analyze the performance of the proposed pipeline using the HCI light field datasets with five objective assessment metrics (MSE, MAE, PSNR, SSIM, and GMSD). We also use real data obtained from a commercially available dual-fisheye camera to quantitatively and qualitatively test the effectiveness, robustness, and quality of the proposed pipeline. Our contributions include: 1) a novel spatiotemporal consistency loss that enforces the subviews of the 360° light field to be consistent, 2) an equirectangular matching cost that combats severe projection distortion of fisheye images, and 3) a light field resampling subnet that retains the geometric structure of spherical subviews while enhancing the angular resolution of the light field.

Abstract:
Scribble-supervised semantic segmentation is an appealing weakly supervised technique with low labeling cost. Existing approaches mainly consider diffusing the labeled region of scribble by low-level feature similarity to narrow the supervision gap between scribble labels and mask labels. In this study, we observe an annotation bias between scribble and object mask, i.e., label workers tend to scribble on the spacious region instead of corners. This label preference makes the model learn well on frequently labeled regions but poorly on rarely labeled pixels. Therefore, we propose BLPSeg to balance the label preference for complete segmentation. Specifically, BLPSeg first predicts an annotation probability map to evaluate the rarity of labels on each image, then utilizes a novel BLP loss to balance the model training by up-weighting those rare annotations. Additionally, to further alleviate the impact of label preference, we design a local aggregation module (LAM) to propagate supervision from labeled to unlabeled regions in gradient backpropagation. We conduct extensive experiments to illustrate the effectiveness of our BLPSeg. Our single-stage method even outperforms other advanced multi-stage methods and achieves state-of-the-art performance.

Abstract:
Due to the prohibitive cost as well as technical challenges in annotating ground-truth optical flow for large-scale realistic video datasets, the existing deep learning models for optical flow estimation mostly rely on synthetic data for training, which in turn may lead to significant performance degradation under test-data distribution shift in real-world environments. In this work, we propose a methodology to tackle this important problem. We design a self-supervised learning task for adjusting the optical flow estimation model at test time. We exploit the fact that most videos are stored in compressed formats, from which compact information on motion, in the form of motion vectors and residuals, can be made readily available. We formulate the self-supervised task as motion vector prediction, and link this task to optical flow estimation. To the best of our knowledge, our Test-Time Adaption guided with Motion Vectors (TTA-MV) is the first work to perform such adaptation for optical flow. The experimental results demonstrate that TTA-MV can improve the generalization capability of various well-known deep learning methods for optical flow estimation, such as FlowNet, PWCNet, and RAFT.

Abstract:
In recent years, implicit neural representations (INR) have shown great potential to solve many computer graphics and computer vision problems. With this technique, signals such as 2D images or 3D shapes can be fit by training multi-layer perceptrons (MLP) on continuous functions, providing many advantages over conventional discrete representations. Despite being considered a promising approach to 2D image encoding and compression, the application of INR to image collections remains a challenge, since the number of parameters needed grows rapidly with the number of images. In this paper, we propose a fully implicit approach to INR which drastically reduces the size of MLP models in multiple image representation tasks. We introduce the concept of implicit coordinate encoder (ICE) and show it can be used to scale INR with the image number; specifically, by learning a common feature space between images. Furthermore, we show that our method is valid not only for image collections but also for large (gigapixel) images by applying a “divide-and-conquer” strategy. We propose an auto-encoder deep neural network architecture, with a single ICE (encoder) and multiple MLPs (decoders), which are jointly trained following a multi-task learning strategy. We demonstrate the benefits coming from ICE when it is implemented as a one-dimensional convolutional encoder, including a better performance of the downstream MLP models with an order of magnitude fewer parameters. Our method is the first one to make use of convolutional blocks in INR networks, unlike the conventional approach of using MLP architectures only. We show the benefits of ICE in two experimental scenarios: a collection of twenty-four small (768×512) images (Kodak dataset), and a single large (3072×3072) image (dwarf planet Pluto), achieving better quality than previous fully-implicit methods, using up to 50% fewer parameters.

Abstract:
Video frame interpolation (VFI) aims to generate predictive frames by motion-warping from bidirectional references. Most VFI methods utilize spatiotemporal semantic information to realize motion estimation and interpolation. However, due to variable acceleration, irregular movement trajectories, and camera movement in real-world cases, these methods are insufficient for non-linear middle frame estimation. In this paper, we reformulate VFI as a joint non-linear motion regression (JNMR) strategy to model the complicated inter-frame motions. Specifically, the motion trajectory between the target frame and multiple reference frames is regressed by a temporal concatenation of multi-stage quadratic models. Then, a comprehensive joint distribution is constructed to connect all temporal motions. Moreover, to preserve more contextual details for joint regression, the feature learning network is devised to explore clarified feature expressions with dense skip-connection. Later, a coarse-to-fine synthesis enhancement module is utilized to learn visual dynamics at different resolutions with multi-scale textures. The experimental VFI results show the effectiveness and significant improvement of joint motion regression over the state-of-the-art methods. The code is available at https://github.com/ruhig6/JNMR.

Abstract:
Concepts, a collective term for meaningful words that correspond to objects, actions, and attributes, can act as an intermediary for video captioning. While many efforts have been made to augment video captioning with concepts, most methods suffer from limited precision of concept detection and insufficient utilization of concepts, which could provide caption generation with inaccurate and inadequate prior information. Considering these issues, we propose a Concept-awARE video captioning framework (CARE) to facilitate plausible caption generation. Based on the encoder-decoder structure, CARE detects concepts precisely via multimodal-driven concept detection (MCD) and offers sufficient prior information to caption generation by global-local semantic guidance (G-LSG). Specifically, we implement MCD by leveraging video-to-text retrieval and the multimedia nature of videos. To achieve G-LSG, given the concept probabilities predicted by MCD, we weight and aggregate concepts to mine the video’s latent topic to affect decoding globally and devise a simple yet efficient hybrid attention module to exploit concepts and video content to impact decoding locally. Finally, to develop CARE, we emphasize the knowledge transfer of a contrastive vision-language pre-trained model (i.e., CLIP) in terms of visual understanding and video-to-text retrieval. With the multi-role CLIP, CARE can outperform strong CLIP-based video captioning baselines at an affordable cost in extra parameters and inference latency. Extensive experiments on MSVD, MSR-VTT, and VATEX datasets demonstrate the versatility of our approach for different encoder-decoder networks and the superiority of CARE against state-of-the-art methods. Our code is available at https://github.com/yangbang18/CARE.

Affiliations: State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China; School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen Campus, Shenzhen, China; School of Engineering, Westlake University, Hangzhou, China; Department of Automation, State Key Laboratory of Intelligent Technologies and Systems, Beijing Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China

Abstract:
Human parsing aims to segment each pixel of the human image with fine-grained semantic categories. However, current human parsers trained with clean data are easily confused by numerous image corruptions such as blur and noise. To improve the robustness of human parsers, in this paper, we construct three corruption robustness benchmarks, termed LIP-C, ATR-C, and Pascal-Person-Part-C, to assist us in evaluating the risk tolerance of human parsing models. Inspired by the data augmentation strategy, we propose a novel heterogeneous augmentation-enhanced mechanism to bolster robustness under commonly corrupted conditions. Specifically, two types of data augmentations from different views, i.e., image-aware augmentation and model-aware image-to-image transformation, are integrated in a sequential manner for adapting to unforeseen image corruptions. The image-aware augmentation enriches the diversity of training images with the help of common image operations. The model-aware augmentation strategy improves the diversity of input data by considering the model’s randomness. The proposed method is model-agnostic, and it can plug and play into arbitrary state-of-the-art human parsing frameworks. The experimental results show that the proposed method generalizes well: it improves the robustness of human parsing models, and even of semantic segmentation models, when facing various common image corruptions. Meanwhile, it still obtains comparable performance on clean data.

Abstract:
Context modeling or multi-level feature fusion methods have been proved to be effective in improving semantic segmentation performance. However, they are not specialized to deal with the problems of pixel-context mismatch and spatial feature misalignment, and the high computational complexity hinders their widespread application in real-time scenarios. In this work, we propose a lightweight Context and Spatial Feature Calibration Network (CSFCN) to address the above issues with pooling-based and sampling-based attention mechanisms. CSFCN contains two core modules: Context Feature Calibration (CFC) module and Spatial Feature Calibration (SFC) module. CFC adopts a cascaded pyramid pooling module to efficiently capture nested contexts, and then aggregates private contexts for each pixel based on pixel-context similarity to realize context feature calibration. SFC splits features into multiple groups of sub-features along the channel dimension and propagates sub-features therein by the learnable sampling to achieve spatial feature calibration. Extensive experiments on the Cityscapes and CamVid datasets illustrate that our method achieves a state-of-the-art trade-off between speed and accuracy. Concretely, our method achieves 78.7% mIoU with 70.0 FPS and 77.8% mIoU with 179.2 FPS on the Cityscapes and CamVid test sets, respectively. The code is available at https://nave.vr3i.com/ and https://github.com/kaigelee/CSFCN.

Abstract:
To achieve efficient inference with a hardware-friendly design, Adder Neural Networks (ANNs) are proposed to replace expensive multiplication operations in Convolutional Neural Networks (CNNs) with cheap additions through utilizing the $\ell_1$-norm for similarity measurement instead of cosine distance. However, we observe that there exists an increasing gap between CNNs and ANNs as the number of parameters is reduced, which cannot be eliminated by existing algorithms. In this paper, we present a simple yet effective Norm-Guided Distillation (NGD) method for $\ell_1$-norm ANNs to learn superior performance from $\ell_2$-norm ANNs. Although CNNs achieve similar accuracy to $\ell_2$-norm ANNs, the clustering performance based on $\ell_2$-distance can be learned more easily by $\ell_1$-norm ANNs than the cross correlation in CNNs. The features in $\ell_2$-norm ANNs are encouraged to achieve intra-class centralization and inter-class decentralization to amplify this advantage. Furthermore, the roughly estimated gradients in vanilla ANNs are modified to a progressive approximation from $\ell_2$-norm to $\ell_1$-norm so that a more accurate optimization can be achieved. Extensive evaluations on several benchmarks demonstrate the effectiveness of NGD on lightweight networks. For example, our method improves ANN by 10.43% with 0.25× GhostNet on CIFAR-100 and 3.1% with 1.0× GhostNet on ImageNet.
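
The difference between the three similarity measures discussed above can be made concrete for a single flattened patch and filter. This is a minimal sketch, not the NGD method: it only shows how one output value would be computed under each measure. The function names are hypothetical, and taking the negative squared distance as the $\ell_2$ response is an assumption made here for illustration.

```python
def cross_correlation(patch, kernel):
    """CNN-style response: sum of element-wise products (needs multiplications)."""
    return sum(p * k for p, k in zip(patch, kernel))


def adder_l1(patch, kernel):
    """Adder-style response: negative l1 distance (only additions/subtractions)."""
    return -sum(abs(p - k) for p, k in zip(patch, kernel))


def adder_l2(patch, kernel):
    """Assumed l2 variant: negative squared l2 distance between patch and kernel."""
    return -sum((p - k) ** 2 for p, k in zip(patch, kernel))
```

Both distance-based responses peak at 0 when the patch equals the kernel and become more negative as they diverge, whereas the cross-correlation response grows with the magnitude of the inputs.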

Abstract:
Recently, feature relation learning has attracted extensive attention in cross-spectral image patch matching. However, most feature relation learning methods can only extract shallow feature relations and are accompanied by the loss of useful discriminative features or the introduction of disturbing features. Although the latest multi-branch feature difference learning network can extract useful discriminative features fairly thoroughly, the multi-branch network structure it adopts has a large number of parameters. Therefore, we propose a novel two-branch feature interaction learning network (FIL-Net). Specifically, a novel feature interaction learning idea for cross-spectral image patch matching is proposed, and a new feature interaction learning module is constructed, which can effectively mine common and private features between cross-spectral image patches, and extract richer and deeper feature relations with invariance and discriminability. At the same time, we re-explore the feature extraction network for the cross-spectral image patch matching task, and a new two-branch residual feature extraction network with stronger feature extraction capabilities is constructed. In addition, we propose a new multi-loss strong-constrained optimization strategy, which can facilitate reasonable network optimization and efficient extraction of invariant and discriminative features. Furthermore, a public VIS-LWIR patch dataset and a public SEN1-2 patch dataset are constructed, and the corresponding experimental benchmarks are established, which facilitates future research and alleviates the scarcity of existing cross-spectral image patch matching datasets. Extensive experiments show that the proposed FIL-Net achieves state-of-the-art performance in three different cross-spectral image patch matching scenarios.

Abstract:
Existing salient object detection methods often adopt deeper and wider networks for better performance, resulting in heavy computational burden and slow inference speed. This inspires us to rethink saliency detection to achieve a favorable balance between efficiency and accuracy. To this end, we design a lightweight framework that maintains competitive accuracy. Specifically, we propose a novel trilateral decoder framework by decoupling the U-shape structure into three complementary branches, which are devised to confront the dilution of semantic context, loss of spatial structure and absence of boundary detail, respectively. Along with the fusion of three branches, the coarse segmentation results are gradually refined in structure details and boundary quality. Without adding learnable parameters, we further propose a Scale-Adaptive Pooling Module to obtain multi-scale receptive fields. In particular, on the premise of inheriting this framework, we rethink the relationship among accuracy, parameters and speed via network depth-width tradeoff. With these considerations, we design shallower and narrower models to explore the maximum potential of lightweight SOD. Our models are proposed for different application environments: 1) a tiny version CTD-S (1.7M, 125FPS) for resource constrained devices, 2) a fast version CTD-M (12.6M, 158FPS) for speed-demanding scenarios, 3) a standard version CTD-L (26.5M, 84FPS) for high-performance platforms. Extensive experiments validate the superiority of our method, which achieves better efficiency-accuracy balance across five benchmarks.

Abstract:
The synthesis of a high-resolution (HR) hyperspectral image (HSI) by fusing a low-resolution HSI with a corresponding HR multispectral image has emerged as a prevalent HSI super-resolution (HSR) scheme. Recent research has revealed that tensor analysis is an emerging tool for HSR. However, most off-the-shelf tensor-based HSR algorithms tend to encounter challenges in rank determination and modeling capacity. To address these issues, we construct nonlocal patch tensors (NPTs) and characterize low-rank structures with coupled Bayesian tensor factorization. It is worth emphasizing that the intrinsic global spectral correlation and nonlocal spatial similarity can be simultaneously explored under the proposed model. Moreover, benefiting from the technique of automatic relevance determination, we propose a hierarchical probabilistic framework based on Canonical Polyadic (CP) factorization, which incorporates a sparsity-inducing prior over the underlying factor matrices. We further develop an effective expectation-maximization-type optimization scheme for framework estimation. In contrast to existing works, the proposed model can infer the latent CP rank of NPT adaptively without tuning parameters. Extensive experiments on synthesized and real datasets illustrate the intrinsic capability of our model in rank determination as well as its superiority in fusion performance.

Abstract:
The optical flow guidance strategy is ideal for obtaining motion information of objects in the video. It is widely utilized in video segmentation tasks. However, existing optical flow-based methods have a significant dependency on optical flow, which results in poor performance when the optical flow estimation fails for a particular scene. The temporal consistency provided by the optical flow could be effectively supplemented by modeling in a structural form. This paper proposes a new hierarchical graph neural network (GNN) architecture, dubbed hierarchical graph pattern understanding (HGPU), for zero-shot video object segmentation (ZS-VOS). Inspired by the strong ability of GNNs in capturing structural relations, HGPU innovatively leverages motion cues (i.e., optical flow) to enhance the high-order representations from the neighbors of target frames. Specifically, a hierarchical graph pattern encoder with message aggregation is introduced to acquire different levels of motion and appearance features in a sequential manner. Furthermore, a decoder is designed for hierarchically parsing and understanding the transformed multi-modal contexts to achieve more accurate and robust results. HGPU achieves state-of-the-art performance on four publicly available benchmarks (DAVIS-16, YouTube-Objects, Long-Videos and DAVIS-17). Code and pre-trained model can be found at https://github.com/NUST-Machine-Intelligence-Laboratory/HGPU.

Abstract:
Semi-supervised video object segmentation (Semi-VOS), which requires only annotating the first frame of a video to segment future frames, has received increased attention recently. Among existing Semi-VOS pipelines, the memory-matching-based one is becoming the main research stream, as it can fully utilize the temporal sequence information to obtain high-quality segmentation results. Even though this type of method has achieved promising performance, the overall framework still suffers from heavy computation overhead, mainly caused by the per-frame dense convolution operations between high-resolution feature maps and each kernel filter. Therefore, we propose a sparse baseline of VOS named SpVOS in this work, which develops a novel triple sparse convolution to reduce the computation costs of the overall VOS framework. The designed triple gate, taking full consideration of both spatial and temporal redundancy between adjacent video frames, adaptively makes a triple decision on how to apply the sparse convolution on each pixel to control the computation overhead of each layer, while maintaining sufficient discrimination capability to distinguish similar objects and avoid error accumulation. A mixed sparse training strategy, coupled with a designed objective considering the sparsity constraint, is also developed to balance the VOS segmentation performance and computation costs. Experiments are conducted on two mainstream VOS datasets, DAVIS and Youtube-VOS. Results show that the proposed SpVOS achieves superior performance over other state-of-the-art sparse methods, and even remains comparable to the typical non-sparse VOS baseline (82.88% for DAVIS-2017 and 80.36% for Youtube-VOS), e.g., with an 83.04% (79.29%) overall score on the DAVIS-2017 (Youtube-VOS) validation set, while saving up to 42% FLOPs, showing its application potential for resource-constrained scenarios.

Abstract:
Video frame interpolation (VFI) is a fundamental research topic in video processing, which is currently attracting increased attention across the research community. While the development of more advanced VFI algorithms has been extensively researched, there remains little understanding of how humans perceive the quality of interpolated content and how well existing objective quality assessment methods perform when measuring the perceived quality. In order to narrow this research gap, we have developed a new video quality database named BVI-VFI, which contains 540 distorted sequences generated by applying five commonly used VFI algorithms to 36 diverse source videos with various spatial resolutions and frame rates. We collected more than 10,800 quality ratings for these videos through a large scale subjective study involving 189 human subjects. Based on the collected subjective scores, we further analysed the influence of VFI algorithms and frame rates on the perceptual quality of interpolated videos. Moreover, we benchmarked the performance of 33 classic and state-of-the-art objective image/video quality metrics on the new database, and demonstrated the urgent requirement for more accurate bespoke quality assessment methods for VFI. To facilitate further research in this area, we have made BVI-VFI publicly available at https://github.com/danier97/BVI-VFI-database.

Abstract:
It is challenging to generate temporal action proposals from untrimmed videos. In general, boundary-based temporal action proposal generators are based on detecting temporal action boundaries, where a classifier is usually applied to evaluate the probability of each temporal action location. However, most existing approaches treat boundaries and contents separately, neglecting that the context of actions and the temporal locations complement each other, which results in incomplete modeling of boundaries and contents. In addition, temporal boundaries are often located by exploiting either local clues or global information, without mining local temporal information and temporal-to-temporal relations sufficiently at different levels. Facing these challenges, a novel approach named multi-level content-aware boundary detection (MCBD) is proposed to generate temporal action proposals from videos, which jointly models the boundaries and contents of actions and captures multi-level (i.e., frame level and proposal level) temporal and context information. Specifically, the proposed MCBD preliminarily mines rich frame-level features to generate one-dimensional probability sequences, and further exploits temporal-to-temporal proposal-level relations to produce two-dimensional probability maps. The final temporal action proposals are obtained by a fusion of the multi-level boundary and content probabilities, achieving precise boundaries and reliable confidence of proposals. Extensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and HACS, demonstrate the effectiveness of the proposed MCBD compared to state-of-the-art methods. The source code of this work can be found at https://mic.tongji.edu.cn.

Abstract:
Cross-component chroma prediction plays an important role in improving coding efficiency for H.266/VVC. We use the differences between reference samples and the predicted sample to design an attention model for chroma prediction, namely luma difference-based chroma prediction (LDCP). Specifically, the luma differences (LDs) between reference samples and the predicted sample are employed as the input of the attention model, which is designed as a softmax function to map LDs to chroma weights nonlinearly. Finally, a weighted chroma prediction is conducted based on the weights and chroma reference samples. To provide adaptive weights, the model parameter of the softmax function can be determined based on the template (T-LDCP) or offline learning (L-LDCP), respectively. Experimental results show that the T-LDCP achieves BD-rate reductions of 0.34%, 2.02%, and 2.34% for the Y, Cb, and Cr components, and the L-LDCP brings 0.32%, 2.06%, and 2.21% BD-rate savings for Y, Cb, and Cr components, respectively. The L-LDCP introduces slight encoding and decoding time increments, i.e., 2% and 1%, when integrated into the latest VVC test model version 18.0. Besides, the LDCP can be implemented by a pixel-level parallelization which is hardware-friendly.
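
The weighting step described above can be sketched for a single sample. This is an illustration under stated assumptions, not the LDCP implementation in the VVC test model: the function name `ldcp_predict`, the use of the absolute luma difference, the negative sign inside the softmax (so that closer luma values get larger weights), and the parameter name `beta` for the softmax model parameter are all hypothetical choices made here.

```python
import math


def ldcp_predict(luma_cur, ref_luma, ref_chroma, beta=0.1):
    """Predict one chroma sample as a softmax-weighted sum of reference chroma."""
    # Luma differences (LDs) between each reference sample and the current sample.
    # Assumption: a smaller absolute difference should yield a larger weight.
    scores = [-beta * abs(l - luma_cur) for l in ref_luma]

    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted chroma prediction from the reference chroma samples.
    return sum(w * c for w, c in zip(weights, ref_chroma))
```

When all reference luma values are equally close to the current luma, the prediction reduces to the plain average of the reference chroma; as `beta` grows, the prediction approaches the chroma of the reference whose luma is closest.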

Abstract:
The transferability of adversarial examples across different convolutional neural networks (CNNs) makes it feasible to perform black-box attacks, resulting in security threats for CNNs. However, few efforts have been made to investigate transferable attacks for vision transformers (ViTs), which achieve superior performance on various computer vision tasks. Unlike CNNs, ViTs establish relationships between patches extracted from inputs by the self-attention module. Thus, adversarial examples crafted on CNNs are unlikely to attack ViTs effectively. To assess the security of ViTs comprehensively, we investigate the transferability across different ViTs in both untargeted and targeted scenarios. More specifically, we propose a Pay No Attention (PNA) attack, which ignores attention gradients during backpropagation to improve the linearity of backpropagation. Additionally, we introduce a PatchOut/CubeOut attack for image/video ViTs. They optimize perturbations within a randomly selected subset of patches/cubes during each iteration, preventing over-fitting to the white-box surrogate ViT model. Furthermore, we maximize the L_2 norm of perturbations, ensuring that the generated adversarial examples deviate significantly from the benign ones. These strategies are designed to be compatible with one another. Combining them can enhance transferability by jointly considering patch-based inputs and the self-attention of ViTs. Moreover, the proposed combined attack seamlessly integrates with existing transferable attacks, providing an additional boost to transferability. We conduct experiments on ImageNet and Kinetics-400 for image and video ViTs, respectively. Experimental results demonstrate the effectiveness of the proposed method.
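
The PatchOut idea of updating only a random subset of patches per iteration can be sketched on a single-channel image stored as nested lists. This is a simplified illustration, not the authors' attack: the function names `patchout_mask` and `masked_step`, the sign-gradient style update, and the assumption that image height and width are divisible by the patch size are all hypothetical choices made here.

```python
import random


def patchout_mask(height, width, patch, num_keep, rng):
    """Build a 0/1 pixel mask that selects `num_keep` random patches."""
    rows, cols = height // patch, width // patch
    chosen = rng.sample(range(rows * cols), num_keep)
    mask = [[0] * width for _ in range(height)]
    for idx in chosen:
        r, c = divmod(idx, cols)
        for y in range(r * patch, (r + 1) * patch):
            for x in range(c * patch, (c + 1) * patch):
                mask[y][x] = 1
    return mask


def masked_step(delta, grad_sign, mask, step):
    """Update the perturbation only inside the selected patches."""
    return [
        [d + step * g * m for d, g, m in zip(d_row, g_row, m_row)]
        for d_row, g_row, m_row in zip(delta, grad_sign, mask)
    ]
```

In an iterative attack, a fresh mask would be drawn at every iteration, so different patches are perturbed over time while each individual step touches only part of the image.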

Abstract:
How to avoid biased predictions is an important and active research question in scene graph generation (SGG). Current state-of-the-art methods employ debiasing techniques such as resampling and causality analysis. However, the role of intrinsic cues in the features causing biased training has remained under-explored. In this paper, for the first time, we make the surprising observation that object identity information, in the form of object label embeddings (e.g. GLOVE), is principally responsible for biased predictions. We empirically observe that, even without any visual features, a number of recent SGG models can produce comparable or even better results solely from object label embeddings. Motivated by this insight, we propose to leverage a conditional variational auto-encoder to decouple the entangled visual features into two meaningful components: the object’s intrinsic identity features and the extrinsic, relation-dependent state feature. We further develop two compositional learning strategies on the relation and object levels to mitigate the data scarcity issue of rare relations. On the two benchmark datasets Visual Genome and GQA, we conduct extensive experiments on the three scenarios, i.e., conventional, few-shot and zero-shot SGG. Results consistently demonstrate that our proposed Decomposition and Composition (DeC) method effectively alleviates the biases in the relation prediction. Moreover, DeC is model-free, and it significantly improves the performance of recent SGG models, establishing new state-of-the-art performance.

Abstract:
JPEG, which was developed 30 years ago, is the most widely used image coding format, especially favored by resource-constrained devices, due to its simplicity and efficiency. With the evolution of the Internet and the popularity of mobile devices, a huge number of user-generated JPEG images are uploaded to social media sites like Facebook and Flickr or stored in personal computers or notebooks, which leads to an increase in storage cost. However, the performance of JPEG is far behind state-of-the-art coding methods. Therefore, the lossless recompression of JPEG images urgently needs to be studied, as it would further reduce the storage cost while maintaining the image fidelity. In this paper, a hybrid coding framework for the lossless recompression of JPEG images (LLJPEG) using transform domain intra prediction is proposed, including block partition and intra prediction, transform and quantization, and entropy coding. Specifically, in LLJPEG, intra prediction is first used to obtain a predicted block. Then the predicted block is transformed by DCT and then quantized to obtain the predicted coefficients. After that, the predicted coefficients are subtracted from the original coefficients to get the DCT coefficient residuals. Finally, the DCT residuals are entropy coded. In LLJPEG, some new coding tools are proposed for intra prediction and the entropy coding is redesigned. The experiments show that LLJPEG can reduce the storage space by 29.43% and 26.40% on the Kodak and DIV2K datasets respectively without any loss for JPEG images, while maintaining low decoding complexity.
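
The transform-domain residual step described above can be sketched as follows. This is a simplified illustration, not the LLJPEG codec: the predicted block is taken as an input (intra prediction itself is not modeled), entropy coding is omitted, and all function names are hypothetical. It only shows why subtracting quantized predicted coefficients from the original quantized coefficients is exactly reversible.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)
    return m

def quantized_dct(block, qtable):
    """Level-shift, 2-D DCT, and quantize a block to integer coefficients."""
    d = dct_matrix(block.shape[0])
    coeffs = d @ (block - 128.0) @ d.T
    return np.rint(coeffs / qtable).astype(np.int64)

def encode_residual(orig_coeffs, predicted_block, qtable):
    """Residual = original quantized coefficients - predicted coefficients."""
    return orig_coeffs - quantized_dct(predicted_block, qtable)

def decode_coeffs(residual, predicted_block, qtable):
    """The decoder repeats the same prediction and adds the residual back."""
    return residual + quantized_dct(predicted_block, qtable)
```

Because encoder and decoder compute the predicted coefficients with the same deterministic steps, the original quantized coefficients are recovered without loss; a good prediction simply makes the residuals small and cheaper to entropy code.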

Abstract:
Cross-modality face image synthesis such as sketch-to-photo, NIR-to-RGB, and RGB-to-depth has wide applications in face recognition, face animation, and digital entertainment. Conventional cross-modality synthesis methods usually require paired training data, i.e., each subject has images of both modalities. However, paired data can be difficult to acquire, while unpaired data commonly exist. In this paper, we propose a novel semi-supervised cross-modality synthesis method (namely CMOS-GAN), which can leverage both paired and unpaired face images to learn a robust cross-modality synthesis model. Specifically, CMOS-GAN uses a generator of encoder-decoder architecture for new modality synthesis. We leverage pixel-wise loss, adversarial loss, classification loss, and face feature loss to exploit the information from both paired multi-modality face images and unpaired face images for model learning. In addition, since we expect the synthetic new modality can also be helpful for improving face recognition accuracy, we further use a modified triplet loss to retain the discriminative features of the subject in the synthetic modality. Experiments on three cross-modality face synthesis tasks (NIR-to-VIS, RGB-to-depth, and sketch-to-photo) show the effectiveness of the proposed approach compared with the state-of-the-art. In addition, we also collect a large-scale RGB-D dataset (VIPL-MumoFace-3K) for the RGB-to-depth synthesis task. We plan to open-source our code and VIPL-MumoFace-3K dataset to the community (https://github.com/skgyu/CMOS-GAN).

Abstract:
There exist a variety of visual relationships among entities in an image. Given a relationship query ⟨subject, predicate, object⟩, the task of visual relationship referring (VRR) aims to disambiguate instances of the same entity category and simultaneously localize the subject and object entities in an image. Previous works on VRR can be generally categorized into one-stage and multi-stage methods. The former directly localize a pair of entities from the image but suffer from low prediction accuracy, while the latter perform better but localize only a couple of entities indirectly, by pre-generating a large number of candidate proposals. In this paper, we formulate the task of VRR as an end-to-end bounding box regression problem and propose a novel one-stage approach, called VRR-TAMP, by effectively integrating Transformers and an adaptive message passing mechanism. First, visual relationship queries and images are respectively encoded to generate the basic modality-specific embeddings, which are then fed into a cross-modal Transformer encoder to produce the joint representation. Second, to obtain the specific representation of each entity, we introduce an adaptive message passing mechanism and design an entity-specific information distiller SR-GMP, which refers to a gated message passing (GMP) module that works on the joint representation learned from a single learnable token. The GMP module adaptively distills the final representation of an entity by incorporating the contextual cues regarding the predicate and the other entity. Experiments on VRD and Visual Genome datasets demonstrate that our approach significantly outperforms its one-stage competitors and achieves competitive results with the state-of-the-art multi-stage methods.

Abstract:
Neuromorphic vision sensors, whose pixels output events/spikes asynchronously with a high temporal resolution according to the scene radiance change, are naturally appropriate for capturing high-speed motion in the scenes. However, how to utilize the events/spikes to smoothly track high-speed moving objects is still a challenging problem. Existing approaches either employ time-consuming iterative optimization, or require large amounts of labeled data to train the object detector. To this end, we propose a bio-inspired unsupervised learning framework, which takes advantage of the spatiotemporal information of events/spikes generated by neuromorphic vision sensors to capture the intrinsic motion patterns. Without off-line training, our models can filter the redundant signals with a dynamic adaptation module based on short-term plasticity, and extract the motion patterns with a motion estimation module based on spike-timing-dependent plasticity. Combined with the spatiotemporal and motion information of the filtered spike stream, the traditional DBSCAN clustering algorithm and Kalman filter can effectively track multiple targets in extreme scenes. We evaluate the proposed unsupervised framework for object detection and tracking tasks on synthetic data, publicly available event-based datasets, and spiking camera datasets. The experimental results show that the proposed model can robustly detect and smoothly track the moving targets in various challenging scenarios and outperforms state-of-the-art approaches.

Abstract:
Structured light 3D imaging is often used for obtaining accurate 3D information via phase retrieval. Single-pattern structured light 3D imaging is much faster than multi-pattern versions. Current phase retrieval methods for single-pattern structured light 3D imaging are however not accurate enough. Besides, the projector resolution in a structured light 3D imaging system is expensive to improve due to hardware costs. To address the issues of low accuracy and low resolution of single-pattern structured light 3D imaging, this work proposes a super-resolution phase retrieval network (SRPRNet). Specifically, a phase-shifting module is proposed to extract multi-scale features with different phase shifts, and a refinement and super-resolution module is proposed to obtain refined and super-resolution phase components. After phase demodulation and unwrapping, high-resolution absolute phase is obtained. A sine shifting loss and a cosine shifting loss are also introduced to form the regularization term of the loss function. As far as can be ascertained, the proposed SRPRNet is the first network for super-resolution phase retrieval by using a single pattern, and it can also be used for standard-resolution phase retrieval. Experimental results on three datasets show that SRPRNet achieves state-of-the-art performance on 1×, 2×, and 4× super-resolution phase retrieval tasks.

Abstract:
Inspired by our observation that numerous objects in remote sensing imagery are extremely consistent in geometric characteristics (e.g., object sizes/angles/layouts), in this work, we propose a novel Progressive Context-dependent Inference (PCI) method to make full use of large-scope contextual cues for better localizing objects in remote sensing imagery. Specifically, to represent candidate objects and their geometric distributions, we build all of them into candidate object graphs, and subsequently perform inference learning by diffusing contextual object information. To make the inference more credible, we progressively accumulate these historical learning experiences on both label prediction and location regression processes into the next stage of network evolution, where topology structures and attributes of candidate object graphs are dynamically updated. The graph update and ground object detection are jointly encapsulated as a closed-loop learning process. Thus, the problem of multi-object localization is converted into a progressive construction of dynamic graphs. Extensive experiments on three public datasets demonstrate the superiority of our proposed method over other state-of-the-art methods for ground object detection in remote sensing imagery.

Abstract:
Sparsity is an attractive property that has been widely and intensively utilized in various image processing fields (e.g., robust image representation, image compression, and image analysis). Its success is owed to the exhaustive mining of the intrinsic (or homogeneous) information from the whole data carrying redundant information. From the perspective of image representation, sparsity can successfully find an underlying homogeneous subspace from a collection of training data to represent a given test sample. The well-known sparse representation (SR) and its variants embed sparsity by representing the test sample using a linear combination of training samples with L_0-norm regularization and L_1-norm regularization. However, although these state-of-the-art methods achieve powerful and robust performances, sparsity is not fully exploited in image representation in the following three aspects: 1) the within-sample sparsity, 2) the between-sample sparsity, and 3) the image structural sparsity. In this paper, to make the above-mentioned multi-context sparsity properties agree and be learned simultaneously in one model, we propose the concept of consensus sparsity (Con-sparsity) and correspondingly build a multi-context sparse image representation (MCSIR) framework to realize this. We theoretically prove that the consensus sparsity can be achieved by the L_\infty-induced matrix variate based on the Bayesian inference. Extensive experiments and comparisons with the state-of-the-art methods (including deep learning) are performed to demonstrate the promising performance and property of the proposed consensus sparsity.

Abstract:
Referring Expression Comprehension (REC) is an important task in the vision-and-language community, since it is an essential step for many cross-modal tasks such as VQA, image retrieval and image captioning. To obtain a better trade-off between speed and accuracy, existing research usually follows a one-stage paradigm, where this task can be considered as a language-conditioned object detection task. Meanwhile, previous one-stage REC frameworks provide many different research perspectives, such as the strategies of fusion, the stage of fusion and the design of the detection head. Surprisingly, these works mostly ignore the value of integrating multi-level features and even apply only single-scale features to locate the target. In this paper, we focus on rethinking and improving feature pyramids for one-stage REC. Through experimental validation, we first show that although multi-scale fusion is an effective approach for improving performance, the mature neck structures from object detection (e.g., FPN, BFN and HRFPN) have a limited impact on this task. Further, we visualize the outputs of FPN and find the underlying reason is that these coarse-grained FPN fusion strategies suffer from a semantic ambiguity problem. Based on the above insights, we propose a new Language-Guided FPN (LG-FPN) method, which can dynamically allocate and select the fine-grained information by stacking language-gate and union-gate. A large number of contrastive and ablative experiments show that our LG-FPN is an effective and reliable module that can adapt to different visual backbones, fusion strategies and detection heads. Finally, our method achieves state-of-the-art performance on four referring expression datasets.

Abstract:
Depth maps generally suffer from large erroneous areas even in public RGB-Depth datasets. Existing learning-based depth recovery methods are limited by insufficient high-quality datasets, and optimization-based methods generally depend on local contexts and thus cannot effectively correct large erroneous areas. This paper develops an RGB-guided depth map recovery method based on the fully connected conditional random field (dense CRF) model to jointly utilize local and global contexts of depth maps and RGB images. A high-quality depth map is inferred by maximizing its probability conditioned upon a low-quality depth map and a reference RGB image based on the dense CRF model. The optimization function is composed of redesigned unary and pairwise components, which constrain the local structure and global structure of the depth map, respectively, with the guidance of the RGB image. In addition, the texture-copy artifacts problem is handled by two-stage dense CRF models in a coarse-to-fine way. A coarse depth map is first recovered by embedding the RGB image in a dense CRF model in units of 3×3 blocks. It is refined afterward by embedding the RGB image in another model in units of individual pixels and restricting the model to work mainly in discontinuous regions. Extensive experiments on six datasets verify that the proposed method considerably outperforms a dozen baseline methods in correcting erroneous areas and diminishing texture-copy artifacts of depth maps.

Abstract:
Occluded person re-identification (re-id) aims to match occluded person images to holistic ones. Most existing works focus on matching collective-visible body parts by discarding the occluded parts. However, only preserving the collective-visible body parts causes great semantic loss for occluded images, decreasing the confidence of feature matching. On the other hand, we observe that the holistic images can provide the missing semantic information for occluded images of the same identity. Thus, compensating the occluded image with its holistic counterpart has the potential for alleviating the above limitation. In this paper, we propose a novel Reasoning and Tuning Graph Attention Network (RTGAT), which learns complete person representations of occluded images by jointly reasoning the visibility of body parts and compensating the occluded parts for the semantic loss. Specifically, we self-mine the semantic correlation between part features and the global feature to reason the visibility scores of body parts. Then we introduce the visibility scores as the graph attention, which guides Graph Convolutional Network (GCN) to fuzzily suppress the noise of occluded part features and propagate the missing semantic information from the holistic image to the occluded image. We finally learn complete person representations of occluded images for effective feature matching. Experimental results on occluded benchmarks demonstrate the superiority of our method.

Abstract:
Crowd localization aims to predict the head position of each instance in crowd scenarios. Since pedestrians are at varying distances from the camera, there are tremendous gaps among the scales of instances within an image, which is called the intrinsic scale shift. Intrinsic scale shift is one of the most essential issues in crowd localization because it is ubiquitous in crowd scenes and makes the scale distribution chaotic. To this end, this paper concentrates on tackling the chaos of the scale distribution incurred by intrinsic scale shift. We propose Gaussian Mixture Scope (GMS) to regularize the chaotic scale distribution. Concretely, GMS utilizes a Gaussian mixture distribution to adapt to the scale distribution and decouples the mixture model into sub-normal distributions to regularize the chaos within the sub-distributions. Then, an alignment is introduced to regularize the chaos among sub-distributions. However, although GMS is effective in regularizing the data distribution, it amounts to dislodging the hard samples in the training set, which incurs overfitting. We attribute this to an obstruction in transferring the latent knowledge exploited by GMS from data to model. Therefore, a Scoped Teacher that acts as a bridge in knowledge transfer is proposed. Moreover, consistency regularization is introduced to implement the knowledge transfer. To that end, further constraints are deployed on the Scoped Teacher to derive feature consistency between the teacher and student ends. With the proposed GMS and Scoped Teacher implemented on four mainstream crowd localization datasets, extensive experiments demonstrate the superiority of our work. Moreover, compared with existing crowd locators, our work achieves state-of-the-art F1-measure on all four datasets.

Abstract:
With the popularity of mobile Internet, audio and video (A/V) have become the main way for people to entertain and socialize daily. However, in order to reduce the cost of media storage and transmission, A/V signals will be compressed by service providers before they are transmitted to end-users, which inevitably causes distortions in the A/V signals and degrades the end-user’s Quality of Experience (QoE). This motivates us to research the objective audio-visual quality assessment (AVQA). In the field of AVQA, most previous works only focus on single-mode audio or visual signals, which ignores that the perceptual quality of users depends on both audio and video signals. Therefore, we propose an objective AVQA architecture for multi-mode signals based on attentional neural networks. Specifically, we first utilize an attention prediction model to extract the salient regions of video frames. Then, a pre-trained convolutional neural network is used to extract short-time features of the salient regions and the corresponding audio signals. Next, the short-time features are fed into Gated Recurrent Unit (GRU) networks to model the temporal relationship between adjacent frames. Finally, the fully connected layers are utilized to fuse the temporal related features of A/V signals modeled by the GRU network into the final quality score. The proposed architecture is flexible and can be applied to both full-reference and no-reference AVQA. Experimental results on the LIVE-SJTU Database and UnB-AVC Database demonstrate that our model outperforms the state-of-the-art AVQA methods. The code of the proposed method will be publicly available to promote the development of the field of AVQA.

Abstract:
Weakly-supervised object detection (WSOD), which requires only image-level annotations for training detectors, has gained enormous attention. Despite recent rapid advance in WSOD, there remains a large performance gap compared with fully-supervised object detection. To narrow the performance gap, we study cross-supervised object detection (CSOD), where existing classes (base classes) have instance-level annotations while newly added classes (novel classes) only need image-level annotations. For improving localization accuracy, we propose a Cyclic Self-Training (CST) method to introduce instance-level supervision into a commonly used WSOD method, online instance classifier refinement (OICR). Our proposed CST consists of forward pseudo labeling and backward pseudo labeling. Specifically, OICR exploits the forward pseudo labeling to generate pseudo ground-truth bounding-boxes for all classes, thus enabling instance classifier training. Then, the backward pseudo labeling is designed to generate pseudo ground-truth bounding-boxes of higher quality for novel classes by fusing the predictions of the instance classifiers. As a result, both novel and base classes will have bounding-box annotations for training, alleviating the supervision inconsistency between base and novel classes. In the forward pseudo labeling, the generated pseudo ground-truths may be misaligned with objects and thus introduce poor-quality examples for training the ICs. To reduce the impacts of these poor-quality training examples, we propose a Proposal Weight Modulation (PWM) module learned in a class-agnostic and contrastive manner by exploiting bounding-box annotations of base classes. Experiments on PASCAL VOC and MS COCO datasets demonstrate the superiority of our proposed method.

Abstract:
Domain generalizable person re-identification (DG ReID) is a challenging problem, because the trained model is often not generalizable to unseen target domains with different distributions from the source training domains. Data augmentation has been verified to be beneficial for better exploiting the source data to improve the model generalization. However, existing approaches primarily rely on pixel-level image generation that requires designing and training an extra generation network, which is extremely complex and provides limited diversity of augmented data. In this paper, we propose a simple yet effective feature-based augmentation technique, named Style-uncertainty Augmentation (SuA). The main idea of SuA is to randomize the style of training data by perturbing the instance style with Gaussian noise during the training process to increase the training domain diversity. To better generalize knowledge across these augmented domains, we propose a progressive learning-to-learn strategy named Self-paced Meta Learning (SpML) that extends the conventional one-stage meta learning to a multi-stage training process. The rationale is to gradually improve the model generalization ability to unseen target domains by simulating the mechanism of human learning. Furthermore, conventional person Re-ID loss functions are unable to leverage the valuable domain information to improve the model generalization. We therefore further propose a distance-graph alignment loss that aligns the feature relationship distribution among domains to facilitate the network to explore domain-invariant representations of images. Extensive experiments on four large-scale benchmarks demonstrate that our SuA-SpML achieves state-of-the-art generalization to unseen domains for person ReID.
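
The style perturbation behind SuA can be sketched as follows. This is a rough illustration under assumptions: "instance style" is taken to mean the per-instance, per-channel mean and standard deviation of a feature map, and the noise scale `strength` and the function name are hypothetical; the paper's exact noise model is not given in the abstract.

```python
import numpy as np

def style_uncertainty_augment(feat, strength, rng, eps=1e-6):
    """Perturb per-instance channel statistics of features shaped (N, C, H, W)."""
    # Instance style: channel-wise mean and standard deviation per sample.
    mu = feat.mean(axis=(2, 3), keepdims=True)
    sigma = feat.std(axis=(2, 3), keepdims=True) + eps
    normalized = (feat - mu) / sigma
    # Assumed noise model: additive Gaussian noise on both statistics.
    new_mu = mu + strength * rng.standard_normal(mu.shape)
    new_sigma = sigma + strength * rng.standard_normal(sigma.shape)
    # Re-apply the perturbed style to the normalized content.
    return normalized * new_sigma + new_mu
```

With `strength=0.0` the features pass through unchanged, which makes it easy to disable the augmentation at inference time.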

Abstract:
The supervised one-shot multi-object tracking (MOT) algorithms have achieved satisfactory performance benefiting from a large amount of labeled data. However, in real applications, acquiring plenty of laborious manual annotations is not practical. It is necessary to adapt the one-shot MOT model trained on a labeled domain to an unlabeled domain, yet such domain adaptation is a challenging problem. The main reason is that it has to detect and associate multiple moving objects distributed in various spatial locations, but there are obvious discrepancies in style, object identity, quantity, and scale among different domains. Motivated by this, we propose a novel inference-domain network evolution to enhance the generalization ability of the one-shot MOT model. Specifically, we design a spatial topology-based one-shot network (STONet) to perform the one-shot MOT task, where a self-supervision mechanism is employed to stimulate the feature extractor to learn the spatial contexts without any annotated information. Furthermore, a temporal identity aggregation (TIA) module is proposed to assist STONet to weaken the adverse effects of noisy labels in the network evolution. This designed TIA aggregates historical embeddings with the same identity to learn cleaner and more reliable pseudo labels. In the inference domain, the proposed STONet with TIA performs pseudo label collection and parameter update progressively to realize the network evolution from the labeled source domain to an unlabeled inference domain. Extensive experiments and ablation studies conducted on MOT15, MOT17, and MOT20, demonstrate the effectiveness of our proposed model.

Abstract:
Deploying Convolutional Neural Network (CNN)-based applications to mobile platforms can be challenging due to the conflict between the restricted computing capacity of mobile devices and the heavy computational overhead of running a CNN. Network quantization is a promising way of alleviating this problem. However, network quantization can result in accuracy degradation and this is especially the case with the compact CNN architectures that are designed for mobile applications. This paper presents a novel and efficient mixed-precision quantization pipeline, called MBFQuant. It redefines the design space for mixed-precision quantization by keeping the bitwidth of the multiplier fixed, unlike other existing methods, because we have found that the quantized model can maintain almost the same running efficiency, so long as the sum of the quantization bitwidth of the weight and the input activation of a layer is a constant. To maximize the accuracy of a quantized CNN model, we have developed a Simulated Annealing (SA)-based optimizer that can automatically explore the design space, and rapidly find the optimal bitwidth assignment. Comprehensive evaluations applying ten CNN architectures to four datasets have served to demonstrate that MBFQuant can achieve improvements in accuracy of up to 19.34% for image classification and 1.12% for object detection, with respect to a corresponding uniform bitwidth quantized model.

Abstract:
Hyperspectral image (HSI) classification is challenging due to spatial variability caused by complex imaging conditions. Prior methods suffer from limited representation ability, as they train specially designed networks from scratch on limited annotated data. We propose a tri-spectral image generation pipeline that transforms HSI into high-quality tri-spectral images, enabling the use of off-the-shelf ImageNet pretrained backbone networks for feature extraction. Motivated by the observation that there are many homogeneous areas with distinguished semantic and geometric properties in HSIs, which can be used to extract useful contexts, we propose an end-to-end segmentation network named DCN-T. It adopts transformers to effectively encode regional adaptation and global aggregation spatial contexts within and between the homogeneous areas discovered by similarity-based clustering. To fully exploit the rich spectrums of the HSI, we adopt an ensemble approach where all segmentation results of the tri-spectral images are integrated into the final prediction through a voting scheme. Extensive experiments on three public benchmarks show that our proposed method outperforms state-of-the-art methods for HSI classification. The code will be released at https://github.com/DotWang/DCN-T.
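
The final ensemble step can be sketched as a per-pixel vote over the label maps predicted from the individual tri-spectral images. This is a minimal sketch assuming a hard majority vote with ties resolved toward the lowest class index; the abstract only says "a voting scheme", so the exact rule and the function name are assumptions.

```python
import numpy as np

def majority_vote(label_maps, num_classes):
    """Fuse several (H, W) integer label maps by per-pixel majority vote."""
    stacked = np.stack(label_maps)  # shape (M, H, W)
    # Count, for every pixel, how many maps voted for each class.
    counts = np.stack([(stacked == c).sum(axis=0) for c in range(num_classes)])
    # argmax returns the first maximum, so ties go to the lowest class index.
    return counts.argmax(axis=0)
```

For example, three 2×2 label maps that disagree on individual pixels are reduced to a single map holding the most frequent label at each position.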

Abstract:
Attribute-based person search aims to find the target person from the gallery images based on the given query text. It often plays an important role in surveillance systems when visual information is not reliable, such as identifying a criminal from a few witnesses. Although recent works have made great progress, most of them neglect the attribute labeling problems that exist in the current datasets. Moreover, these problems also increase the risk of non-alignment between attribute texts and visual images, leading to large semantic gaps. To address these issues, in this paper, we propose Weak Semantic Embeddings (WSEs), which can modify the data distribution of the original attribute texts and thus improve the representability of attribute features. We also introduce feature graphs to learn more collaborative and calibrated information. Furthermore, the relationship modeled by our feature graphs between all semantic embeddings can reduce the semantic gap in text-to-image retrieval. Extensive evaluations on three challenging benchmarks - PETA, Market-1501 Attribute, and PA100K, demonstrate the effectiveness of the proposed WSEs, and our method outperforms existing state-of-the-art methods.

Abstract:
We present an efficient algorithm to approximate the Automatic Color Equalization (ACE) of an input color image, with an upper-bound on the introduced approximation error. The computation is based on Summed Area Tables and a carefully optimized partitioning of the plane into rectangular regions, resulting in a pseudo-linear asymptotic complexity with the number of pixels (against a quadratic straightforward computation of ACE). Our experimental evaluation confirms both the speedups and high accuracy, reaching lower approximation errors than existing approaches. We provide a publicly available reference implementation of our algorithm.
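
The Summed Area Table mentioned above is the building block that makes rectangular region sums constant-time. The sketch below shows only that generic data structure, not the ACE approximation or its partitioning of the plane; the function names are chosen for the example.

```python
import numpy as np

def summed_area_table(img):
    """Return a (H+1, W+1) table where sat[r, c] = sum of img[:r, :c]."""
    sat = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    sat[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return sat

def rect_sum(sat, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1 in O(1)."""
    return sat[bottom, right] - sat[top, right] - sat[bottom, left] + sat[top, left]
```

After one pass to build the table, any rectangle sum costs four lookups regardless of its size, which is what allows contributions of whole rectangular regions to be evaluated cheaply.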

Abstract:
The goal of dynamic scene deblurring is to remove the motion blur presented in a given image. To recover the details from the severe blurs, conventional convolutional neural networks (CNNs) based methods typically increase the number of convolution layers, kernel-size, or different scale images to enlarge the receptive field. However, these methods neglect the non-uniform nature of blurs, and cannot extract varied local and global information. Unlike the CNNs-based methods, we propose a Transformer-based model for image deblurring, named SharpFormer, that directly learns long-range dependencies via a novel Transformer module to overcome large blur variations. Transformer is good at learning global information but is poor at capturing local information. To overcome this issue, we design a novel Locality preserving Transformer (LTransformer) block to integrate sufficient local information into global features. In addition, to effectively apply LTransformer to the medium-resolution features, a hybrid block is introduced to capture intermediate mixed features. Furthermore, we use a dynamic convolution (DyConv) block, which aggregates multiple parallel convolution kernels to handle the non-uniform blur of inputs. We leverage a powerful two-stage attentive framework composed of the above blocks to learn the global, hybrid, and local features effectively. Extensive experiments on the GoPro and REDS datasets show that the proposed SharpFormer performs favourably against the state-of-the-art methods in blurred image restoration.

Abstract:
X-radiography (X-ray imaging) is a widely used imaging technique in art investigation. It can provide information about the condition of a painting as well as insights into an artist’s techniques and working methods, often revealing hidden information invisible to the naked eye. X-radiography of double-sided paintings results in a mixed X-ray image, and this paper deals with the problem of separating this mixed image. Using the visible color images (RGB images) from each side of the painting, we propose a new neural network architecture, based upon ‘connected’ auto-encoders, designed to separate the mixed X-ray image into two simulated X-ray images corresponding to each side. In this connected auto-encoder architecture, the encoders are based on convolutional learned iterative shrinkage thresholding algorithms (CLISTA) designed using algorithm unrolling techniques, whereas the decoders consist of simple linear convolutional layers; the encoders extract sparse codes from the visible images of the front and rear paintings and from the mixed X-ray image, whereas the decoders reproduce both the original RGB images and the mixed X-ray image. The learning algorithm operates in a totally self-supervised fashion without requiring a sample set that contains both the mixed X-ray images and the separated ones. The methodology was tested on images from the double-sided wing panels of the Ghent Altarpiece, painted in 1432 by the brothers Hubert and Jan van Eyck. These tests show that the proposed approach outperforms other state-of-the-art X-ray image separation methods for art investigation applications.

Abstract:
Measuring the similarity of two images is of crucial importance in computer vision. Class agnostic common object detection is a nascent research topic about mining image similarity, which aims to detect common object pairs from two images without category information. This task is general and less restrictive which explores the similarity between objects and can further describe the commonality of image pairs at the object level. However, previous works suffer from features with low discrimination caused by the lack of category information. Moreover, most existing methods compare objects extracted from two images in a simple and direct way, ignoring the internal relationships between objects in the two images. To overcome these limitations, in this paper, we propose a new framework called TransWeaver, which learns intrinsic relationships between objects. Our TransWeaver takes image pairs as input and flexibly captures the inherent correlation between candidate objects from two images. It consists of two modules (i.e., the representation-encoder and the weave-decoder) and captures efficient context information by weaving image pairs to make them interact with each other. The representation-encoder is used for representation learning, which can obtain more discriminative representations for candidate proposals. Furthermore, the weave-decoder weaves the objects from two images and is able to explore the inter-image and intra-image context information at the same time, bringing a better object matching ability. We reorganize the PASCAL VOC, COCO, and Visual Genome datasets to obtain training and testing image pairs. Extensive experiments demonstrate the effectiveness of the proposed TransWeaver which achieves state-of-the-art performance on all datasets.

Abstract:
Weakly supervised semantic segmentation (WSSS) models relying on class activation maps (CAMs) have achieved desirable performance compared to their non-CAM-based counterparts. However, to make the WSSS task feasible, we need to generate pseudo labels by expanding the seeds from CAMs, which is complex and time-consuming and thus hinders the design of efficient end-to-end (single-stage) WSSS approaches. To tackle this dilemma, we resort to off-the-shelf and readily accessible saliency maps to obtain pseudo labels directly, given the image-level class labels. Nevertheless, the salient regions may contain noisy labels and cannot seamlessly fit the target objects, and saliency maps can only be approximated as pseudo labels for simple images containing single-class objects. As such, a segmentation model trained on these simple images cannot generalize well to complex images containing multi-class objects. To this end, we propose an end-to-end multi-granularity denoising and bidirectional alignment (MDBA) model to alleviate the noisy label and multi-class generalization issues. Specifically, we propose online noise filtering and progressive noise detection modules to tackle image-level and pixel-level noise, respectively. Moreover, a bidirectional alignment mechanism is proposed to reduce the data distribution gap in both the input and output spaces with simple-to-complex image synthesis and complex-to-simple adversarial learning. MDBA reaches mIoU of 69.5% and 70.2% on the validation and test sets of the PASCAL VOC 2012 dataset, respectively. The source code and models have been made available at https://github.com/NUST-Machine-Intelligence-Laboratory/MDBA.

Abstract:
We address the problem of referring image segmentation that aims to generate a mask for the object specified by a natural language expression. Many recent works utilize Transformer to extract features for the target object by aggregating the attended visual regions. However, the generic attention mechanism in Transformer only uses the language input for attention weight calculation, which does not explicitly fuse language features in its output. Thus, its output feature is dominated by vision information, which limits the model to comprehensively understand the multi-modal information, and brings uncertainty for the subsequent mask decoder to extract the output mask. To address this issue, we propose Multi-Modal Mutual Attention ($\mathrm{M^3Att}$) and Multi-Modal Mutual Decoder ($\mathrm{M^3Dec}$) that better fuse information from the two input modalities. Based on $\mathrm{M^3Dec}$, we further propose Iterative Multi-modal Interaction (IMI) to allow continuous and in-depth interactions between language and vision features. Furthermore, we introduce Language Feature Reconstruction (LFR) to prevent the language information from being lost or distorted in the extracted feature. Extensive experiments show that our proposed approach significantly improves the baseline and outperforms state-of-the-art referring image segmentation methods on RefCOCO series datasets consistently.

Abstract:
Both salient object detection (SOD) and camouflaged object detection (COD) are typical object segmentation tasks. They are intuitively contradictory, but are intrinsically related. In this paper, we explore the relationship between SOD and COD, and then borrow successful SOD models to detect camouflaged objects to save the design cost of COD models. The core insight is that both SOD and COD leverage two aspects of information: object semantic representations for distinguishing object and background, and context attributes that decide object category. Specifically, we start by decoupling context attributes and object semantic representations from both SOD and COD datasets through designing a novel decoupling framework with triple measure constraints. Then, we transfer saliency context attributes to the camouflaged images through introducing an attribute transfer network. The generated weakly camouflaged images can bridge the context attribute gap between SOD and COD, thereby improving the SOD models’ performances on COD datasets. Comprehensive experiments on three widely-used COD datasets verify the ability of the proposed method. Code and model are available at: https://github.com/wdzhao123/SAT.

Abstract:
Well-known deep learning (DL) is widely used in fusion based hyperspectral image super-resolution (HS-SR). However, DL-based HS-SR models have been designed mostly using off-the-shelf components from current deep learning toolkits, which lead to two inherent challenges: i) they have largely ignored the prior information contained in the observed images, which may cause the output of the network to deviate from the general prior configuration; ii) they are not specifically designed for HS-SR, making it hard to intuitively understand its implementation mechanism and therefore uninterpretable. In this paper, we propose a noise prior knowledge informed Bayesian inference network for HS-SR. Instead of designing a “black-box” deep model, our proposed network, termed as BayeSR, reasonably embeds the Bayesian inference with the Gaussian noise prior assumption to the deep neural network. In particular, we first construct a Bayesian inference model with the Gaussian noise prior assumption that can be solved iteratively by the proximal gradient algorithm, and then convert each operator involved in the iterative algorithm into a specific form of network connection to construct an unfolding network. In the process of network unfolding, based on the characteristics of the noise matrix, we ingeniously convert the diagonal noise matrix operation which represents the noise variance of each band into the channel attention. As a result, the proposed BayeSR explicitly encodes the prior knowledge possessed by the observed images and considers the intrinsic generation mechanism of HS-SR through the whole network flow. Qualitative and quantitative experimental results demonstrate the superiority of the proposed BayeSR against some state-of-the-art methods.

Abstract:
Facial action unit (AU) detection, which aims to classify the AUs present in a facial image, has long suffered from insufficient AU annotations. In this paper, we aim to mitigate this data scarcity issue by learning AU representations from a large number of unlabelled facial videos in a contrastive learning paradigm. We formulate the self-supervised AU representation learning signals as twofold: 1) the AU representation should be frame-wise discriminative within a short video clip; 2) facial frames that are sampled from different identities but show analogous facial AUs should have consistent AU representations. To achieve these goals, we propose to contrastively learn the AU representation within a video clip and devise a cross-identity reconstruction mechanism to learn person-independent representations. Specifically, we adopt a margin-based temporal contrastive learning paradigm to perceive the temporal AU coherence and evolution characteristics within a clip that consists of consecutive input facial frames. Moreover, the cross-identity reconstruction mechanism pushes faces that come from different identities but show analogous AUs close together in the latent embedding space. Experimental results on three public AU datasets demonstrate that the learned AU representation is discriminative for AU detection. Our method outperforms other contrastive learning methods and significantly closes the performance gap between the self-supervised and supervised AU detection approaches.

Abstract:
Video object detection is a widely studied topic and has made significant progress in the past decades. However, the feature extraction and calculations in existing video object detectors demand decent imaging quality and avoidance of severe motion blur. Under extremely dark scenarios, due to limited sensor sensitivity, we have to trade off signal-to-noise ratio for motion blur compensation or vice versa, and thus suffer from performance deterioration. To address this issue, we propose to temporally multiplex a frame sequence into one snapshot and extract the cues characterizing object motion for trajectory retrieval. For effective encoding, we build a prototype for encoded capture by mounting a highly compatible programmable shutter. Correspondingly, in terms of decoding, we design an end-to-end deep network called detection from coded snapshot (DECENT) to retrieve sequential bounding boxes from the coded blurry measurements of dynamic scenes. For effective network learning, we generate quasi-real data by incorporating physically-driven noise into the temporally coded imaging model, which circumvents the unavailability of training data and generalizes well to real dark videos. The approach offers multiple advantages, including low bandwidth, low cost, compact setup, and high accuracy. The effectiveness of the proposed approach is experimentally validated under low-illumination conditions, and the approach provides a feasible way for night surveillance.

Abstract:
4D Light Field (LF) imaging, since it conveys both spatial and angular scene information, can facilitate computer vision tasks and generate immersive experiences for end-users. A key challenge in 4D LF imaging is to flexibly and adaptively represent the included spatio-angular information to facilitate subsequent computer vision applications. Recently, image over-segmentation into homogeneous regions with perceptually meaningful information has been exploited to represent 4D LFs. However, existing methods assume densely sampled LFs and do not adequately deal with sparse LFs with large occlusions. Furthermore, the spatio-angular LF cues are not fully exploited in the existing methods. In this paper, the concept of hyperpixels is defined and a flexible, automatic, and adaptive representation for both dense and sparse 4D LFs is proposed. Initially, disparity maps are estimated for all views to enhance over-segmentation accuracy and consistency. Afterwards, a modified weighted K-means clustering using robust spatio-angular features is performed in 4D Euclidean space. Experimental results on several dense and sparse 4D LF datasets show competitive or superior performance in terms of over-segmentation accuracy, shape regularity and view consistency against state-of-the-art methods.

Abstract:
In recent years, User Generated Content (UGC) has grown dramatically in video sharing applications. It is necessary for service-providers to use video quality assessment (VQA) to monitor and control users’ Quality of Experience when watching UGC videos. However, most existing UGC VQA studies only focus on the visual distortions of videos, ignoring that the perceptual quality also depends on the accompanying audio signals. In this paper, we conduct a comprehensive study on UGC audio-visual quality assessment (AVQA) from both subjective and objective perspectives. Specifically, we construct the first UGC AVQA database, named the SJTU-UAV database, which includes 520 in-the-wild UGC audio and video (A/V) sequences collected from the YFCC100m database. A subjective AVQA experiment is conducted on the database to obtain the mean opinion scores (MOSs) of the A/V sequences. To demonstrate the content diversity of the SJTU-UAV database, we give a detailed analysis of the SJTU-UAV database as well as two other synthetically-distorted AVQA databases and one authentically-distorted VQA database, from both the audio and video aspects. Then, to facilitate the development of the AVQA field, we construct a benchmark of AVQA models on the proposed SJTU-UAV database and the two other AVQA databases. The benchmark models consist of AVQA models designed for synthetically distorted A/V sequences and AVQA models built by combining popular VQA methods and audio features via a support vector regressor (SVR). Finally, considering that the benchmark AVQA models perform poorly in assessing in-the-wild UGC videos, we further propose an effective AVQA model that jointly learns quality-aware audio and visual feature representations in the temporal domain, which is seldom investigated by existing AVQA models. Our proposed model outperforms the aforementioned benchmark AVQA models on the SJTU-UAV database and the two synthetically distorted AVQA databases. The SJTU-UAV database and the code of the proposed model will be released to facilitate further research.

Abstract:
Neurologically, filter pruning is a procedure of forgetting and remembering recovering. Prevailing methods directly forget less important information from an unrobust baseline at first and expect to minimize the performance sacrifice. However, unsaturated base remembering imposes a ceiling on the slimmed model leading to suboptimal performance. And significantly forgetting at first would cause unrecoverable information loss. Here, we design a novel filter pruning paradigm termed Remembering Enhancement and Entropy-based Asymptotic Forgetting (REAF). Inspired by robustness theory, we first enhance remembering by over-parameterizing baseline with fusible compensatory convolutions which liberates pruned model from the bondage of baseline at no inference cost. Then the collateral implication between original and compensatory filters necessitates a bilateral-collaborated pruning criterion. Specifically, only when the filter has the largest intra-branch distance and its compensatory counterpart has the strongest remembering enhancement power, they are preserved. Further, Ebbinghaus curve-based asymptotic forgetting is proposed to protect the pruned model from unstable learning. The number of pruned filters is increasing asymptotically in the training procedure, which enables the remembering of pretrained weights gradually to be concentrated in the remaining filters. Extensive experiments demonstrate the superiority of REAF over many state-of-the-art (SOTA) methods. For example, REAF removes 47.55% FLOPs and 42.98% parameters of ResNet-50 only with 0.98% TOP-1 accuracy loss on ImageNet. The code is available at https://github.com/zhangxin-xd/REAF.

Abstract:
Quaternion singular value decomposition (QSVD) is a robust technique of digital watermarking that extracts high quality watermarks from watermarked images with low distortion. However, the existing QSVD-based watermarking schemes face the obstacle of “explosion of complexity” and have much room for improvement in terms of real-time, invisibility, and robustness. In this paper, we overcome such obstacle by introducing a new real structure-preserving QSVD algorithm and propose a novel QSVD-based watermarking scheme with high efficiency. Secret information is transmitted blindly by incorporating two new strategies: coefficient pair selection and adaptive embedding. The highly correlated coefficient pairs determined by the normalized cross-correlation method reduce the impact of embedding by reducing the maximum modification of the coefficient values, resulting in high fidelity of the watermarked image. Large-size 8-color binary watermark and QR code effectively verify that the proposed watermarking scheme can resist various image attacks in numerical experiments. Two keys designed by Logistic chaotic map ensure the security of the watermarking system. Under the premise of considering the correlation of color channels, the proposed watermarking scheme not only performs well in real-time and invisibility, but also has satisfactory advantages in robustness compared with the state-of-the-art methods.
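
The abstract above selects highly correlated coefficient pairs with the normalized cross-correlation method. As a rough illustration, the sketch below computes the zero-mean variant of normalized cross-correlation between two coefficient vectors; the paper's exact definition and its pair-selection rule are not given in the abstract, so the function name and the choice of variant are assumptions.

```python
import numpy as np

def normalized_cross_correlation(a, b):
    """Zero-mean normalized cross-correlation of two equally sized arrays.

    Returns a value in [-1, 1]; returns 0.0 when either input is constant,
    because the correlation is undefined in that case.
    """
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(a @ b / denom)

# Hypothetical coefficient vectors: the second is a scaled copy of the first.
same_trend = normalized_cross_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_trend = normalized_cross_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

A pair whose score is close to 1 changes together, so embedding a bit into such a pair would typically require only a small modification of the coefficient values.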

Abstract:
Video object detection is a fundamental and important task in computer vision. One mainstay solution for this task is to aggregate features from different frames to enhance the detection on the current frame. Off-the-shelf feature aggregation paradigms for video object detection typically rely on inferring feature-to-feature (Fea2Fea) relations. However, most existing methods are unable to stably estimate Fea2Fea relations due to the appearance deterioration caused by object occlusion, motion blur or rare poses, resulting in limited detection performance. In this paper, we study Fea2Fea relations from a new perspective, and propose a novel dual-level graph relation network (DGRNet) for high-performance video object detection. Different from previous methods, our DGRNet innovatively leverages the residual graph convolutional network to simultaneously model Fea2Fea relations at two different levels including frame level and proposal level, which facilitates performing better feature aggregation in the temporal domain. To prune unreliable edge connections in the graph, we introduce a node topology affinity measure to adaptively evolve the graph structure by mining the local topological information of pairwise nodes. To the best of our knowledge, our DGRNet is the first video object detection method that leverages dual-level graph relations to guide feature aggregation. We conduct experiments on the ImageNet VID dataset and the results demonstrate the superiority of our DGRNet against state-of-the-art methods. Especially, our DGRNet achieves 85.0% mAP and 86.2% mAP with ResNet-101 and ResNeXt-101, respectively.

Abstract:
Learning-based edge detection usually suffers from predicting thick edges. Through an extensive quantitative study with a new edge crispness measure, we find that noisy human-labeled edges are the main cause of thick predictions. Based on this observation, we advocate that more attention should be paid to label quality than to model design to achieve crisp edge detection. To this end, we propose an effective Canny-guided refinement of human-labeled edges whose result can be used to train crisp edge detectors. Essentially, it seeks a subset of over-detected Canny edges that best aligns with the human labels. We show that several existing edge detectors can be turned into crisp edge detectors through training on our refined edge maps. Experiments demonstrate that deep models trained with refined edges achieve a significant boost in crispness, from 17.4% to 30.6%. With the PiDiNet backbone, our method improves ODS and OIS by 12.2% and 12.6% on the Multicue dataset, respectively, without relying on non-maximal suppression. We further conduct experiments and show the superiority of our crisp edge detection for optical flow estimation and image segmentation.

Abstract:
The speed of tracking-by-detection (TBD) greatly depends on the number of times a detector is run, because detection is the most expensive operation in TBD. In many practical cases, however, multi-object tracking (MOT) can be achieved based on tracking-by-motion (TBM) only. This is a possible solution without much loss of MOT accuracy when the object cardinality and motions vary little between consecutive frames. Therefore, the MOT problem can be transformed into finding the best TBD and TBM mechanism. To achieve this, we propose a novel decision coordinator for MOT (Decode-MOT) which can determine the best TBD/TBM mechanism according to scene and tracking contexts. Specifically, our Decode-MOT learns tracking and scene contextual similarities between frames. Because the contextual similarities can vary significantly according to the used trackers and tracking scenes, we learn the Decode-MOT via self-supervision. The evaluation results on MOT challenge datasets prove that our method can boost the tracking speed greatly while keeping the state-of-the-art MOT accuracy. Our code will be available at https://github.com/reussite-cv/Decode-MOT.

Abstract:
Sketch classification models have been extensively investigated by designing task-driven deep neural networks. Despite their successful performance, few works have attempted to explain the predictions of sketch classifiers. To explain the predictions of classifiers, an intuitive way is to visualize the activation maps by computing the gradients. However, visualization-based explanations are constrained by several factors when they are directly applied to interpret sketch classifiers: (i) low-semantic visualization regions for human understanding, and (ii) neglect of the inter-class correlations among distinct categories. To address these issues, we introduce a novel explanation method to interpret the decisions of sketch classifiers with stroke-level evidence. Specifically, to achieve stroke-level semantic regions, we first develop a sketch parser that parses the sketch into strokes while preserving their geometric structures. Then, we design a counterfactual map generator to discover the stroke-level principal components for a specific category. Finally, based on the counterfactual feature maps, our model can answer the question of “why the sketch is classified as X” by providing positive and negative semantic explanation evidence. Experiments conducted on two public sketch benchmarks, Sketchy-COCO and TU-Berlin, demonstrate the effectiveness of our proposed model. Furthermore, our model provides more discriminative and human-understandable explanations than existing works.

Abstract:
Due to the light absorption and scattering induced by the water medium, underwater images usually suffer from some degradation problems, such as low contrast, color distortion, and blurring details, which aggravate the difficulty of downstream underwater understanding tasks. Therefore, how to obtain clear and visually pleasant images has become a common concern of people, and the task of underwater image enhancement (UIE) has also emerged as the times require. Among existing UIE methods, Generative Adversarial Networks (GANs) based methods perform well in visual aesthetics, while the physical model-based methods have better scene adaptability. Inheriting the advantages of the above two types of models, we propose a physical model-guided GAN model for UIE in this paper, referred to as PUGAN. The entire network is under the GAN architecture. On the one hand, we design a Parameters Estimation subnetwork (Par-subnet) to learn the parameters for physical model inversion, and use the generated color enhancement image as auxiliary information for the Two-Stream Interaction Enhancement sub-network (TSIE-subnet). Meanwhile, we design a Degradation Quantization (DQ) module in TSIE-subnet to quantize scene degradation, thereby achieving reinforcing enhancement of key regions. On the other hand, we design the Dual-Discriminators for the style-content adversarial constraint, promoting the authenticity and visual aesthetics of the results. Extensive experiments on three benchmark datasets demonstrate that our PUGAN outperforms state-of-the-art methods in both qualitative and quantitative metrics. The code and results can be found from the link of https://rmcong.github.io/proj_PUGAN.html.

Abstract:
The development of deep learning based image representation learning (IRL) methods has attracted great attention for various image understanding problems. Most of these methods require a large set of high-quality annotated training images, which can be time-consuming, complex and costly to gather. To reduce labeling costs, crowdsourced data, automatic labeling procedures or citizen science projects can be considered. However, such approaches increase the risk of including label noise in training data. It may result in overfitting on noisy labels when discriminative reasoning is employed as in most of the existing methods. This leads to sub-optimal learning procedures, and thus inaccurate characterization of images. To address this issue, in this paper, we introduce a generative reasoning integrated label noise robust deep representation learning (GRID) approach. The proposed GRID approach aims to model the complementary characteristics of discriminative and generative reasoning for IRL under noisy labels. To this end, we first integrate generative reasoning into discriminative reasoning through a supervised variational autoencoder. This allows the proposed GRID approach to automatically detect training samples with noisy labels. Then, through our label noise robust hybrid representation learning strategy, GRID adjusts the whole learning procedure for IRL of these samples through generative reasoning and that of the other samples through discriminative reasoning. Our approach learns discriminative image representations while preventing interference of noisy labels during training independently from the IRL method being selected. Thus, unlike the existing label noise robust methods, GRID does not depend on the type of annotation, label noise, neural network architecture, loss function or learning task, and thus can be directly utilized for various image understanding problems. Experimental results show the effectiveness of the proposed GRID approach compared to the state-of-the-art methods. The code of the proposed approach is publicly available at https://github.com/gencersumbul/GRID.

Abstract:
Image enhancement aims at improving the aesthetic visual quality of photos by retouching the color and tone, and is an essential technology for professional digital photography. In recent years, deep learning-based image enhancement algorithms have achieved promising performance and attracted increasing popularity. However, typical efforts attempt to construct a uniform enhancer for the color transformation of all pixels. This ignores the pixel differences between different content (e.g., sky, ocean, etc.) that are significant for photographs, causing unsatisfactory results. In this paper, we propose a novel learnable context-aware 4-dimensional lookup table (4D LUT), which achieves content-dependent enhancement of different contents in each image via adaptive learning of photo context. In particular, we first introduce a lightweight context encoder and a parameter encoder to learn a context map for the pixel-level category and a group of image-adaptive coefficients, respectively. Then, the context-aware 4D LUT is generated by integrating multiple basis 4D LUTs via the coefficients. Finally, the enhanced image can be obtained by feeding the source image and context map into the fused context-aware 4D LUT via quadrilinear interpolation. Compared with the traditional 3D LUT, i.e., RGB mapping to RGB, which is usually used in camera imaging pipeline systems or tools, the 4D LUT, i.e., RGBC (RGB+Context) mapping to RGB, enables finer control of color transformations for pixels with different content in each image, even though they have the same RGB values. Experimental results demonstrate that our method outperforms other state-of-the-art methods in widely-used benchmarks.

Abstract:
Diagram Question Answering (DQA) aims to correctly answer questions about given diagrams, which demands an interplay of good diagram understanding and effective reasoning. However, the same appearance of objects in diagrams can express different semantics. This kind of visual semantic ambiguity problem makes it challenging to represent diagrams sufficiently for better understanding. Moreover, since there are questions about diagrams from different perspectives, it is also crucial to perform flexible and adaptive reasoning on content-rich diagrams. In this paper, we propose a Disentangled Adaptive Visual Reasoning Network for DQA, named DisAVR, to jointly optimize the dual-process of representation and reasoning. DisAVR mainly comprises three modules: improved region feature learning, question parsing, and disentangled adaptive reasoning. Specifically, the improved region feature learning module is designed to first learn robust diagram representation by integrating detail-aware patch features and semantically-explicit text features with region features. Subsequently, the question parsing module decomposes the question into three types of question guidance including region, spatial relation and semantic relation guidance to dynamically guide subsequent reasoning. Next, the disentangled adaptive reasoning module decomposes the whole reasoning process by employing three visual reasoning cells to construct a soft fully-connected multi-layer stacked routing space. These three cells in each layer reason over object regions, semantic and spatial relations in the diagram under the corresponding question guidance. Moreover, an adaptive routing mechanism is designed to flexibly explore more optimal reasoning paths for specific diagram-question pairs. Extensive experiments on three DQA datasets demonstrate the superiority of our DisAVR.

Abstract:
We propose Meta Learning on Randomized Transformations (MLRT) to learn domain invariant object detectors. Domain generalization is the problem of learning an invariant model from multiple source domains which can generalize well on unseen target domains. This problem, formally named domain generalizable object detection (DGOD), is overlooked in the object detection field. Moreover, existing domain generalization methods have the problem of domain bias, so that they can easily overfit to some specific domain (e.g., source domain). In order to alleviate the domain bias, in the MLRT model, a novel randomized spectrum transformation (RST) module is proposed to increase the diversity of source domains. Specifically, RST randomizes the domain specific information of images in frequency-space, which can transform single or multiple source domains into various new domains. Besides, we observe a prior that the gradient imbalance degree among domains can also reflect the domain bias. Therefore, we further propose to alleviate the domain bias from the perspective of gradient balancing, and a novel gradient weighting (GW) module is proposed to balance the gradients over all domains via a hand-crafted weight. Finally, we embed our RST and GW into a general meta learning framework, and the proposed MLRT model is formalized for the DGOD task. Extensive experiments are conducted on six benchmarks, and our method achieves the SOTA performance.
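
The abstract above describes randomizing domain-specific image information in frequency space. The sketch below illustrates one plausible reading of that idea: the Fourier amplitude is perturbed by random multiplicative noise while the phase is kept. This amplitude/phase split, the function name `randomize_spectrum`, and the `strength` parameter are all assumptions for illustration; the abstract does not specify the actual RST formulation.

```python
import numpy as np

def randomize_spectrum(img, strength=0.5, rng=None):
    """Perturb the Fourier amplitude of an image and keep its phase.

    img:      float array of shape (H, W) or (H, W, C).
    strength: hypothetical knob; 0.0 leaves the image unchanged.
    rng:      optional numpy Generator for reproducibility.
    """
    rng = np.random.default_rng() if rng is None else rng
    spec = np.fft.fft2(img, axes=(0, 1))
    amplitude, phase = np.abs(spec), np.angle(spec)
    # Random per-frequency scale in [1 - strength, 1 + strength].
    scale = 1.0 + strength * rng.uniform(-1.0, 1.0, size=amplitude.shape)
    out = np.fft.ifft2(amplitude * scale * np.exp(1j * phase), axes=(0, 1))
    # The perturbation is not conjugate-symmetric, so keep only the real part.
    return out.real

# Hypothetical input image with values in [0, 1].
source = np.random.default_rng(0).random((8, 8, 3))
augmented = randomize_spectrum(source, strength=0.5, rng=np.random.default_rng(1))
unchanged = randomize_spectrum(source, strength=0.0)
```

Applying the function repeatedly with different random draws yields several variants of the same source image, which is the sense in which a single source domain could be expanded into new ones.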

Abstract:
Robust vision restoration of underwater images remains a challenge. Owing to the lack of well-matched underwater and in-air images, unsupervised methods based on the cyclic generative adversarial framework have been widely investigated in recent years. However, when using an end-to-end unsupervised approach with only unpaired image data, mode collapse could occur, and the color correction of the restored images is usually poor. In this paper, we propose a data- and physics-driven unsupervised architecture to perform underwater image restoration from unpaired underwater and in-air images. For effective color correction and quality enhancement, an underwater image degeneration model must be explicitly constructed based on the optically unambiguous physics law. Thus, we employ the Jaffe-McGlamery degeneration theory to design a generator and use neural networks to model the process of underwater visual degeneration. Furthermore, we impose physical constraints on the scene depth and degeneration factors for backscattering estimation to avoid the vanishing gradient problem during the training of the hybrid physical-neural model. Experimental results show that the proposed method can be used to perform high-quality restoration of unconstrained underwater images without supervision. On multiple benchmarks, the proposed method outperforms several state-of-the-art supervised and unsupervised approaches. We demonstrate that our method yields encouraging results in real-world applications.

Abstract:
Text-based visual question answering (TextVQA) faces the significant challenge of avoiding redundant relational inference. Specifically, the large number of detected objects and optical character recognition (OCR) tokens results in rich visual relationships, and existing works take all of them into account for answer prediction. However, we make three observations: (1) a single subject in an image can easily be detected as multiple objects with distinct bounding boxes (considered repetitive objects), and the associations between these repetitive objects are superfluous for answer reasoning; (2) two spatially distant OCR tokens detected in the image frequently have weak semantic dependencies for answer reasoning; and (3) the co-existence of nearby objects and tokens may indicate important visual cues for predicting answers. Rather than using all relationships for answer prediction, we aim to identify the most important connections and eliminate redundant ones. We propose a sparse spatial graph network (SSGN) that introduces a spatially aware relation pruning technique to this task. As spatial factors for relation measurement, we employ spatial distance, geometric dimension, overlap area, and DIoU. We consider three visual relationships for graph learning: object-object, OCR-OCR token, and object-OCR token relationships. SSGN is a progressive graph learning architecture that verifies the pivotal relations first in the correlated object-token sparse graph, and then in the respective object-based and token-based sparse graphs. Experimental results on the TextVQA and ST-VQA datasets demonstrate that SSGN achieves promising performance, and visualization results further demonstrate the interpretability of our method.
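
Of the spatial factors listed above, DIoU (Distance-IoU) has a standard closed form: the IoU minus the squared distance between box centers, normalized by the squared diagonal of the smallest enclosing box. The sketch below computes it for two axis-aligned boxes in `(x1, y1, x2, y2)` format. How SSGN thresholds or combines this value is not described in the abstract, so the helper is only illustrative.

```python
def diou(box_a, box_b):
    """Distance-IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area (zero if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - d2 / c2

score = diou((0, 0, 2, 2), (1, 1, 3, 3))  # IoU = 1/7, penalty = 2/18
```

Identical boxes give a DIoU of 1, and distant non-overlapping boxes approach -1, so the value ranks both overlap and proximity on one scale.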

Abstract:
Person re-identification across the daytime visible (RGB) modality and the night-time infrared (IR) modality (VI-ReID) is a challenging cross-modality pedestrian retrieval problem. However, training a cross-modality ReID model requires plenty of cross-modality (visible-infrared) identity labels, which are more expensive to obtain than labels for single-modality person ReID. To alleviate this issue, this paper studies the unsupervised domain adaptive visible-infrared person re-identification (UDA-VI-ReID) task, which does not rely on any cross-modality annotation. To transfer learned knowledge from the labelled visible source domain to the unlabelled visible-infrared target domain, we propose a Translation, Association and Augmentation (TAA) framework. Specifically, a modality translator is first used to translate visible images into infrared images, forming generated visible-infrared image pairs for cross-modality supervised training. A Robust Association and Mutual Learning (RAML) module is then designed to exploit the underlying relations between the visible and infrared modalities for label noise modeling. Moreover, a Translation Supervision and Feature Augmentation (TSFA) module is designed to enhance discriminability by enriching the supervision with feature augmentation and modality translation. Extensive experimental results demonstrate that our method significantly outperforms current state-of-the-art unsupervised methods under various settings, and even surpasses some supervised counterparts, providing a powerful baseline for UDA-VI-ReID.

Abstract:
Visual grounding, which aims to align image regions with textual queries, is a fundamental task for cross-modal learning. We study weakly supervised visual grounding, where only coarse-grained image-text pairs are available. Due to the lack of fine-grained correspondence information, existing approaches often encounter matching ambiguity. To overcome this challenge, we introduce a cycle consistency constraint on region-phrase pairs, which strengthens correlated pairs and weakens unrelated pairs. This cycle pairing makes use of the bidirectional association between image regions and text phrases to alleviate matching ambiguity. Furthermore, we propose a parallel grounding framework, in which backbone networks and subsequent relation modules extract individual and contextual representations to calculate context-free and context-aware similarities between regions and phrases separately. These two representations characterize visual/linguistic individual concepts and inter-relationships, respectively, and complement each other to achieve cross-modal alignment. The whole framework is trained by minimizing an image-text contrastive loss and a cycle consistency loss. During inference, the two similarities are fused to give the final region-phrase matching score. Experiments on five popular visual grounding datasets demonstrate that our method yields a noticeable improvement. The source code is available at https://github.com/Evergrow/WSVG.
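
The abstract does not give the exact form of the cycle consistency loss. One plausible reading, sketched here purely as an assumption, is a soft round trip: each phrase attends over regions, each region attends back over phrases, and the loss rewards phrases that return to themselves. The function names and the `temperature` parameter are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cycle_consistency_loss(similarity, temperature=0.1):
    """Soft phrase -> region -> phrase round-trip loss.

    similarity: array of shape (num_regions, num_phrases).
    Illustrative assumption, not the loss defined in the paper.
    """
    phrase_to_region = softmax(similarity.T / temperature, axis=1)  # (P, R)
    region_to_phrase = softmax(similarity / temperature, axis=1)    # (R, P)
    round_trip = phrase_to_region @ region_to_phrase                # (P, P), rows sum to 1
    # Negative log-probability of each phrase returning to itself.
    return float(-np.mean(np.log(np.diag(round_trip) + 1e-12)))

sharp = cycle_consistency_loss(np.eye(3))            # unambiguous matches, low loss
ambiguous = cycle_consistency_loss(np.zeros((3, 3)))  # uniform matches, loss = log(3)
```

Under this formulation an unambiguous similarity matrix gives a loss near zero, while a fully ambiguous one gives log of the number of phrases.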

Abstract:
Recently, metric-based meta-learning methods have been effectively applied to few-shot image classification. These methods classify images based on the relationship between samples in an embedding space, avoiding over-fitting that can occur when training classifiers with limited samples. However, finding an embedding space with good generalization properties remains a challenge. Our work highlights that having an initial manifold space that preserves sample neighbor relationships can prevent the metric model from reaching a suboptimal solution. We propose a feature learning method that leverages Instance Neighbor Constraints (INC). This theory is thoroughly evaluated and analyzed through experiments, demonstrating its effectiveness in improving the efficiency of learning and the overall performance of the model. We further integrate the INC into an alternate optimization training framework (AOT) that leverages both batch learning and episode learning to better optimize the metric-based model. We conduct extensive experiments on 5-way 1-shot and 5-way 5-shot settings on four popular few-shot image benchmarks: miniImageNet, tieredImageNet, Fewshot-CIFAR100 (FC100), and Caltech-UCSD Birds-200-2011(CUB). Results show that our method achieves consistent performance gains on benchmarks and state-of-the-art performance. Our findings suggest that initializing the embedding space appropriately and leveraging both batch and episode learning can significantly improve few-shot learning performance.
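
To make the idea of classifying by relationships in an embedding space concrete, the sketch below uses a nearest-prototype rule in the style of prototypical networks. This is a generic metric-based baseline chosen for illustration, not the INC or AOT method of this paper, and the embeddings are assumed to be precomputed.

```python
import numpy as np

def classify_by_prototype(support, support_labels, queries):
    """Assign each query embedding to the class with the nearest mean support embedding."""
    classes = np.unique(support_labels)
    # One prototype per class: the mean of that class's support embeddings.
    prototypes = np.stack([support[support_labels == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every query to every prototype.
    dists = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return classes[dists.argmin(axis=1)]

# A toy 2-way 2-shot episode with 2-D embeddings.
support = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])
queries = np.array([[0.2, 0.4], [5.5, 5.1]])
predictions = classify_by_prototype(support, labels, queries)
```

No classifier weights are fitted on the few labelled samples, which is why such methods are less prone to over-fitting than training a classifier per episode.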

Abstract:
Adaptive sampling that exploits the spatiotemporal redundancy in videos is critical for always-on action recognition on wearable devices with limited computing and battery resources. The commonly used fixed sampling strategy is not context-aware and may under-sample the visual content, which adversely impacts both computational efficiency and accuracy. Inspired by the concepts of foveal vision and pre-attentive processing from the human visual perception mechanism, we introduce a novel adaptive spatiotemporal sampling scheme for efficient action recognition. Our system pre-scans the global scene context at low resolution and decides whether to skip or request high-resolution features at salient regions for further processing. We validate the system on the EPIC-KITCHENS and UCF-101 (split-1) datasets for action recognition, and show that our approach can greatly speed up inference with a tolerable loss of accuracy compared with state-of-the-art baselines. Source code is available at https://github.com/knmac/adaptive_spatiotemporal.

Abstract:
Benefiting from advances in few-shot learning techniques, their application to dense prediction tasks (e.g., segmentation) has also made great strides in the past few years. However, most existing few-shot segmentation (FSS) approaches follow a pipeline similar to that of few-shot classification, in which some core components are reused directly without regard to the differing properties of the two tasks. We note that such an ill-conceived framework introduces unnecessary information loss, which is clearly unacceptable given the already very limited training samples. To this end, we delve into the typical types of information loss and provide a reasonably effective remedy, namely Retain And REcover (RARE). The main focus of this paper can be summarized as follows: (i) the loss of spatial information due to global pooling; (ii) the loss of boundary information due to mask interpolation; (iii) the degradation of representational power due to sample averaging. Accordingly, we propose a series of strategies to retain the avoidable and recover the unavoidable information loss, such as unidirectional pooling, error-prone region focusing, and adaptive integration. Extensive experiments on two popular benchmarks (i.e., PASCAL-5^i and COCO-20^i) demonstrate the effectiveness of our scheme, which is not restricted to a particular baseline approach. The ultimate goal of our work is to address different information loss problems within a unified framework, and it also exhibits superior performance compared to other methods with similar motivations. The source code will be made available at https://github.com/chunbolang/RARE.

Abstract:
Existing two-view multi-model fitting methods typically follow a two-step procedure, i.e., model generation and selection, without considering the interaction between the two steps. Therefore, in the first step, these methods have to generate a considerable number of instances in order to cover all desired ones, which not only offers no guarantees, but also introduces unnecessary and expensive calculations. To address this challenge, this study presents a new algorithm, termed D2Fitting, that incrementally explores dominant instances. In particular, rather than viewing model generation and selection as two disjoint parts, D2Fitting fully considers their interaction, and thus performs these two subroutines alternately under a simple yet effective optimization framework. This design avoids generating too many redundant instances, thus reducing computational overhead and allowing D2Fitting to run in real time. Meanwhile, we design a novel density-guided sampler to sample high-quality minimal subsets during the model generation process, so as to fully exploit the spatial distribution of the input data. In addition, to mitigate the influence of noise on the subsets sampled by the proposed sampler, a global-residual optimization strategy is investigated for minimal subset refinement. With all of these ingredients, D2Fitting can accurately estimate the number and parameters of geometric models while efficiently segmenting the input data. Extensive experiments on several public datasets demonstrate the significant superiority of D2Fitting over several state-of-the-art methods.

Abstract:
Ingredient prediction with the help of image processing has received increasing attention for its diverse real-world applications, such as nutrition intake management and cafeteria self-checkout systems. Existing approaches mainly focus on multi-task food category-ingredient joint learning to improve final recognition by introducing task relevance, but seldom make good use of the inherent characteristics of ingredients themselves. In fact, there are two issues in ingredient prediction. First, compared with fine-grained food recognition, ingredient prediction needs to extract more comprehensive features of the same ingredient and more detailed features of various ingredients from different regions of the food image, because this helps in understanding various food compositions and distinguishing the differences within ingredient features. Second, the ingredient distributions are extremely unbalanced. Existing loss functions cannot simultaneously solve the imbalance between positive and negative samples belonging to each ingredient and the significant differences among all classes. To solve these problems, we propose a novel framework named Class-Adaptive Context Learning Network (CACLNet) for ingredient prediction. To extract more comprehensive and detailed features, we introduce Ingredient Context Learning (ICL) to reduce the negative impact of complex backgrounds in food images and to construct internal spatial connections among ingredient regions of food objects in a self-supervised manner, which strengthens the connections between instances of the same ingredient through region interactions. To address the imbalance among ingredient classes, we propose a novel Class-Adaptive Asymmetric Loss (CAAL) that focuses on various ingredient classes adaptively. Besides, considering that over-suppression of negative samples will cause over-fitting to the positive samples of rare ingredients, CAAL alleviates this continuous suppression according to gradient-based imbalance ratios, while maintaining the contribution of positive samples through lesser suppression. Extensive evaluation on two popular benchmark datasets (Vireo Food-172, UEC Food-100) demonstrates that our proposed method achieves state-of-the-art performance. Further qualitative analysis and visualization show the effectiveness of our method. Code and models are available at https://123.57.42.89/codes/CACLNet/index.html.

Abstract:
In this paper, we explore the problem of 3D point cloud representation-based view synthesis from a set of sparse source views. To tackle this challenging problem, we propose a new deep learning-based view synthesis paradigm that learns a locally unified 3D point cloud from source views. Specifically, we first construct sub-point clouds by projecting source views to 3D space based on their depth maps. Then, we learn the locally unified 3D point cloud by adaptively fusing points at a local neighborhood defined on the union of the sub-point clouds. Besides, we also propose a 3D geometry-guided image restoration module to fill the holes and recover high-frequency details of the rendered novel views. Experimental results on three benchmark datasets demonstrate that our method can improve the average PSNR by more than 4 dB while preserving more accurate visual details, compared with state-of-the-art view synthesis methods. The code will be publicly available at https://github.com/mengyou2/PCVS.
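
The first step, lifting a source view to a sub-point cloud from its depth map, can be sketched with a standard pinhole camera model. The intrinsic matrix `K` and the function name are assumptions for illustration. The abstract does not state the camera model, and the points here stay in the camera frame rather than a shared world frame.

```python
import numpy as np

def unproject_depth(depth, K):
    """Back-project an H x W depth map to an (H*W) x 3 point cloud in camera coordinates."""
    h, w = depth.shape
    # Pixel grid: u runs along columns (x), v along rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(float)
    # Rays with unit depth: K^{-1} [u, v, 1]^T for every pixel.
    rays = pixels @ np.linalg.inv(K).T
    # Scale each ray by its depth value.
    return rays * depth.reshape(-1, 1)

K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
points = unproject_depth(np.full((2, 2), 4.0), K)
```

Fusing several such sub-point clouds additionally requires each view's camera pose to bring the points into a common frame, which is omitted here.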

Abstract:
By exploring the localizable representations in deep CNN, weakly supervised object localization (WSOL) methods could determine the position of the object in each image just trained by the classification task. However, the partial activation problem caused by the discriminant function makes the network unable to locate objects accurately. To alleviate this problem, we propose Structure-Preserved Attention Activated Network (SPA2Net), a simple and effective one-stage WSOL framework to explore the ability of structure preservation of deep features. Different from traditional WSOL approaches, we decouple the object localization task from the classification branch to reduce their mutual influence by involving a localization branch which is online refined by a self-supervised structural-preserved localization mask. Specifically, we employ the high-order self-correlation as structural prior to enhance the perception of spatial interaction within convolutional features. By succinctly combining the structural prior with spatial attention, activations by SPA2Net will spread from part to the whole object during training. To avoid the structure-missing issue caused by the classification network, we furthermore utilize the restricted activation loss (RAL) to distinguish the difference between foreground and background in the channel dimension. In conjunction with the self-supervised localization branch, SPA2Net can directly predict the class-irrelevant localization map while prompting the network to pay more attention to the target region for accurate localization. Extensive experiments on two publicly available benchmarks, including CUB-200-2011 and ILSVRC, show that our SPA2Net achieves substantial and consistent performance gains compared with baseline approaches. The code and models are available at https://github.com/MsterDC/SPA2Net.

Abstract:
Face editing represents a popular research topic within the computer vision and image processing communities. While significant progress has been made recently in this area, existing solutions: (i) are still largely focused on low-resolution images, (ii) often generate editing results with visual artefacts, or (iii) lack fine-grained control over the editing procedure and alter multiple (entangled) attributes simultaneously when trying to generate the desired facial semantics. In this paper, we aim to address these issues through a novel editing approach, called MaskFaceGAN, that focuses on local attribute editing. The proposed approach is based on an optimization procedure that directly optimizes the latent code of a pre-trained (state-of-the-art) Generative Adversarial Network (i.e., StyleGAN2) with respect to several constraints that ensure: (i) preservation of relevant image content, (ii) generation of the targeted facial attributes, and (iii) spatially-selective treatment of local image regions. The constraints are enforced with the help of a (differentiable) attribute classifier and face parser that provide the necessary reference information for the optimization procedure. MaskFaceGAN is evaluated in extensive experiments on the FRGC, SiblingsDB-HQf, and XM2VTS datasets and in comparison with several state-of-the-art techniques from the literature. Our experimental results show that the proposed approach is able to edit face images with respect to several local facial attributes with unprecedented image quality and at high resolutions (1024×1024), while exhibiting considerably fewer problems with attribute entanglement than competing solutions. The source code is publicly available from: https://github.com/MartinPernus/MaskFaceGAN.

Abstract:
A dynamic point cloud is a form of volumetric visual data that represents realistic 3D scenes for virtual reality and augmented reality applications. However, its large data volume is a bottleneck for data processing, transmission, and storage, and therefore requires effective compression. In this paper, we propose a Perceptually Weighted Rate-Distortion Optimization (PWRDO) scheme for Video-based Point Cloud Compression (V-PCC), which aims to minimize the perceptual distortion of the reconstructed point cloud at a given bit rate. Firstly, we propose a general framework of perceptually optimized V-PCC to exploit visual redundancies in point clouds. Secondly, a multi-scale Projection based Point Cloud quality Metric (PPCM) is proposed to measure the perceptual quality of 3D point clouds. The PPCM model comprises 3D-to-2D patch projection, multi-scale structural distortion measurement, and a fusion model. Approximations and simplifications of the proposed PPCM are also presented for both V-PCC integration and low complexity. Thirdly, based on the simplified PPCM model, we propose a PWRDO scheme with Lagrange multiplier adaptation, which is incorporated into V-PCC to enhance the coding efficiency. Experimental results show that the proposed PPCM models can be used as standalone quality metrics, and that they achieve higher consistency with human subjective scores than state-of-the-art objective visual quality metrics. Also, compared with the latest V-PCC reference model, the proposed PWRDO-based V-PCC scheme achieves average bit rate reductions of 13.52%, 8.16%, 10.56% and 9.54%, respectively, in terms of four objective visual quality metrics for point clouds, which is significantly superior to state-of-the-art coding algorithms. The computational complexity of the proposed PWRDO increases by 1.71% and 0.05% on average for the V-PCC encoder and decoder, respectively, which is negligible.
The source codes of the PPCM and PWRDO schemes are available at https://github.com/VVCodec/PPCM-PWRDO.
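
Rate-distortion optimization in video codecs generally selects the coding option that minimizes a Lagrangian cost J = D + λR. The sketch below shows that selection with a hypothetical scalar `perceptual_weight` applied to the distortion term, which is equivalent to rescaling the Lagrange multiplier. How PWRDO actually derives its weights from PPCM is not given in the abstract, so the weighting here is only an assumed placeholder.

```python
def select_mode(candidates, lam, perceptual_weight=1.0):
    """Pick the candidate minimizing J = w * D + lam * R.

    candidates: iterable of (name, distortion, rate_bits) tuples.
    perceptual_weight (w) is a hypothetical stand-in for a perceptual model.
    """
    def cost(candidate):
        _, distortion, rate = candidate
        return perceptual_weight * distortion + lam * rate
    return min(candidates, key=cost)[0]

modes = [("coarse", 10.0, 100), ("fine", 4.0, 400)]
cheap = select_mode(modes, lam=0.05)                             # rate dominates
weighted = select_mode(modes, lam=0.05, perceptual_weight=4.0)   # distortion dominates
```

Raising the weight in perceptually sensitive regions shifts the choice toward lower-distortion modes at the same λ, which is the general effect a multiplier adaptation aims for.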

Abstract:
Existing vehicle re-identification (Re-ID) methods mainly rely on a single query, which carries limited information for vehicle representation and thus significantly hinders the performance of vehicle Re-ID in complicated surveillance networks. In this paper, we propose a more realistic and easily accessible task, called multi-query vehicle Re-ID, which leverages multiple queries to overcome the viewpoint limitation of a single query. Based on this task, we make three major contributions. First, we design a novel viewpoint-conditioned network (VCNet), which adaptively combines the complementary information from different vehicle viewpoints, for multi-query vehicle Re-ID. Moreover, to deal with the problem of missing vehicle viewpoints, we propose a cross-view feature recovery module that recovers the features of the missing viewpoints by learning the correlation between the features of available and missing viewpoints. Second, we create a unified benchmark dataset, captured by 6142 cameras from a real-life transportation surveillance system, with comprehensive viewpoints and a large number of crossed scenes for each vehicle, for multi-query vehicle Re-ID evaluation. Finally, we design a new evaluation metric, called mean cross-scene precision (mCSP), which measures the ability of cross-scene recognition by suppressing the positive samples with similar viewpoints from the same camera. Comprehensive experiments validate the superiority of the proposed method over other methods, as well as the effectiveness of the designed metric in the evaluation of multi-query vehicle Re-ID. The code and dataset are available at: https://github.com/zhangchaobin001/VCNet

Abstract:
Behavior sequences are generated by a series of spatio-temporal interactions and have a high-dimensional nonlinear manifold structure. Therefore, it is difficult to learn 3D behavior representations without relying on supervised signals. To this end, self-supervised learning methods can be used to explore the rich information contained in the data itself. Context-context contrastive self-supervised methods construct a manifold embedded in Euclidean space by learning the distance relationships between data, and thereby find the geometric distribution of the data. However, traditional Euclidean space struggles to express joint context features. In order to obtain an effective global representation from the relationships between data under unlabeled conditions, this paper adopts contrastive learning to compare global features, and proposes a self-supervised learning method based on hyperbolic embedding to mine the nonlinear relationships of behavior trajectories. The method adopts a framework that discards negative samples, which overcomes the shortcoming of the positive-and-negative-sample paradigm of pushing similar data apart in the feature space. Meanwhile, the output of the network is embedded in a hyperbolic space, and a multi-layer perceptron is added to convert the entire module into a homotopic mapping by using the geometric properties of operations in hyperbolic space, so as to obtain homotopy-invariant knowledge. The proposed method combines the geometric properties of hyperbolic manifolds and the equivariance of homotopy groups to provide better supervisory signals for the network, which improves the performance of unsupervised learning.
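
The abstract does not name the hyperbolic model it uses. As a minimal sketch, assuming the Poincaré ball with curvature -1 (a common choice, and an assumption here), a Euclidean network output can be mapped into the ball with the exponential map at the origin, and distances measured with the closed-form Poincaré metric.

```python
import numpy as np

def expmap0(v, eps=1e-9):
    """Exponential map at the origin of the unit Poincare ball (curvature -1)."""
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    # tanh squashes any Euclidean norm into [0, 1), so the result lies inside the ball.
    return np.tanh(norm) * v / np.maximum(norm, eps)

def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_dist = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - np.sum(x ** 2, axis=-1)) * (1.0 - np.sum(y ** 2, axis=-1))
    return np.arccosh(1.0 + 2.0 * sq_dist / denom)

origin = np.zeros(2)
embedded = expmap0(np.array([0.3, 0.4]))      # Euclidean norm 0.5 before mapping
dist = poincare_distance(origin, embedded)     # equals 2 * 0.5 under this convention
```

Distances grow rapidly near the boundary of the ball, which is what lets hyperbolic space represent tree-like or hierarchical relationships compactly.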

Abstract:
An essential requirement for accurate visual object tracking is to capture better correlations between the tracking target and the search region. However, the dominant Siamese-based trackers are limited to producing dense similarity maps in a single pass via a cross-correlation operation, and do nothing to remedy the contamination caused by erroneous or ambiguous matches. In this paper, we propose a novel tracker, termed neighborhood consensus constraint-based Siamese tracker (NCSiam), which uses the idea of a neighborhood consensus constraint to refine the produced correlation maps. The intuition behind our approach is that nearby erroneous or ambiguous matches can be resolved by analyzing a larger context of the scene that contains a unique match. Specifically, we devise a 4D convolution-based multi-level similarity refinement (MLSR) strategy. Taking the primary similarity maps obtained from cross-correlation as input, MLSR acquires reliable matches by analyzing neighborhood consensus patterns in 4D space, thus enhancing the discriminability between the tracking target and distractors. Besides, traditional Siamese-based trackers directly perform classification and regression on similarity response maps, which discard appearance and semantic information. Therefore, an appearance affinity decoder (AAD) is developed to take full advantage of the semantic information of the search region. To further improve performance, we design a task-specific disentanglement (TSD) module to decouple the learned representations into classification-specific and regression-specific embeddings. Extensive experiments are conducted on six challenging benchmarks, including GOT-10k, TrackingNet, LaSOT, UAV123, OTB2015, and VOT2020. The results demonstrate the effectiveness of our method. The code will be available at https://github.com/laybebe/NCSiam.

Abstract:
Accurate segmentation of power lines in aerial images is very important for UAV flight safety. However, the complex backgrounds and very thin structures of power lines make it an inherently difficult task in computer vision. This paper presents PLGAN, a simple yet effective method based on generative adversarial networks, to segment power lines from aerial images with different backgrounds. Instead of directly using the adversarial networks to generate the segmentation, we take some of their decoding features and embed them into another semantic segmentation network, which accounts for more context, geometry, and appearance information of power lines. We further explore the appropriate form of the generated images for high-quality feature embedding and define a new loss function in the Hough-transform parameter space to enhance the segmentation of very thin power lines. Extensive experiments and comprehensive analysis demonstrate that our proposed PLGAN outperforms prior state-of-the-art methods for semantic segmentation and line detection.

Abstract:
Objects in aerial images show greater variations in scale and orientation than in other images, making them harder to detect using vanilla deep convolutional neural networks. Networks with sampling equivariance can adapt sampling from input feature maps to object transformation, allowing a convolutional kernel to extract effective object features under different transformations. However, methods such as deformable convolutional networks can only provide sampling equivariance under certain circumstances, as they sample by location. We propose sampling equivariant self-attention networks, which treat self-attention restricted to a local image patch as convolution sampling by masks instead of locations, and a transformation embedding module to improve the equivariant sampling further. We further propose a novel randomized normalization module to enhance network generalization and a quantitative evaluation metric to fairly evaluate the ability of sampling equivariance of different models. Experiments show that our model provides significantly better sampling equivariance than existing methods without additional supervision and can thus extract more effective image features. Our model achieves state-of-the-art results on the DOTA-v1.0, DOTA-v1.5, and HRSC2016 datasets without additional computations or parameters.

Abstract:
This paper focuses on skeleton-based few-shot action recognition. Since skeleton is essentially a sparse representation of human action, the feature maps extracted from it, through a standard encoder network in the few-shot condition, may not be sufficiently discriminative for some action sequences that look partially similar to each other. To address this issue, we propose a self and mutual adaptive matching (SMAM) module to convert such feature maps into more discriminative feature vectors. Our method, named as SMAM-Net, first leverages both the temporal information associated with each individual skeleton joint and the spatial relationship among them for feature extraction. Then, the SMAM module adaptively measures the similarity between labeled and query samples and further carries out feature matching within the query set to distinguish similar skeletons of various action categories. Experimental results show that the SMAM-Net outperforms other baselines on the large-scale NTU RGB + D 120 dataset in the tasks of one-shot and five-shot action recognition. We also report our results on smaller datasets including NTU RGB + D 60, SYSU and PKU-MMD to demonstrate that our method is reliable and generalises well on different datasets. Codes and the pretrained SMAM-Net will be made publicly available.

Abstract:
In recent years, deep convolutional neural networks (DCNNs) have been widely used in the task of ship target detection in synthetic aperture radar (SAR) imagery. However, the vast storage and computational cost of DCNN limits its application to spaceborne or airborne onboard devices with limited resources. In this paper, a set of lightweight detection networks for SAR ship target detection are proposed. To obtain these lightweight networks, this paper designs a network structure optimization algorithm based on the multi-objective firefly algorithm (termed NOFA). In our design, the NOFA algorithm encodes the filters of a well-performing ship target detection network into a list of probabilities, which will determine whether the lightweight network will inherit the corresponding filter structure and parameters. After that, the multi-objective firefly optimization algorithm (MFA) continuously optimizes the probability list and finally outputs a set of lightweight network encodings that can meet the different needs of the trade-off between detection network precision and size. Finally, the network pruning technology transforms the encoding that meets the task requirements into a lightweight ship target detection network. The experiments on SSDD and SDCD datasets prove that the method proposed in this paper can provide more flexible and lighter detection networks than traditional detection networks.

Abstract:
Block based motion estimation is integral to inter prediction processes performed in hybrid video codecs. Prevalent block matching based methods that are used to compute block motion vectors (MVs) rely on computationally intensive search procedures. They also suffer from the aperture problem, which tends to worsen as the block size is reduced. Moreover, the block matching criteria used in typical codecs do not account for the resulting levels of perceptual quality of the motion compensated pictures that are created upon decoding. Towards achieving the elusive goal of perceptually optimized motion estimation, we propose a search-free block motion estimation framework using a multi-stage convolutional neural network, which is able to conduct motion estimation on multiple block sizes simultaneously, using a triplet of frames as input. This composite block translation network (CBT-Net) is trained in a self-supervised manner on a large database that we created from publicly available uncompressed video content. We deploy the multi-scale structural similarity (MS-SSIM) loss function to optimize the perceptual quality of the motion compensated predicted frames. Our experimental results highlight the computational efficiency of our proposed model relative to conventional block matching based motion estimation algorithms, for comparable prediction errors. Further, when used to perform inter prediction in AV1, the MV predictions of the perceptually optimized model result in average Bjøntegaard-delta rate (BD-rate) improvements of −1.73% and −1.31% with respect to the MS-SSIM and Video Multi-Method Assessment Fusion (VMAF) quality metrics, respectively, as compared to the block matching based motion estimation system employed in the SVT-AV1 encoder.

Abstract:
Photographs taken through a glass window are susceptible to disturbances due to reflection. Therefore, single image reflection removal is crucial to image quality enhancement. In this paper, a novel learning architecture that can address this ill-posed problem is proposed. First, a novel reflection removal pipeline is designed to reconstruct the information lost in the camera imaging process using the proposed missing recovery network. Second, to address the issues in existing reflection removal strategies, we revisit several auxiliary priors and integrate them by defining an energy function. To solve the energy function, a convolutional neural network-based optimization scheme is proposed. Finally, we investigate the dark channel responses of reflection and clean images and find an interesting way to distinguish between these two types of images. We prove this property mathematically and propose a novel loss function called dark channel loss to improve performance. Experiments show that the proposed method outperforms state-of-the-art reflection removal methods both quantitatively and qualitatively.

Abstract:
General Continual Learning (GCL) aims at learning from non-independent and identically distributed (non-i.i.d.) stream data without catastrophic forgetting of old tasks and without relying on task boundaries during either the training or the testing stage. We reveal that relation and feature deviations are crucial causes of catastrophic forgetting, where relation deviation refers to the deficiency of the relationship among all classes in knowledge distillation, and feature deviation refers to indiscriminative feature representations. To this end, we propose a Complementary Calibration (CoCa) framework by mining the complementary model’s outputs and features to alleviate the two deviations in the process of GCL. Specifically, we propose a new collaborative distillation approach for addressing the relation deviation. It distills the model’s outputs by utilizing the ensemble dark knowledge of the new model’s outputs and reserved outputs, which maintains the performance on old tasks while balancing the relationship among all classes. Furthermore, we explore a collaborative self-supervision idea to leverage pretext tasks and supervised contrastive learning for addressing the feature deviation problem by learning complete and discriminative features for all classes. Extensive experiments on six popular datasets show that our CoCa framework achieves superior performance against state-of-the-art methods. Code is available at https://github.com/lijincm/CoCa.

Abstract:
Eliminating the flickers in digital images captured by rolling shutter cameras is a fundamental and important task in computer vision applications. The flickering effect in a single image stems from the mechanism of asynchronous exposure of rolling shutters employed by cameras equipped with CMOS sensors. In an artificial lighting environment, the light intensity captured at different time intervals varies due to the fluctuation of the power grid, ultimately resulting in the flickering artifact in the image. To date, there are few studies related to single image deflickering. Further, it is even more challenging to remove flickers without a priori information, e.g., camera parameters or paired images. To address these challenges, we propose an unsupervised framework termed DeflickerCycleGAN, which is trained on unpaired images for end-to-end single image deflickering. Besides the cycle-consistency loss to maintain the similarity of image contents, we meticulously design two further novel loss functions, i.e., gradient loss and flicker loss, to reduce the risk of edge blurring and color distortion. Moreover, we provide a strategy to determine whether an image contains flickers or not without extra training, which leverages an ensemble methodology based on the output of two previously trained Markovian discriminators. Extensive experiments on both synthetic and real datasets show that our proposed DeflickerCycleGAN not only achieves excellent performance on flicker removal in a single image but also shows high accuracy and competitive generalization ability on flicker detection, compared to that of a well-trained classifier based on ResNet50.

Abstract:
Multi-modal image registration aims to spatially align two images from different modalities to make their feature points match with each other. Captured by different sensors, the images from different modalities often contain many distinct features, which makes it challenging to find their accurate correspondences. With the success of deep learning, many deep networks have been proposed to align multi-modal images; however, most of them lack interpretability. In this paper, we first model the multi-modal image registration problem as a disentangled convolutional sparse coding (DCSC) model. In this model, the multi-modal features that are responsible for alignment (RA features) are well separated from the features that are not responsible for alignment (nRA features). By only allowing the RA features to participate in the deformation field prediction, we can eliminate the interference of the nRA features to improve the registration accuracy and efficiency. The optimization process of the DCSC model to separate the RA and nRA features is then turned into a deep network, namely Interpretable Multi-modal Image Registration Network (InMIR-Net). To ensure the accurate separation of RA and nRA features, we further design an accompanying guidance network (AG-Net) to supervise the extraction of RA features in InMIR-Net. The advantage of InMIR-Net is that it provides a universal framework to tackle both rigid and non-rigid multi-modal image registration tasks. Extensive experimental results verify the effectiveness of our method on both rigid and non-rigid registrations on various multi-modal image datasets, including RGB/depth images, RGB/near-infrared (NIR) images, RGB/multi-spectral images, T1/T2 weighted magnetic resonance (MR) images and computed tomography (CT)/MR images. The codes are available at https://github.com/lep990816/Interpretable-Multi-modal-Image-Registration.

Abstract:
3D reconstruction and understanding from a monocular camera is a key issue in computer vision. Recent learning-based approaches, especially multi-task learning, have significantly improved the performance of the related tasks. However, existing works are still limited in capturing loss-spatial-aware information. In this paper, we propose a novel Joint-confidence-guided network (JCNet) to simultaneously predict depth, semantic labels, surface normal, and a joint confidence map for the corresponding loss functions. In detail, we design a Joint Confidence Fusion and Refinement (JCFR) module to achieve multi-task feature fusion in the unified independent space, which can also absorb the geometric-semantic structure feature in the joint confidence map. We use confidence-guided uncertainty generated by the joint confidence map to supervise the multi-task prediction across the spatial and channel dimensions. To alleviate the training attention imbalance among different loss functions or spatial regions, the Stochastic Trust Mechanism (STM) is designed to stochastically modify the elements of the joint confidence map in the training phase. Finally, we design a calibrating operation to alternately optimize the joint confidence branch and the other parts of JCNet to avoid overfitting. The proposed methods achieve state-of-the-art performance in both geometric-semantic prediction and uncertainty estimation on NYU-Depth V2 and Cityscapes.

Abstract:
Image inpainting methods leverage the similarity of adjacent pixels to create alternative content. However, as the invisible region becomes larger, the pixels completed in the deeper hole are difficult to infer from the surrounding pixel signal, which is more prone to visual artifacts. To help fill this void, we adopt an alternative progressive hole-filling scheme that hierarchically fills the corrupted region in the feature and image spaces. This technique allows us to utilize reliable contextual information of the surrounding pixels, even for large hole samples, and then gradually complete the details as the resolution increases. For a more realistic representation of the completed region, we devise a pixel-wise dense detector. By distinguishing each pixel as whether it is a masked region or not, and passing the gradient to all resolutions, the generator further enhances the potential quality of the compositing. Furthermore, the completed images at different resolutions are then merged using a proposed structure transfer module (STM) that incorporates fine-grained local and coarse-grained global interactions. In this new mechanism, each completed image at the different resolutions attends its closest composition at fine granularity adjacent image and thus can capture the global continuity by interacting both short- and long-range dependencies. By comparing our solutions qualitatively and quantitatively with state-of-the-art methods, we conclude that our model exhibits a significantly improved visual quality, even in the case of large holes.

Abstract:
Video rescaling has recently drawn extensive attention for its practical applications such as video compression. Compared to video super-resolution, which focuses on upscaling bicubic-downscaled videos, video rescaling methods jointly optimize a downscaler and an upscaler. However, the inevitable loss of information during downscaling makes the upscaling procedure still ill-posed. Furthermore, the network architecture of previous methods mostly relies on convolution to aggregate information within local regions, which cannot effectively capture the relationship between distant locations. To address the above two issues, we propose a unified video rescaling framework by introducing the following designs. First, we propose to regularize the information of the downscaled videos via a contrastive learning framework, where, in particular, hard negative samples for learning are synthesized online. With this auxiliary contrastive learning objective, the downscaler tends to retain more information that benefits the upscaler. Second, we present a selective global aggregation module (SGAM) to efficiently capture long-range redundancy in high-resolution videos, where only a few representative locations are adaptively selected to participate in the computationally heavy self-attention (SA) operations. SGAM enjoys the efficiency of the sparse modeling scheme while preserving the global modeling capability of SA. We refer to the proposed framework as Contrastive Learning framework with Selective Aggregation (CLSA) for video rescaling. Comprehensive experimental results show that CLSA outperforms video rescaling and rescaling-based video compression methods on five datasets, achieving state-of-the-art performance.

Abstract:
Incomplete multi-view clustering (IMVC) analysis, where some views of multi-view data usually have missing data, has attracted increasing attention. However, existing IMVC methods still have two issues: 1) they pay much attention to imputing or recovering the missing data, without considering the fact that the imputed values might be inaccurate due to the unknown label information; 2) the common features of multiple views are always learned from the complete data, while ignoring the feature distribution discrepancy between the complete and incomplete data. To address these issues, we propose an imputation-free deep IMVC method and consider distribution alignment in feature learning. Concretely, the proposed method learns the features for each view by autoencoders and utilizes an adaptive feature projection to avoid the imputation of missing data. All available data are projected into a common feature space, where the common cluster information is explored by maximizing mutual information and the distribution alignment is achieved by minimizing mean discrepancy. Additionally, we design a new mean discrepancy loss for incomplete multi-view learning and make it applicable in mini-batch optimization. Extensive experiments demonstrate that our method achieves comparable or superior performance relative to state-of-the-art methods.
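To make the idea of aligning distributions by minimizing a mean discrepancy concrete, the sketch below computes the simplest such quantity: the squared Euclidean distance between the mean feature vectors of two sample sets (for example, features of complete and incomplete samples in one mini-batch). This is an assumed, linear-kernel form chosen for illustration; the new mean discrepancy loss designed in the paper is not reproduced here, and the function names are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[d] for row in rows) / n for d in range(dim)]


def mean_discrepancy(feats_a, feats_b):
    """Squared Euclidean distance between the mean features of two sample sets."""
    mean_a = mean_vector(feats_a)
    mean_b = mean_vector(feats_b)
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))


# Hypothetical mini-batch: features of complete vs. incomplete samples.
complete = [[0.0, 0.0], [2.0, 2.0]]
incomplete = [[3.0, 1.0]]
loss = mean_discrepancy(complete, incomplete)
```

Because the statistic depends only on batch means, it can be evaluated per mini-batch, which is the property the abstract emphasizes.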

Abstract:
Recently, deep learning-based image compression methods have made significant progress and gradually outperformed traditional approaches, including the latest standard Versatile Video Coding (VVC), in both PSNR and MS-SSIM metrics. Two key components of learned image compression are the entropy model of the latent representations and the encoding/decoding network architectures. Various models have been proposed, such as autoregressive, softmax, logistic mixture, Gaussian mixture, and Laplacian. Existing schemes only use one of these models. However, due to the vast diversity of images, it is not optimal to use one model for all images, or even for different regions within one image. In this paper, we propose a more flexible discretized Gaussian-Laplacian-Logistic mixture model (GLLMM) for the latent representations, which can adapt to different contents in different images and different regions of one image more accurately and efficiently, given the same complexity. Besides, in the encoding/decoding network design part, we propose a concatenated residual block (CRB) structure, where multiple residual blocks are serially connected with additional shortcut connections. The CRB can improve the learning ability of the network, which can further improve the compression performance. Experimental results using the Kodak, Tecnick-100 and Tecnick-40 datasets show that the proposed scheme outperforms all the leading learning-based methods and existing compression standards including VVC intra coding (4:4:4 and 4:2:0) in terms of the PSNR and MS-SSIM. The source code is available at https://github.com/fengyurenpingsheng.
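A discretized mixture entropy model assigns each integer latent value the probability mass that the mixture places on the unit interval around it. The sketch below illustrates this with one Gaussian, one Laplacian, and one logistic component; the weights and parameters are made-up values, and in a learned codec they would be predicted by the network. It is an illustration of the general idea rather than the authors' implementation.

```python
import math


def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def laplace_cdf(x, mu, b):
    if x < mu:
        return 0.5 * math.exp((x - mu) / b)
    return 1.0 - 0.5 * math.exp(-(x - mu) / b)


def logistic_cdf(x, mu, s):
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))


def discretized_mixture_pmf(y, components):
    """Probability of integer y: weighted sum of CDF(y + 0.5) - CDF(y - 0.5).

    `components` is a list of (weight, cdf, location, scale); weights sum to 1.
    """
    return sum(
        w * (cdf(y + 0.5, mu, scale) - cdf(y - 0.5, mu, scale))
        for w, cdf, mu, scale in components
    )


# Hypothetical parameters for a single latent element.
components = [
    (0.5, gaussian_cdf, 0.0, 2.0),
    (0.3, laplace_cdf, 1.0, 1.0),
    (0.2, logistic_cdf, -1.0, 1.5),
]
bits_for_zero = -math.log2(discretized_mixture_pmf(0, components))
```

The estimated bit cost of a symbol is the negative base-2 logarithm of its probability, so a mixture that matches the local statistics more closely yields a lower rate.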

Abstract:
The Versatile Video Coding (VVC) standard introduces a block partitioning structure known as quadtree plus nested multi-type tree (QTMTT), which allows more flexible block partitioning compared to its predecessors, like High Efficiency Video Coding (HEVC). Meanwhile, the partition search (PS) process, which finds the best partitioning structure for optimizing the rate-distortion cost, becomes far more complicated for VVC than for HEVC. Also, the PS process in VVC reference software (VTM) is not friendly to hardware implementation. We propose a partition map prediction method for fast block partitioning in VVC intra-frame encoding. The proposed method may replace PS totally or be combined with PS partially, thereby achieving adjustable acceleration of the VTM intra-frame encoding. Unlike previous methods for fast block partitioning, we propose to represent a QTMTT-based block partitioning structure by a partition map, which consists of a quadtree (QT) depth map, several multi-type tree (MTT) depth maps, and several MTT direction maps. We then propose to predict the optimal partition map from the pixels through a convolutional neural network (CNN). We propose a CNN structure, known as Down-Up-CNN, for the partition map prediction, where the CNN structure emulates the recursive nature of the PS process. Moreover, we design a post-processing algorithm to adjust the network output partition map, so as to obtain a standard-compliant block partitioning structure. The post-processing algorithm may produce a partial partition tree as well; then based on the partial partition tree, the PS process is performed to obtain the full tree. Experimental results show that the proposed method achieves 1.61× to 8.64× encoding acceleration for the VTM-10.0 intra-frame encoder, with the ratio depending on how much PS is performed. In particular, at 3.89× encoding acceleration, the compression efficiency loss is 2.77% in BD-rate, which is a better tradeoff than that of previous methods.

Abstract:
Automatic data augmentation is a technique to automatically search for strategies for image transformations, which can improve the performance of different vision tasks. RandAugment (RA), one of the most widely used automatic data augmentations, achieves great success in different scales of models and datasets. However, RA randomly selects transformations with equivalent probabilities and applies a single magnitude for all transformations, which is suboptimal for different models and datasets. In this paper, we develop Differentiable RandAugment (DRA) to learn selecting weights and magnitudes of transformations for RA. The magnitude of each transformation is modeled following a normal distribution with both learnable mean and standard deviation. We also introduce the gradient of transformations to reduce the bias in gradient estimation and KL divergence as part of the loss to reduce the optimization gap. Experiments on CIFAR-10/100 and ImageNet demonstrate the efficiency and effectiveness of DRA. Searching for only 0.95 GPU hours on ImageNet, DRA can reach a Top-1 accuracy of 78.19% with ResNet-50, which outperforms RA by 0.28% under the same settings. Transfer learning on object detection also demonstrates the power of DRA. The proposed DRA is one of the few that surpasses RA on ImageNet and has great potential to be integrated into modern training pipelines to achieve state-of-the-art performance. Our code will be made publicly available for out-of-the-box use.
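The sampling step implied by the abstract can be sketched as follows: a transformation is drawn according to its selection weight, and its magnitude is drawn from a normal distribution written in reparameterized form (mean plus standard deviation times unit noise). Only the sampling is shown; the gradient-based learning of weights, means, and standard deviations is not reproduced. The operation names, the clamp to [0, 1], and the function name `sample_transformation` are assumptions for illustration.

```python
import random


def sample_transformation(ops, weights, means, stds, rng):
    """Pick one operation by weight and draw its magnitude as mean + std * eps."""
    idx = rng.choices(range(len(ops)), weights=weights, k=1)[0]
    eps = rng.gauss(0.0, 1.0)  # unit Gaussian noise, independent of the parameters
    magnitude = means[idx] + stds[idx] * eps
    # Assumption: magnitudes are normalized to [0, 1], so clamp the sample.
    magnitude = min(1.0, max(0.0, magnitude))
    return ops[idx], magnitude


# Hypothetical operations and parameter values.
rng = random.Random(0)
ops = ["rotate", "shear_x", "color"]
weights = [0.5, 0.3, 0.2]
means = [0.4, 0.6, 0.5]
stds = [0.1, 0.05, 0.2]
op, magnitude = sample_transformation(ops, weights, means, stds, rng)
```

Writing the magnitude as a deterministic function of the parameters and independent noise is what allows gradients to reach the mean and standard deviation.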

Abstract:
Despite the recent success achieved by deep neural networks (DNNs), it remains challenging to disclose/explain the decision-making process from the numerous parameters and complex non-linear functions. To address the problem, explainable AI (XAI) aims to provide explanations corresponding to the learning and prediction processes for deep learning models. In this paper, we propose a novel representation learning framework of Describe, Spot and eXplain (DSX). Based on the Transformer architecture, our proposed DSX framework is composed of two learning stages, descriptive prototype learning and discriminative prototype discovery. Given an input image, the former stage is designed to derive a set of descriptive representations, while the latter stage further identifies a discriminative subset, offering semantic interpretability for the corresponding classification tasks. While our DSX does not require any ground truth attribute supervision during training, the derived visual representations can be practically associated with physical attributes provided by domain experts. Extensive experiments on fine-grained classification and person re-identification tasks qualitatively and quantitatively verify the use of our DSX model for offering semantically practical interpretability with satisfactory recognition performance.

Abstract:
The presence of radically irregular data points (RIDPs), which are referred to as the subset of measurements that represents no or little information, can significantly degrade the performance of ellipse fitting methods. We develop an ellipse fitting method that is robust to RIDPs based on the maximum correntropy criterion with variable center (MCC-VC), where an adaptable Laplacian kernel is used. For single ellipse fitting, we formulate a non-convex optimization problem and divide it into two subproblems, one to estimate the kernel bandwidth and the other the kernel center. We design a sufficiently accurate convex approximation to each subproblem that leads to computationally efficient closed-form solutions. The two subproblems are solved in an alternating manner until convergence is reached. We also investigate coupled ellipses fitting. While there exist multiple ellipses fitting methods in the literature, we develop a coupled ellipses fitting method by exploiting the underlying special structure, where the associations between the data points and ellipses are absent in the problem. The proposed method first introduces an association vector for each data point and then formulates a non-convex mixed-integer optimization problem to establish the data associations, which is approximately solved by relaxing it into a second-order cone program. Using the estimated data associations, we then extend the proposed single ellipse fitting method to accomplish the final coupled ellipses fitting. The proposed method is shown to perform significantly better than the existing methods using both simulated data and real images.
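The robustness of a correntropy objective comes from the kernel being bounded: each residual contributes at most 1, so a grossly wrong point cannot dominate the fit. The sketch below evaluates a sample correntropy with a Laplacian kernel and an adjustable center and bandwidth. The unnormalized kernel form exp(-|e - c| / sigma) and the function name are assumptions for illustration; the paper's subproblem solvers for the bandwidth and center are not reproduced.

```python
import math


def laplacian_correntropy(residuals, center, bandwidth):
    """Mean of exp(-|e - center| / bandwidth) over the fitting residuals."""
    return sum(
        math.exp(-abs(e - center) / bandwidth) for e in residuals
    ) / len(residuals)


# Hypothetical residuals: three inliers near 0.1 and one gross outlier.
residuals = [0.1, 0.12, 0.08, 50.0]
score_centered = laplacian_correntropy(residuals, center=0.1, bandwidth=0.5)
score_at_zero = laplacian_correntropy(residuals, center=0.0, bandwidth=0.5)
```

Moving the center toward the bulk of the residuals raises the score, while the outlier's contribution stays close to zero for either center.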

Abstract:
Multi-view Stereo (MVS) aims to reconstruct a 3D point cloud model from multiple views. In recent years, learning-based MVS methods have received a lot of attention and achieved excellent performance compared with traditional methods. However, these methods still have apparent shortcomings, such as the accumulative error in the coarse-to-fine strategy and the inaccurate depth hypotheses based on the uniform sampling strategy. In this paper, we propose the NR-MVSNet, a coarse-to-fine structure with the depth hypotheses based on the normal consistency (DHNC) module, and the depth refinement with reliable attention (DRRA) module. Specifically, we design the DHNC module to generate more effective depth hypotheses, which collects the depth hypotheses from neighboring pixels with the same normals. As a result, the predicted depth can be smoother and more accurate, especially in texture-less and repetitive-texture regions. On the other hand, we update the initial depth map in the coarse stage by the DRRA module, which can combine attentional reference features and cost volume features to improve the depth estimation accuracy in the coarse stage and address the accumulative error problem. Finally, we conduct a series of experiments on the DTU, BlendedMVS, Tanks & Temples, and ETH3D datasets. The experimental results demonstrate the efficiency and robustness of our NR-MVSNet compared with the state-of-the-art methods. Our implementation is available at https://github.com/wdkyh/NR-MVSNet.

Abstract:
Known as a hard nut to crack, single-model transferable targeted attacks via decision-level optimization objectives have long attracted much attention among scholars. On this topic, recent works have devoted themselves to designing new optimization objectives. In contrast, we take a closer look at the intrinsic problems in three commonly adopted optimization objectives, and propose two simple yet effective methods in this paper to mitigate these intrinsic problems. Specifically, inspired by the basic idea of adversarial learning, we, for the first time, propose a unified Adversarial Optimization Scheme (AOS) to relieve both the problem of gradient vanishing in cross-entropy loss and that of gradient amplification in Po+Trip loss, and indicate that our AOS, a simple transformation on the output logits before passing them to the objective functions, can yield considerable improvements on the targeted transferability. Besides, we make a further clarification on the preliminary conjecture in Vanilla Logit Loss (VLL) and point out the problem of unbalanced optimization in VLL, in which the source logit may increase without explicit suppression, leading to low transferability. Then, the Balanced Logit Loss (BLL) is further proposed, where we take both the source logit and the target logit into account. Comprehensive validations confirm the compatibility and the effectiveness of the proposed methods across most attack frameworks, and their effectiveness also spans two tough cases (i.e., the low-ranked transfer scenario and the transfer to defense methods) and three datasets (i.e., ImageNet, CIFAR-10, and CIFAR-100). Our source code is available at https://github.com/xuxiangsun/DLLTTAA.

Abstract:
In recent years, various neural network architectures for computer vision have been devised, such as the visual transformer and multilayer perceptron (MLP). A transformer based on an attention mechanism can outperform a traditional convolutional neural network. Compared with the convolutional neural network and transformer, the MLP introduces less inductive bias and achieves stronger generalization. In addition, a transformer shows an exponential increase in the inference, training, and debugging times. Considering a wave function representation, we propose the WaveNet architecture that adopts a novel vision task-oriented wavelet-based MLP for feature extraction to perform salient object detection in RGB (red–green–blue)-thermal infrared images. In addition, we apply knowledge distillation to a transformer as an advanced teacher network to acquire rich semantic and geometric information and guide WaveNet learning with this information. Following the shortest-path concept, we adopt the Kullback–Leibler distance as a regularization term for the RGB features to be as similar to the thermal infrared features as possible. The discrete wavelet transform allows for the examination of frequency-domain features in a local time domain and time-domain features in a local frequency domain. We apply this representation ability to perform cross-modality feature fusion. Specifically, we introduce a progressively cascaded sine–cosine module for cross-layer feature fusion and use low-level features to obtain clear boundaries of salient objects through the MLP. Results from extensive experiments indicate that the proposed WaveNet achieves impressive performance on benchmark RGB-thermal infrared datasets. The results and code are publicly available at https://github.com/nowander/WaveNet.

Abstract:
Pedestrian detection is still a challenging task for computer vision, especially in crowded scenes where the overlaps between pedestrians tend to be large. The non-maximum suppression (NMS) plays an important role in removing the redundant false positive detection proposals while retaining the true positive detection proposals. However, the highly overlapped results may be suppressed if the threshold of NMS is lower. Meanwhile, a higher threshold of NMS will introduce a larger number of false positive results. To solve this problem, we propose an optimal threshold prediction (OTP) based NMS method that predicts a suitable threshold of NMS for each human instance. First, a visibility estimation module is designed to obtain the visibility ratio. Then, we propose a threshold prediction subnet to determine the optimal threshold of NMS automatically according to the visibility ratio and classification score. Finally, we re-formulate the objective function of the subnet and utilize the reward-guided gradient estimation algorithm to update the subnet. Comprehensive experiments on CrowdHuman and CityPersons show the superior performance of the proposed method in pedestrian detection, especially in crowded scenes.
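The effect of a per-instance threshold can be seen in a small greedy NMS sketch. Here each box carries its own IoU threshold, and the threshold of the currently kept box decides which overlapping neighbors are suppressed. The thresholds are supplied by hand in this example; in the proposed method they would come from the threshold prediction subnet, which is not reproduced. Using the kept box's threshold is an assumption about how per-instance thresholds are applied, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms_per_box(boxes, scores, thresholds):
    """Greedy NMS where thresholds[i] is the IoU limit applied when box i is kept."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop remaining boxes that overlap the kept box beyond its own threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thresholds[best]]
    return keep


# Two heavily overlapping pedestrians (IoU = 2/3) and one isolated box.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept_fixed = nms_per_box(boxes, scores, [0.5, 0.5, 0.5])
kept_adaptive = nms_per_box(boxes, scores, [0.7, 0.5, 0.5])
```

With a uniform threshold of 0.5 the second pedestrian is suppressed, whereas raising only the first box's threshold to 0.7 retains both, which mirrors the crowded-scene trade-off described above.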

Abstract:
Intra prediction is a crucial part of video compression, which utilizes local information in images to eliminate spatial redundancy. As the state-of-the-art video coding standard, Versatile Video Coding (H.266/VVC) employs multiple directional prediction modes in intra prediction to find the texture trend of local areas. Then the prediction is made based on reference samples in the selected direction. Recently, neural network-based intra prediction has achieved great success. Deep network models are trained and applied to assist the HEVC and VVC intra modes. In this paper, we propose a novel tree-structured data clustering-driven neural network (dubbed TreeNet) for intra prediction, which builds the networks and clusters the training data in a tree-structured manner. Specifically, in each network split and training process of TreeNet, every parent network on a leaf node is split into two child networks by adding or subtracting Gaussian random noise. Then data clustering-driven training is applied to train the two derived child networks using the clustered training data of their parent. On the one hand, the networks at the same level in TreeNet are trained with non-overlapping clustered datasets, and thus they can learn different prediction abilities. On the other hand, the networks at different levels are trained with hierarchically clustered datasets, and thus they will have different generalization abilities. TreeNet is integrated into VVC to assist or replace intra prediction modes to test its performance. In addition, a fast termination strategy is proposed to accelerate the search of TreeNet. The experimental results demonstrate that when TreeNet is used to assist the VVC Intra modes, TreeNet with depth = 3 can bring an average of 3.78% bitrate saving (up to 8.12%) over VTM-17.0. If TreeNet with the same depth replaces all VVC intra modes, an average of 1.59% bitrate saving can be reached.
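The network split step can be illustrated with a toy parameter vector: one child is obtained by adding a Gaussian noise vector to the parent's parameters and the other by subtracting it. Reading "adding or subtracting" as one shared noise vector with opposite signs is an assumption, as are the noise scale and the function name; the subsequent clustering-driven training of the two children is not shown.

```python
import random


def split_network(parent_params, sigma, rng):
    """Derive two child parameter lists as parent + noise and parent - noise."""
    noise = [rng.gauss(0.0, sigma) for _ in parent_params]
    child_plus = [p + n for p, n in zip(parent_params, noise)]
    child_minus = [p - n for p, n in zip(parent_params, noise)]
    return child_plus, child_minus


# Hypothetical parent with three parameters.
rng = random.Random(0)
parent = [0.5, -1.0, 2.0]
child_a, child_b = split_network(parent, sigma=0.1, rng=rng)
```

Starting the children symmetrically around the parent means they begin with nearly the parent's behavior and diverge only as each is trained on its own cluster of data.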

Abstract:
Recognizing human actions in dark videos is a useful yet challenging visual task in reality. Existing augmentation-based methods separate action recognition and dark enhancement in a two-stage pipeline, which leads to inconsistent learning of the temporal representation for action recognition. To address this issue, we propose a novel end-to-end framework termed Dark Temporal Consistency Model (DTCM), which is able to jointly optimize dark enhancement and action recognition, and force the temporal consistency to guide downstream dark feature learning. Specifically, DTCM cascades the action classification head with the dark augmentation network to perform dark video action recognition in a one-stage pipeline. Our explored spatio-temporal consistency loss, which utilizes the RGB-Difference of dark video frames to encourage temporal coherence of the enhanced video frames, is effective for boosting spatio-temporal representation learning. Extensive experiments demonstrate that our DTCM has remarkable performance: 1) Competitive accuracy, which outperforms the state-of-the-art methods on the ARID dataset by 2.32% and on the UAVHuman-Fisheye dataset by 4.19% in accuracy, respectively; 2) High efficiency, which surpasses the current most advanced method (Chen et al., 2021) with only 6.4% of the GFLOPs and 71.3% of the number of parameters; 3) Strong generalization, which can be used in various action recognition methods (e.g., TSM, I3D, 3D-ResNext-101, Video-Swin) to promote their performance significantly.

Abstract:
This paper proposes a novel data-driven approach to designing orthonormal transform matrix codebooks for adaptive transform coding of any non-stationary vector processes which can be considered locally stationary. Our algorithm, which belongs to the class of block-coordinate descent algorithms, relies on simple probability models such as Gaussian or Laplacian for transform coefficients to directly minimize with respect to the orthonormal transform matrix the mean square error (MSE) of scalar quantization and entropy coding of transform coefficients. A difficulty commonly encountered in such minimization problems is imposing the orthonormality constraint on the matrix solution. We get around this difficulty by mapping the constrained problem in Euclidean space to an unconstrained problem on the Stiefel manifold and leveraging known algorithms for unconstrained optimization on manifolds. While the basic design algorithm directly applies to non-separable transforms, an extension to separable transforms is also proposed. We present experimental results for adaptive transform coding of still images and video inter-frame prediction residuals, comparing the transforms designed using the proposed method and a number of other content-adaptive transforms recently reported in the literature.

Abstract:
Scene appearance changes drastically throughout the day. Existing semantic segmentation methods mainly focus on well-lit daytime scenarios and are not well designed to cope with such great appearance changes. Naively using domain adaptation does not solve this problem because it usually learns a fixed mapping between the source and target domains and thus has limited generalization capability on all-day scenarios (i.e., from dawn to night). In this paper, in contrast to existing methods, we tackle this challenge from the perspective of image formulation itself, where the image appearance is determined by both intrinsic (e.g., semantic category, structure) and extrinsic (e.g., lighting) properties. To this end, we propose a novel intrinsic-extrinsic interactive learning strategy. The key idea is to interact between intrinsic and extrinsic representations during the learning process under spatial-wise guidance. In this way, the intrinsic representation becomes more stable and, at the same time, the extrinsic representation gets better at depicting the changes. Consequently, the refined image representation is more robust for generating pixel-wise predictions in all-day scenarios. To achieve this, we propose an All-in-One Segmentation Network (AO-SegNet) in an end-to-end manner. Large-scale experiments are conducted on three real datasets (Mapillary, BDD100K and ACDC) and our proposed synthetic All-day CityScapes dataset. The proposed AO-SegNet shows a significant performance gain against the state-of-the-art under a variety of CNN and ViT backbones on all the datasets.

Abstract:
Image deblurring and its counterpart blind problem are undoubtedly two fundamental tasks in computational imaging and computer vision. Interestingly, deterministic edge-preserving regularization for maximum-a-posteriori (MAP) based non-blind image deblurring was largely made clear 25 years ago. As for the blind task, the state-of-the-art MAP-based approaches also seem to reach a consensus on the characteristic of deterministic image regularization, i.e., formulated in an L0 composite style, also termed L0+X style, where X is often a discriminative term such as dark-channel-based sparsity regularization. However, with such a modeling perspective, non-blind and blind deblurring are entirely disconnected from each other. Additionally, because L0 and X are motivated very differently in general, it is not easy in practice to derive an efficient numerical scheme. In fact, since modern blind deblurring began to flourish 15 years ago, a physically intuitive yet practically effective and efficient regularization has always been desired. In this paper, representative deterministic image regularization terms in MAP-based blind deblurring are first revisited, with an emphasis on their differences from edge-preserving regularization for non-blind deblurring. Inspired by existing robust losses in the statistical and deep learning literature, an insightful conjecture is then made. That is, deterministic image regularization for blind deblurring can be naively formulated using a type of redescending potential (RDP) function, and interestingly, an RDP-induced blind deblurring regularization term is actually the first-order derivative of a nonconvex edge-preserving regularization for non-blind image deblurring. An intimate relationship in regularization is therefore established between the two problems, differing much from the mainstream modeling perspective on blind deblurring. Following the above analysis, the conjecture is finally demonstrated on benchmark deblurring problems, accompanied by comparisons against several top-performing L0+X style methods. We note that the rationality and practicality of the RDP-induced regularization are particularly highlighted here, aiming to open up an alternative line of possibility for modeling blind deblurring.

Abstract:
Millimeter-wave (MMW) imaging techniques have been widely used in the public security industry because their privacy concerns are under control and they pose no health hazards. However, since MMW images are low-resolution and most objects are small, weakly reflective, and diverse, suspicious object detection in MMW images is a very challenging task. This paper develops a robust suspicious object detector for MMW images based on a Siamese network integrated with pose estimation and image segmentation, which estimates the coordinates of human joints and segments the complete human images into symmetrical body part images. Unlike most existing detectors, which detect and recognize suspicious objects in MMW images and require a complete training set with correct annotations, our proposed model aims to learn the similarity between two symmetrical human body part images segmented from the complete MMW images. Furthermore, to decrease the misdetection caused by the restricted field of view, we fuse the multi-view MMW images observed from the same person by designing a decision-level fusion strategy and a feature-level fusion strategy based on the attention mechanism. Experimental results on the measured MMW images show that our proposed models have favorable detection accuracy and speed in practical application and thus prove their effectiveness.

Abstract:
As a prerequisite step of scene text reading, scene text detection is known as a challenging task due to natural scene text diversity and variability. Most existing methods either adopt bottom-up sub-text component extraction or focus on top-down text contour regression. From a hybrid perspective, we explore hierarchical text instance-level and component-level representation for arbitrarily-shaped scene text detection. In this work, we propose a novel Hierarchical Graph Reasoning Network (HGR-Net), which consists of a Text Feature Extraction Network (TFEN) and a Text Relation Learner Network (TRLN). TFEN adaptively learns multi-grained text candidates based on shared convolutional feature maps, including instance-level text contours and component-level quadrangles. In TRLN, an inter-text graph is constructed to explore global contextual information with position-awareness between text instances, and an intra-text graph is designed to estimate geometric attributes for establishing component-level linkages. Next, we bridge the cross-feed interaction between instance-level and component-level, and it further achieves hierarchical relational reasoning by learning complementary graph embeddings across levels. Experiments conducted on three publicly available benchmarks SCUT-CTW1500, Total-Text, and ICDAR15 have demonstrated that HGR-Net achieves state-of-the-art performance on arbitrary orientation and arbitrary shape scene text detection.

Abstract:
Light field (LF) images containing information for multiple views have numerous applications, which can be severely affected by low-light imaging. Recent learning-based methods for low-light enhancement have some disadvantages, such as a lack of noise suppression, a complex training process and poor performance in extremely low-light conditions. To tackle these deficiencies while fully utilizing the multi-view information, we propose an efficient Low-light Restoration Transformer (LRT) for LF images, with multiple heads to perform intermediate tasks within a single network, including denoising, luminance adjustment, refinement and detail enhancement, achieving progressive restoration from small scale to full scale. Moreover, we design an angular transformer block with an efficient view-token scheme to model the global angular dependencies, and a multi-scale spatial transformer block to encode the multi-scale local and global information within each view. To address the issue of insufficient training data, we formulate a synthesis pipeline by simulating the major noise sources with the estimated noise parameters of the LF camera. Experimental results demonstrate that our method achieves state-of-the-art performance on low-light LF restoration with high efficiency.

Abstract:
There are demographic biases present in current facial recognition (FR) models. To measure these biases across different ethnic and gender subgroups, we introduce our Balanced Faces in the Wild (BFW) dataset. This dataset allows for the characterization of FR performance per subgroup. We found that relying on a single score threshold to differentiate between genuine and impostor sample pairs leads to suboptimal results. Additionally, performance within subgroups often varies significantly from the global average. Therefore, specific error rates only hold for populations that match the validation data. To mitigate imbalanced performances, we propose a novel domain adaptation learning scheme that uses facial features extracted from state-of-the-art neural networks. This scheme boosts the average performance and preserves identity information while removing demographic knowledge. Removing demographic knowledge prevents potential biases from affecting decision-making and protects privacy by eliminating demographic information. We explore the proposed method and demonstrate that subgroup classifiers can no longer learn from features projected using our domain adaptation scheme. For access to the source code and data, please visit https://github.com/visionjo/facerec-bias-bfw.
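
The observation about a single score threshold can be made concrete with a small sketch. The code below is not from the paper or its repository; the tuple layout, the subgroup names "A" and "B", and the toy scores are assumptions made for illustration. It computes the false match rate (FMR) and false non-match rate (FNMR) per subgroup when one shared threshold is applied to all pairs.

```python
from collections import defaultdict

def error_rates_by_subgroup(pairs, threshold):
    """Return {subgroup: (FMR, FNMR)} for one shared score threshold.

    `pairs` is an iterable of (subgroup, score, is_genuine) tuples; this
    layout is an illustrative assumption, not the BFW file format.
    """
    counts = defaultdict(lambda: {"fm": 0, "imp": 0, "fnm": 0, "gen": 0})
    for subgroup, score, is_genuine in pairs:
        c = counts[subgroup]
        if is_genuine:
            c["gen"] += 1
            c["fnm"] += score < threshold    # genuine pair wrongly rejected
        else:
            c["imp"] += 1
            c["fm"] += score >= threshold    # impostor pair wrongly accepted
    return {
        g: (c["fm"] / c["imp"] if c["imp"] else 0.0,
            c["fnm"] / c["gen"] if c["gen"] else 0.0)
        for g, c in counts.items()
    }

# Made-up similarity scores for two hypothetical subgroups.
toy_pairs = [
    ("A", 0.90, True), ("A", 0.80, True), ("A", 0.30, False), ("A", 0.20, False),
    ("B", 0.70, True), ("B", 0.55, True), ("B", 0.65, False), ("B", 0.40, False),
]
rates = error_rates_by_subgroup(toy_pairs, threshold=0.6)
```

With these toy numbers the shared threshold of 0.6 yields no errors for subgroup "A" but an FMR and FNMR of 0.5 each for subgroup "B", which is the kind of gap between subgroup and global error rates that the abstract describes.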

Abstract:
Composing Text and Image to Image Retrieval (CTI-IR) aims at finding the target image, which matches the query image visually along with the query text semantically. However, existing works ignore the fact that the reference text usually serves multiple functions, e.g., modification and auxiliary. To address this issue, we put forth a unified solution, namely Hierarchical Aggregation Transformer incorporated with Cross Relation Network (CRN). CRN unifies modification and relevance manner in a single framework. This configuration shows broader applicability, enabling us to model both modification and auxiliary text or their combination in triplet relationships simultaneously. Specifically, CRN includes: 1) Cross Relation Network comprehensively captures the relationships of various composed retrieval scenarios caused by two different query text types, allowing a unified retrieval model to designate adaptive combination strategies for flexible applicability; 2) Hierarchical Aggregation Transformer aggregates top-down features with Multi-layer Perceptron (MLP) to overcome the limitations of edge information loss in a window-based multi-stage Transformer. Extensive experiments demonstrate the superiority of the proposed CRN over all three fashion-domain datasets. Code is available at github.com/yan9qu/crn.

Affiliations: Institute of Science and Technology Innovation, Dongguan University of Technology, Dongguan, China; Engineering Research Center of Digital Forensics, Ministry of Education, School of Computer and Software, Jiangsu Engineering Center of Network Monitoring, Jiangsu Collaborative Innovation Center on Atmospheric Environment and Equipment Technology, Nanjing University of Information Science and Technology, Nanjing, China; National and Local Joint Engineering Research Center of Network Security Detection and Protection Technology, Guangdong Provincial Key Laboratory of Data Security and Privacy Protection, Engineering Research Center of Trustworthy AI, Ministry of Education, College of Information Science and Technology, Jinan University, Guangzhou, China

Abstract:
Cloud computing has become an important IT infrastructure in the big data era; more and more users are motivated to outsource storage and computation tasks to the cloud server for convenient services. However, privacy has become the biggest concern, and tasks are expected to be processed in a privacy-preserving manner. This paper proposes a secure SIFT feature extraction scheme with better integrity, accuracy and efficiency than the existing methods. SIFT includes many complex steps, including the construction of the DoG scale space, extremum detection, extremum location adjustment, rejection of low-contrast extremum points, elimination of the edge response, orientation assignment, and descriptor generation. These complex steps need to be disassembled into elementary operations such as addition, multiplication, and comparison for secure implementation. We adopt a series of secret-sharing protocols for better accuracy and efficiency. In addition, we design a secure absolute value comparison protocol to support absolute value comparison operations in the secure SIFT feature extraction. The SIFT feature extraction steps are completely implemented in the ciphertext domain. The communications between the clouds are appropriately packed to reduce the communication rounds. We carefully analyzed the accuracy and efficiency of our scheme. The experimental results show that our scheme outperforms the existing state-of-the-art.
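
To show what "elementary operations" on secret-shared values can look like, here is a minimal sketch of two-party additive secret sharing. The abstract does not name the concrete protocols, so everything below is an assumption: the two-server setting, the prime modulus `P`, and the use of Beaver triples (a standard textbook technique) for multiplication. In a real deployment the triple would come from a trusted dealer or an offline phase; here one function generates it locally for brevity.

```python
import secrets

P = 2**61 - 1  # public prime modulus (an assumed parameter)

def share(x):
    """Split x into two additive shares with x = s0 + s1 (mod P)."""
    s0 = secrets.randbelow(P)
    return s0, (x - s0) % P

def reconstruct(s0, s1):
    """Recombine two shares into the underlying value."""
    return (s0 + s1) % P

def add_shares(a, b):
    """Each server adds its own shares locally; no communication is needed."""
    return (a[0] + b[0]) % P, (a[1] + b[1]) % P

def beaver_triple():
    """Stand-in for a dealer: shares of random u, v and of w = u * v."""
    u, v = secrets.randbelow(P), secrets.randbelow(P)
    return share(u), share(v), share(u * v % P)

def mul_shares(a, b):
    """Multiply two shared values using one Beaver triple."""
    u, v, w = beaver_triple()
    # The servers open d = a - u and e = b - v; both are uniformly masked.
    d = reconstruct((a[0] - u[0]) % P, (a[1] - u[1]) % P)
    e = reconstruct((b[0] - v[0]) % P, (b[1] - v[1]) % P)
    # a*b = w + d*v + e*u + d*e; the public d*e term is added by server 0 only.
    z0 = (w[0] + d * v[0] + e * u[0] + d * e) % P
    z1 = (w[1] + d * v[1] + e * u[1]) % P
    return z0, z1

x, y = share(7), share(6)
total = reconstruct(*add_shares(x, y))
product = reconstruct(*mul_shares(x, y))
```

Addition costs no communication, while each multiplication needs one round to open `d` and `e`. That cost is why packing several openings into one message, as the abstract mentions, reduces the number of rounds.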

Abstract:
In few-shot classification, performing well on a testing dataset is a challenging task due to the restricted amount of labelled data available and the unknown distribution. Many previously proposed techniques rely on prototypical representations of the support set in order to classify a query set. Although this approach works well with a large, in-domain support set, accuracy suffers when transitioning to an out-of-domain setting, especially when using small support sets. To address out-of-domain performance degradation with small support sets, we propose Masked Embedding Modeling for Few-Shot Learning (MEM-FS), a novel, self-supervised, generative technique that reinforces few-shot-classification accuracy for a prototypical backbone model. MEM-FS leverages the data completion capabilities of a masked autoencoder to expand a given embedded support set. To further increase out-of-domain performance, we also introduce Rapid Domain Adjustment (RDA), a novel, self-supervised process for quickly conditioning MEM-FS to a new domain. We show that masked support embeddings generated by MEM-FS+RDA can significantly improve backbone performance on both out-of-domain and in-domain datasets. Our experiments demonstrate that applying the proposed technique to an inductive classifier achieves state-of-the-art performance on mini-ImageNet, the CVPR L2ID Classification Challenge, and a newly proposed dataset, IKEA-FS. We provide code for this work at https://github.com/Brikwerk/MEM-FS
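
For readers unfamiliar with the prototypical baseline that the abstract builds on, the following sketch shows the basic idea: each class is represented by the mean of its support embeddings, and a query is assigned to the nearest prototype. This is not MEM-FS itself and does not come from the linked repository; the class names, the 2-D embeddings, and the use of Euclidean distance are illustrative assumptions.

```python
import math

def prototypes(support):
    """Map each class label to the mean of its support embeddings."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [sum(v[i] for v in vectors) / len(vectors)
                         for i in range(dim)]
    return protos

def classify(query, protos):
    """Return the label whose prototype is nearest in Euclidean distance."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))

# Toy 2-D embeddings for a 2-way, 2-shot episode (made-up values).
support = {
    "chair": [[1.0, 0.0], [0.8, 0.2]],
    "lamp": [[0.0, 1.0], [0.2, 0.8]],
}
protos = prototypes(support)
prediction = classify([0.7, 0.1], protos)
```

With only two embeddings per class, each prototype is a noisy estimate, which is the small-support-set weakness the abstract sets out to address by expanding the embedded support set.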

Abstract:
Since high-order relationships among multiple brain regions of interest (ROIs) are helpful to explore the pathogenesis of neurological diseases more deeply, hypergraph-based brain networks are more suitable for brain science research. Unlike the existing hypergraph-based brain network (brain hypernetwork), where hyperedges containing the same number of ROIs are assumed to have equal weights (to some extent, the network is unweighted), and the underlying structure is described only by an incidence/adjacency matrix, in this paper, we propose a framework for constructing a truly weighted brain hypernetwork described by an adjacency tensor. Considering the relationships among vertices within a hyperedge, we propose a novel hyperedge weight estimation method and convert the incidence matrix into a weighted adjacency tensor. On the basis of tensor decomposition, we apply hypergraph signal processing tools, such as the hypergraph Fourier transform, to analyze and compare the spectrum between schizophrenia patients and normal controls. It is found that there are more high-frequency components in the spectrum of patients than in that of controls, and the average amplitude is significantly greater than that of controls. Instead of extracting some simple topological features from brain hypernetworks for classification, we innovatively use the hypergraph spectrum and the spectral signal as classification features, and the classification results on two public datasets demonstrate the effectiveness of our proposed method.

Abstract:
Lately, video-language pre-training and text-video retrieval have attracted significant attention with the explosion of multimedia data on the Internet. However, existing approaches for video-language pre-training typically limit the exploitation of the hierarchical semantic information in videos, such as frame semantic information and global video semantic information. In this work, we present an end-to-end pre-training network with Hierarchical Matching and Momentum Contrast named HMMC. The key idea is to explore the hierarchical semantic information in videos via multilevel semantic matching between videos and texts. This design is motivated by the observation that if a video semantically matches a text (which can be a title, tag or caption), the frames in this video usually have semantic connections with the text and show higher similarity than frames in other videos. Hierarchical matching is mainly realized by two proxy tasks: Video-Text Matching (VTM) and Frame-Text Matching (FTM). Another proxy task, Frame Adjacency Matching (FAM), is proposed to enhance the single visual modality representations while training from scratch. Furthermore, a momentum contrast framework is introduced into HMMC to form a multimodal momentum contrast framework, enabling HMMC to incorporate more negative samples for contrastive learning, which contributes to the generalization of representations. We also collected a large-scale Chinese video-language dataset (over 763k unique videos) named CHVTT to explore the multilevel semantic connections between videos and texts. Experimental results on two major text-video retrieval benchmark datasets demonstrate the advantages of our methods. We release our code at https://github.com/cheetah003/HMMC.
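
The two generic ingredients of a momentum contrast setup, as the term is commonly used, can be sketched in a few lines: a slowly updated key encoder and a fixed-size queue that supplies extra negative samples. This is a simplified illustration under assumptions, not the HMMC implementation; the flat parameter lists, the momentum value, the queue capacity, and the class name `NegativeQueue` are all hypothetical.

```python
from collections import deque

def momentum_update(query_params, key_params, m=0.999):
    """Move each key-encoder parameter slowly toward its query-encoder twin."""
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

class NegativeQueue:
    """Fixed-size FIFO of past key embeddings reused as negative samples."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def enqueue(self, embeddings):
        # When the queue is full, the oldest embeddings are dropped first.
        self._items.extend(embeddings)

    def negatives(self):
        return list(self._items)

# Toy usage with scalar "parameters" and string placeholders for embeddings.
key = momentum_update(query_params=[1.0, 2.0], key_params=[0.0, 0.0], m=0.9)
queue = NegativeQueue(capacity=3)
queue.enqueue(["e1", "e2"])
queue.enqueue(["e3", "e4"])
```

Because the queue outlives a single batch, the number of negatives is decoupled from the batch size, which matches the abstract's point about incorporating more negative samples.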

Abstract:
Hyperspectral image (HSI) classification has always been recognised as a difficult task. It is therefore a research hotspot in remote sensing image processing and analysis, and a number of studies have been conducted to better extract spectral and spatial features. This study aimed to track the variation of the spectrum in hyperspectral images from a sequential data perspective to obtain more distinguishable features. Based on the characteristics of optical flow, this study introduces an optical flow technique for the extraction of spectral flow that denotes the spectral variation and implements a dense optical flow extraction method based on deep matching. Lastly, the extracted spectral flow is combined with the original spectral features and input into a commonly used support vector machine (SVM) classifier to complete the classification. Extensive classification experiments on three benchmark HSI test sets show that the classification accuracy obtained by the spectral flow extracted in this study (SpectralFlow) is higher than that of traditional spatial feature extraction methods, texture feature extraction methods, and the latest deep-learning-based methods. Furthermore, the proposed method can produce finer classification thematic maps, thereby demonstrating strong practical application potential.

Abstract:
Existing methods for Salient Object Detection in Optical Remote Sensing Images (ORSI-SOD) mainly adopt Convolutional Neural Networks (CNNs) as the backbone, such as VGG and ResNet. Since CNNs can only extract features within certain receptive fields, most ORSI-SOD methods generally follow the local-to-contextual paradigm. In this paper, we propose a novel Global Extraction Local Exploration Network (GeleNet) for ORSI-SOD following the global-to-local paradigm. Specifically, GeleNet first adopts a transformer backbone to generate four-level feature embeddings with global long-range dependencies. Then, GeleNet employs a Direction-aware Shuffle Weighted Spatial Attention Module (D-SWSAM) and its simplified version (SWSAM) to enhance local interactions, and a Knowledge Transfer Module (KTM) to further enhance cross-level contextual interactions. D-SWSAM comprehensively perceives the orientation information in the lowest-level features through directional convolutions to adapt to various orientations of salient objects in ORSIs, and effectively enhances the details of salient objects with an improved attention mechanism. SWSAM discards the direction-aware part of D-SWSAM to focus on localizing salient objects in the highest-level features. KTM models the contextual correlation knowledge of two middle-level features of different scales based on the self-attention mechanism, and transfers the knowledge to the raw features to generate more discriminative features. Finally, a saliency predictor is used to generate the saliency map based on the outputs of the above three modules. Extensive experiments on three public datasets demonstrate that the proposed GeleNet outperforms relevant state-of-the-art methods. The code and results of our method are available at https://github.com/MathLee/GeleNet.

Abstract:
We propose a weakly supervised approach for salient object detection from multi-modal RGB-D data. Our approach relies only on labels from scribbles, which are much easier to annotate than the dense labels used in the conventional fully supervised setting. In contrast to existing methods that employ supervision signals on the output space, our design regularizes the intermediate latent space to enhance discrimination between salient and non-salient objects. We further introduce a contour detection branch to implicitly constrain the semantic boundaries and achieve precise edges of detected salient objects. To enhance the long-range dependencies among local features, we introduce a Cross-Padding Attention Block (CPAB). Extensive experiments on seven benchmark datasets demonstrate that our method not only outperforms existing weakly supervised methods, but is also on par with several fully-supervised state-of-the-art models. Code is available at https://github.com/leolyj/DHFR-SOD.

Abstract:
Unsupervised cross-domain Facial Expression Recognition (FER) aims to transfer the knowledge from a labeled source domain to an unlabeled target domain. Existing methods strive to reduce the discrepancy between the source and target domains, but cannot effectively explore the abundant semantic information of the target domain due to the absence of target labels. To this end, we propose a novel framework via Contrastive Warm up and Complexity-aware Self-Training (namely CWCST), which facilitates source knowledge transfer and target semantic learning jointly. Specifically, we formulate a contrastive warm up strategy via features, momentum features, and learnable category centers to concurrently learn discriminative representations and narrow the domain gap, which benefits domain adaptation by generating more accurate target pseudo labels. Moreover, to deal with the inevitable noise in pseudo labels, we develop complexity-aware self-training with a label selection module based on prediction entropy, which iteratively generates pseudo labels and adaptively chooses the reliable ones for training, ultimately yielding effective target semantics exploration. Furthermore, by jointly using the two mentioned components, our framework makes effective use of the source knowledge and target semantic information through source-target co-training. In addition, our framework can be easily incorporated into other baselines with consistent performance improvements. Extensive experimental results on seven databases show the superior performance of the proposed method against various baselines.
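
Label selection based on prediction entropy can be illustrated with a short sketch. The abstract says the reliable pseudo labels are chosen adaptively; the fixed cap `max_entropy` used below is a simplifying assumption, as are the function names and the toy probability vectors.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_pseudo_labels(predictions, max_entropy):
    """Keep (sample_index, argmax_label) for predictions under the entropy cap."""
    selected = []
    for idx, probs in enumerate(predictions):
        if entropy(probs) <= max_entropy:
            label = max(range(len(probs)), key=probs.__getitem__)
            selected.append((idx, label))
    return selected

# Toy softmax outputs for three unlabeled target samples (made-up values).
preds = [
    [0.96, 0.02, 0.02],  # confident, low entropy
    [0.40, 0.35, 0.25],  # ambiguous, high entropy
    [0.05, 0.90, 0.05],  # confident, low entropy
]
chosen = select_pseudo_labels(preds, max_entropy=0.5)
```

Only the first and third samples pass the cap, so the ambiguous second prediction is kept out of the next self-training round instead of being used as a noisy label.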

Abstract:
For autonomous vehicles (AVs), visual perception techniques based on sensors like cameras play crucial roles in information acquisition and processing. In various computer perception tasks for AVs, it may be helpful to match landmark patches taken by an onboard camera with other landmark patches captured at a different time or saved in a street scene image database. To perform matching under challenging driving environments caused by changing seasons, weather, and illumination, we utilize the spatial neighborhood information of each patch. We propose an approach, named RobustMat, which derives its robustness to perturbations from neural differential equations. A convolutional neural ODE diffusion module is used to learn the feature representation for the landmark patches. A graph neural PDE diffusion module then aggregates information from neighboring landmark patches in the street scene. Finally, feature similarity learning outputs the final matching score. Our approach is evaluated on several street scene datasets and demonstrated to achieve state-of-the-art matching results under environmental perturbations.

Abstract:
When adopting a model-based formulation, solving inverse problems encountered in multiband imaging requires defining spatial and spectral regularizations. In most works in the literature, spectral information is extracted from the observations directly to derive data-driven spectral priors. Conversely, the choice of the spatial regularization often boils down to the use of conventional penalizations (e.g., total variation) promoting expected features of the reconstructed image (e.g., piece-wise constant). In this work, we propose a generic framework able to capitalize on an auxiliary acquisition of high spatial resolution to derive tailored data-driven spatial regularizations. This approach leverages the ability of deep learning to extract high-level features. More precisely, the regularization is conceived as a deep generative network able to encode spatial semantic features contained in this auxiliary image of high spatial resolution. To illustrate the versatility of this approach, it is instantiated to conduct two particular tasks, namely multiband image fusion and multiband image inpainting. Experimental results obtained on these two tasks demonstrate the benefit of this class of informed regularizations when compared to more conventional ones.

Abstract:
The synthesis of high-resolution remote sensing images based on text descriptions has great potential in many practical application scenarios. Although deep neural networks have achieved great success in many important remote sensing tasks, generating realistic remote sensing images from text descriptions is still very difficult. To address this challenge, we propose a novel text-to-image modern Hopfield network (Txt2Img-MHN). The main idea of Txt2Img-MHN is to conduct hierarchical prototype learning on both text and image embeddings with modern Hopfield layers. Instead of directly learning concrete but highly diverse text-image joint feature representations for different semantics, Txt2Img-MHN aims to learn the most representative prototypes from text-image embeddings, achieving a coarse-to-fine learning strategy. These learned prototypes can then be utilized to represent more complex semantics in the text-to-image generation task. To better evaluate the realism and semantic consistency of the generated images, we further conduct zero-shot classification on real remote sensing data using the classification model trained on synthesized images. Despite its simplicity, we find that the overall accuracy in the zero-shot classification may serve as a good metric to evaluate the ability to generate an image from text. Extensive experiments on the benchmark remote sensing text-image dataset demonstrate that the proposed Txt2Img-MHN can generate more realistic remote sensing images than existing methods. Code and pre-trained models are available online (https://github.com/YonghaoXu/Txt2Img-MHN).

Abstract:
Scene-text image synthesis techniques that aim to naturally compose text instances on background scene images are very appealing for training deep neural networks due to their ability to provide accurate and comprehensive annotation information. Prior studies have explored generating synthetic text images on two-dimensional and three-dimensional surfaces using rules derived from real-world observations. Some of these studies have proposed generating scene-text images through learning; however, owing to the absence of a suitable training dataset, unsupervised frameworks have been explored to learn from existing real-world data, which might not yield reliable performance. To ease this dilemma and facilitate research on learning-based scene text synthesis, we introduce DecompST, a real-world dataset prepared from some public benchmarks, containing three types of annotations: quadrilateral-level BBoxes, stroke-level text masks, and text-erased images. Leveraging the DecompST dataset, we propose a Learning-Based Text Synthesis engine (LBTS) that includes a text location proposal network (TLPNet) and a text appearance adaptation network (TAANet). TLPNet first predicts the suitable regions for text embedding, after which TAANet adaptively adjusts the geometry and color of the text instance to match the background context. After training, those networks can be integrated and utilized to generate the synthetic dataset for scene text analysis tasks. Comprehensive experiments were conducted to validate the effectiveness of the proposed LBTS along with existing methods, and the experimental results indicate the proposed LBTS can generate better pretraining data for scene text detectors. Our dataset and code are made available at: https://github.com/iiclab/DecompST.

Abstract:
Infrared small and dim (S&D) target detection is one of the key techniques in infrared search and tracking systems. Since local regions similar to infrared S&D targets spread over the whole background, exploring the correlation amongst image features over long-range dependencies to mine the difference between the target and background is crucial for robust detection. However, existing deep learning-based methods are limited by the locality of convolutional neural networks, which impairs the ability to capture long-range dependencies. Additionally, the S&D appearance of the infrared target makes the detection model prone to missed detections. To this end, we propose a robust and general infrared S&D target detection method with the transformer. We adopt the self-attention mechanism of the transformer to learn the correlation of image features in a larger range. Moreover, we design a feature enhancement module to learn discriminative features of S&D targets to avoid missed detections. After that, to avoid the loss of the target information, we adopt a decoder with a U-Net-like skip connection operation to retain more information about S&D targets. Finally, we get the detection result by a segmentation head. Extensive experiments on two public datasets show the clear superiority of the proposed method over state-of-the-art methods, and the proposed method has a stronger generalization ability and better noise tolerance.

Abstract:
In recent years, advanced research has focused on the direct learning and analysis of remote-sensing images using natural language processing (NLP) techniques. The ability to accurately describe changes occurring in multi-temporal remote sensing images is becoming increasingly important for geospatial understanding and land planning. Unlike natural image change captioning tasks, remote sensing change captioning aims to capture the most significant changes, irrespective of various influential factors such as illumination, seasonal effects, and complex land covers. In this study, we highlight the significance of accurately describing changes in remote sensing images and present a comparison of the change captioning task for natural and synthetic images and remote sensing images. To address the challenge of generating accurate captions, we propose an attentive changes-to-captions network, called Chg2Cap for short, for bi-temporal remote sensing images. The network comprises three main components: 1) a Siamese CNN-based feature extractor to collect high-level representations for each image pair; 2) an attentive encoder that includes a hierarchical self-attention block to locate change-related features and a residual block to generate the image embedding; and 3) a transformer-based caption generator to decode the relationship between the image embedding and the word embedding into a description. The proposed Chg2Cap network is evaluated on two representative remote sensing datasets, and a comprehensive experimental analysis is provided. The code and pre-trained models will be available online at https://github.com/ShizhenChang/Chg2Cap.

Abstract:
In recent years, there has been a growing interest in combining learnable modules with numerical optimization to solve low-level vision tasks. However, most existing approaches focus on designing specialized schemes to generate image/feature propagation. There is a lack of unified consideration to construct propagative modules, provide theoretical analysis tools, and design effective learning mechanisms. To mitigate the above issues, this paper proposes a unified optimization-inspired learning framework to aggregate Generative, Discriminative, and Corrective (GDC for short) principles with strong generalization for diverse optimization models. Specifically, by introducing a general energy minimization model and formulating its descent direction from different viewpoints (i.e., in a generative manner, based on the discriminative metric and with optimality-based correction), we construct three propagative modules to effectively solve the optimization models with flexible combinations. We design two control mechanisms that provide the non-trivial theoretical guarantees for both fully- and partially-defined optimization formulations. Under the support of theoretical guarantees, we can introduce diverse architecture augmentation strategies such as normalization and search to ensure stable propagation with convergence and seamlessly integrate the suitable modules into the propagation respectively. Extensive experiments across varied low-level vision tasks validate the efficacy and adaptability of GDC.

Abstract:
This paper presents a novel non-parametric technique for two-dimensional spectrum readability enhancement. The approach is based on relocating a windowed bivariate Fourier transform with regard to its frequency estimates computed using a moving analyzing window. To this aim, four spatial instantaneous frequency estimators are proposed. A strongly concentrated spectrum with improved component separability is obtained with the proposed technique. The method was intensively tested using simulated and real-life signals. As an example of the method application, inverse synthetic aperture radar (ISAR) images were created and then focused, significantly improving the contrast and entropy. However, the presented technique can be applied to other bivariate signal analyses whenever the windowed two-dimensional Fourier transform (W2D-FT) is applied.

Abstract:
Head pose estimation (HPE) is an indispensable upstream task in the fields of human-machine interaction, self-driving, and attention detection. However, practical head pose applications suffer from several challenges, such as severe occlusion, low illumination, and extreme orientations. To address these challenges, we identify three cues from head images, namely, critical minority relationships, neighborhood orientation relationships, and significant facial changes. On the basis of the three cues, two key insights on head poses are revealed: 1) intra-orientation relationship and 2) cross-orientation relationship. To leverage the two key insights above, a novel relationship-driven method is proposed based on the Transformer architecture, in which facial and orientation relationships can be learned. Specifically, we design several orientation tokens to explicitly encode basic orientation regions. In addition, a novel token-guided multi-loss function is designed to guide the orientation tokens as they learn the desired regional similarities and relationships. Experimental results on three challenging benchmark HPE datasets show that our proposed TokenHPE achieves state-of-the-art performance. Moreover, qualitative visualizations are provided to verify the effectiveness of the token-learning methodology.

Abstract:
In recent years, point clouds have become increasingly popular for representing three-dimensional (3D) visual objects and scenes. To efficiently store and transmit point clouds, compression methods have been developed, but they often result in a degradation of quality. To reduce color distortion in point clouds, we propose a graph-based quality enhancement network (GQE-Net) that uses geometry information as an auxiliary input and graph convolution blocks to extract local features efficiently. Specifically, we use a parallel-serial graph attention module with a multi-head graph attention mechanism to focus on important points or features and help them fuse together. Additionally, we design a feature refinement module that takes into account the normals and geometry distance between points. To work within the limitations of GPU memory capacity, the distorted point cloud is divided into overlap-allowed 3D patches, which are sent to GQE-Net for quality enhancement. To account for differences in data distribution among different color components, three models are trained for the three color components. Experimental results show that our method achieves state-of-the-art performance. For example, when implementing GQE-Net on a recent test model of the geometry-based point cloud compression (G-PCC) standard, 0.43 dB, 0.25 dB and 0.36 dB Bjøntegaard delta (BD)-peak-signal-to-noise ratio (PSNR), corresponding to 14.0%, 9.3% and 14.5% BD-rate savings were achieved on dense point clouds for the Y, Cb, and Cr components, respectively. The source code of our method is available at https://github.com/xjr998/GQE-Net.

Abstract:
Aggregating neighbor features is essential for point cloud neural networks. In existing work, each point in the cloud may inevitably be selected as a neighbor of multiple aggregation centers, as all centers gather neighbor features from the whole point cloud independently. Thus, each point has to participate in the calculation repeatedly, generating redundant duplicates in memory and leading to intensive computation costs and memory consumption. Meanwhile, to pursue higher accuracy, previous methods often rely on a complex local aggregator to extract fine geometric representation, further slowing down the processing pipeline. To address these issues, we propose a new local aggregator of linear complexity for point cloud analysis, coined APP. Specifically, we introduce an auxiliary container as an anchor to exchange features between the source point and the aggregating center. Each source point pushes its feature to only one auxiliary container, and each center point pulls features from only one auxiliary container. This avoids the re-computation issue of each source point. To facilitate the learning of the local structure of the point cloud, we use an online normal estimation module to provide explainable geometric information to enhance the modeling capability of APP. The resulting network is more efficient than all previous baselines by a clear margin while also consuming less memory. Experiments on classification and semantic segmentation demonstrate that APP-Net reaches accuracies comparable to other networks. In the classification task, it can process more than 10,000 samples per second with less than 10GB of memory on a single GPU. We will release the code at https://github.com/MCG-NJU/APP-Net.
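
The push/pull idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say how the auxiliary containers are formed, so this sketch assumes (hypothetically) that they are the cells of a uniform grid and that features are scalars; `cell_of` and `push_pull` are illustrative names, not the authors' API. The point is only that each source point writes once and each center reads once, so the cost grows linearly with the number of points and centers.

```python
from collections import defaultdict


def cell_of(point, cell_size):
    """Map a 3D point to the integer index of its grid cell (the assumed container)."""
    return tuple(int(c // cell_size) for c in point)


def push_pull(points, features, centers, cell_size):
    """Aggregate scalar features in O(len(points) + len(centers)) time."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Push: every source point writes its feature to exactly one container.
    for p, f in zip(points, features):
        key = cell_of(p, cell_size)
        sums[key] += f
        counts[key] += 1
    # Pull: every center reads the mean feature of exactly one container.
    pulled = []
    for c in centers:
        key = cell_of(c, cell_size)
        pulled.append(sums[key] / counts[key] if counts[key] else 0.0)
    return pulled
```

A center that falls in an empty container receives 0.0 here; that fallback is a choice made for the sketch, not something stated in the abstract.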

Abstract:
Recently, point-based networks have exhibited extraordinary potential for 3D point cloud processing. However, owing to the meticulous design of both parameters and hyperparameters inside the network, constructing a promising network for each point cloud task can be an expensive endeavor. In this work, we develop a novel one-shot search framework called Point-NAS to automatically determine optimum architectures for various point cloud tasks. Specifically, we design an elastic feature extraction (EFE) module that serves as a basic unit for architecture search, which expands seamlessly along both the width and depth of the network for efficient feature extraction. Based on the EFE module, we devise a search space, which is encoded into a supernet to provide a large number of latent network structures for a particular point cloud task. To fully optimize the weights of the supernet, we propose a weight coupling sandwich rule that samples the largest, smallest, and multiple medium models at each iteration and fuses their gradients to update the supernet. Furthermore, we present a united gradient adjustment algorithm that mitigates gradient conflict induced by distinct gradient directions of sampled models and supernet, thus expediting the convergence of the supernet and ensuring that it can be comprehensively trained. With these techniques, the trained supernet enables a multitude of subnets to be well optimized. Finally, we conduct an evolutionary search for the supernet under resource constraints to find promising architectures for different tasks. Experimentally, the searched Point-NAS with weights inherited from the supernet achieves outstanding results across a variety of benchmarks, i.e., 94.2% and 88.9% overall accuracy on ModelNet40 and ScanObjectNN, 68.6% mIoU on S3DIS, and 63.6% and 69.3% mAP@0.25 on the SUN RGB-D and ScanNet V2 datasets.

Abstract:
Image dehazing is an effective means to enhance the quality of images captured in foggy or hazy weather conditions. However, existing image dehazing methods are either ineffective in dealing with complex haze scenes or incur too much computation. To overcome these deficiencies, we propose a progressive feedback optimization network (PFONet) which is lightweight yet effective for image dehazing. The PFONet consists of a multi-stream dehazing module and a progressive feedback module. The progressive feedback module feeds the output dehazed image back to the intermediate features extracted by the network, thus enabling the network to gradually reconstruct a complex degraded image. Considering both the effectiveness and efficiency of the network, we also design a lightweight hybrid residual dense block serving as the basic feature extraction module of the proposed PFONet. Extensive experimental results are presented to demonstrate that the proposed model outperforms its state-of-the-art single-image dehazing competitors for both synthetic and real-world images.

Abstract:
Compared to typical multi-sensor systems, monocular 3D object detection has attracted much attention due to its simple configuration. However, there is still a significant gap between LiDAR-based and monocular-based methods. In this paper, we find that the ill-posed nature of monocular imagery can lead to depth ambiguity. Specifically, objects with different depths can appear with the same bounding boxes and similar visual features in the 2D image. Unfortunately, the network cannot accurately distinguish different depths from such non-discriminative visual features, resulting in unstable depth training. To facilitate depth learning, we propose a simple yet effective plug-and-play module, One Bounding Box Multiple Objects (OBMO). Concretely, we add a set of suitable pseudo labels by shifting the 3D bounding box along the viewing frustum. To constrain the pseudo-3D labels to be reasonable, we carefully design two label scoring strategies to represent their quality. In contrast to the original hard depth labels, such soft pseudo labels with quality scores allow the network to learn a reasonable depth range, boosting training stability and thus improving final performance. Extensive experiments on the KITTI and Waymo benchmarks show that our method improves state-of-the-art monocular 3D detectors by a significant margin (the improvements under the moderate setting on the KITTI validation set are 1.82–10.91% mAP in BEV and 1.18–9.36% mAP in 3D). Code has been released at https://github.com/mrsempress/OBMO.
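
The frustum shift described in the abstract above can be made concrete with a small sketch. It assumes a pinhole camera at the origin and a box center given as (x, y, z) in camera coordinates, so that moving the center along its viewing ray keeps its image projection fixed. The linear score in `make_pseudo_labels` is a placeholder invented for this illustration; it is not one of the paper's two scoring strategies, and all function names are hypothetical.

```python
def shift_along_frustum(center, dz):
    """Move a 3D box center along its viewing ray so the depth changes by dz."""
    x, y, z = center
    if z <= 0 or z + dz <= 0:
        raise ValueError("depth must stay positive")
    scale = (z + dz) / z  # scaling all coordinates keeps x/z and y/z unchanged
    return (x * scale, y * scale, z + dz)


def project(center, focal=1.0):
    """Pinhole projection of a camera-frame point onto the image plane."""
    x, y, z = center
    return (focal * x / z, focal * y / z)


def make_pseudo_labels(center, offsets, max_offset):
    """Return (shifted_center, score) pairs; the score is an assumed linear decay in |dz|."""
    labels = []
    for dz in offsets:
        score = max(0.0, 1.0 - abs(dz) / max_offset)
        labels.append((shift_along_frustum(center, dz), score))
    return labels
```

Every shifted center projects to the same image location as the original one, which is the depth ambiguity the abstract describes: the 2D evidence alone cannot tell these candidates apart, so they are kept as soft labels with lower scores.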

Affiliations: College of Information and Electrical Engineering, National Innovation Center for Digital Fishery, China Agricultural University, Beijing, China; College of Science, National Innovation Center for Digital Fishery, China Agricultural University, Beijing, China; College of Information and Electrical Engineering, National Innovation Center for Digital Fishery, Beijing Engineering and Technology Research Center for Internet of Things in Agriculture, Precision Agricultural Technology Integration Research Base (Fishery), Ministry of Agriculture and Rural Affairs, and the Key Laboratory of Agricultural Information Acquisition, Ministry of Agriculture, China Agricultural University, Beijing, China

Abstract:
Feature-based domain adaptation methods project samples from different domains into the same feature space and try to align the distributions of the two domains to learn an effective transferable model. The vital problem is how to find a proper way to reduce the domain shift and improve the discriminability of features. To address these issues, we propose a unified Probability-based Graph embedding Cross-domain and class Discriminative feature learning framework for unsupervised domain adaptation (PGCD). Specifically, we propose novel graph embedding structures to serve as the class discriminative transfer feature learning item and the cross-domain alignment item, which make same-category samples compact in each domain and fully align the local and global geometric structure across domains. In addition, two theoretical analyses are given to prove the interpretability of the proposed graph structures, which further describe the sample-to-sample relationships in single-domain and cross-domain transfer feature learning scenarios. Moreover, we adopt novel weight strategies based on probability information to generate robust centroids in each proposed item, enhancing the accuracy of transfer feature learning and reducing error accumulation. In comprehensive experiments against advanced approaches, the promising performance on the benchmark datasets verifies the effectiveness of the proposed model.

Abstract:
We investigate a novel multi-user mobile Virtual Reality (VR) arcade system for streaming scalable 8K 360° video with low interactive latency, while providing high remote scene immersion fidelity and application reliability. This is achieved through the integration of embedded multi-layer 360° tiling, edge computing, and wireless multi-connectivity that comprises sub-6 GHz and mmWave (millimeter wave) links. The sub-6 GHz band is used for broadcast of the base layer of the entire 360° panorama to all users, while the directed mmWave links are used for high-rate transmission of VR-enhancement layers that are specific to the viewports of the individual users. The viewport-specific enhancements can comprise compressed and raw 360° tiles, decoded first at the edge server. We aim to maximize the smallest immersion fidelity for the delivered 360° content across all VR users, given rate, latency and computing constraints. We characterize analytically the rate-distortion trade-offs across the spatiotemporal 360° panorama and the computing power required to decompress 360° tiles. The proposed solution consists of geometric programming algorithms and an intermediate step of graph-theoretic VR user to mmWave access point assignment. The results reveal a significant improvement (8–10 dB) in delivered VR user immersion fidelity and spatial resolution (8K vs. 4K) compared to a state-of-the-art method based on sub-6 GHz transmission only. We also show that an increasing number of raw 360° tiles are sent as the mmWave network link data rate or the edge server/user computing power increases. Finally, we demonstrate that in order to hypothetically deliver the same immersion fidelity, the reference method would incur a much higher (2.5–4.5x) system latency.

Abstract:
This paper proposes a decomposition called quaternion scalar and vector norm decomposition (QSVND) for approximation problems in color image processing. Unlike traditional quaternion norm approximations, which are always single-objective models (SOM), QSVND transforms the SOM into a bi-objective model (BOM). Furthermore, regularization is used as a common scalarization method to solve the BOM problem, converting the BOM into a more reasonable SOM. This can handle the over-fitting or under-fitting problems that are neglected in this kind of research on quaternion representation (QR) in color image processing, namely, how to treat the redundancy caused by the extra scalar part when the vector part of a quaternion is used to represent a color pixel. We apply QSVND to quaternion principal component analysis (QPCA) for color face recognition (FR), which can deal with the under-fitting of the vector part norm approximation. Comparisons with competing approaches on the AR, FERET, FEI, and KDEF&AKDEF databases consistently show the superiority of the proposed approach for color FR.

Abstract:
Federated learning is a privacy-preserving distributed learning paradigm where multiple devices collaboratively train a model, which is applicable to edge computing environments. However, the non-IID data distributed across multiple devices degrades the performance of the federated model due to severe weight divergence. This paper presents a clustered federated learning framework named cFedFN for visual classification tasks in order to reduce this degradation. Specifically, the framework introduces the computation of feature norm vectors in the local training process and divides the devices into multiple groups according to the similarities of their data distributions, reducing the weight divergence for better performance. As a result, the framework achieves better performance on non-IID data without leaking the private raw data. Experiments on various visual classification datasets demonstrate the superiority of this framework over state-of-the-art clustered federated learning frameworks.

Abstract:
Most recent methods for RGB (red–green–blue)-thermal salient object detection (SOD) involve many floating-point operations and have numerous parameters, resulting in slow inference, especially on common processors, and impeding their deployment on mobile devices for practical applications. To address these problems, we propose a lightweight spatial boosting network (LSNet) for efficient RGB-thermal SOD with a lightweight MobileNetV2 backbone to replace a conventional backbone (e.g., VGG, ResNet). To improve feature extraction using a lightweight backbone, we propose a boundary boosting algorithm that optimizes the predicted saliency maps and reduces information collapse in low-dimensional features. The algorithm generates boundary maps based on predicted saliency maps without incurring additional calculations or complexity. As multimodality processing is essential for high-performance SOD, we adopt attentive feature distillation and selection and propose semantic and geometric transfer learning to enhance the backbone without increasing the complexity during testing. Experimental results demonstrate that the proposed LSNet achieves state-of-the-art performance compared with 14 RGB-thermal SOD methods on three datasets while improving the numbers of floating-point operations (1.025G) and parameters (5.39M), model size (22.1 MB), and inference speed (9.95 fps for PyTorch, batch size of 1, and Intel i5-7500 processor; 93.53 fps for PyTorch, batch size of 1, and NVIDIA TITAN V graphics processor; 936.68 fps for PyTorch, batch size of 20, and graphics processor; 538.01 fps for TensorRT and batch size of 1; and 903.01 fps for TensorRT/FP16 and batch size of 1). The code and results are available at https://github.com/zyrant/LSNet.

Abstract:
Correlation operations and attention mechanisms are two popular feature fusion approaches that play an important role in visual object tracking. However, correlation-based tracking networks are sensitive to location information but lose some context semantics, while attention-based tracking networks can make full use of rich semantic information but ignore the position distribution of the tracked object. Therefore, in this paper, we propose a novel tracking framework based on joint correlation and attention networks, termed JCAT, which can effectively combine the advantages of these two complementary feature fusion approaches. Concretely, the proposed JCAT approach adopts parallel correlation and attention branches to generate position and semantic features. Then the fused features are obtained by directly adding the position features and the semantic features. Finally, the fused features are fed into the segmentation network to generate the pixel-wise state estimation of the object. Furthermore, we develop a segmentation memory bank and an online sample filtering mechanism for robust segmentation and tracking. Extensive experimental results on eight challenging visual tracking benchmarks show that the proposed JCAT tracker achieves very promising tracking performance and sets a new state-of-the-art on the VOT2018 benchmark.

Abstract:
In satellite videos, moving vehicles are extremely small-sized and densely clustered in vast scenes. Anchor-free detectors offer great potential by predicting the keypoints and boundaries of objects directly. However, for dense small-sized vehicles, most anchor-free detectors miss the dense objects because they do not consider the density distribution. Furthermore, weak appearance features and massive interference in the satellite videos limit the application of anchor-free detectors. To address these problems, a novel semantic-embedded density adaptive network (SDANet) is proposed. In SDANet, the cluster-proposals, including a variable number of objects, and centers are generated in parallel through pixel-wise prediction. Then, a novel density matching algorithm is designed to obtain each object by partitioning the cluster-proposals and matching the corresponding centers hierarchically and recursively. Meanwhile, the isolated cluster-proposals and centers are suppressed. In SDANet, the road is segmented in vast scenes and its semantic features are embedded into the network by weakly supervised learning, which guides the detector to emphasize the regions of interest. In this way, SDANet reduces the false detections caused by massive interference. To alleviate the lack of appearance information on small-sized vehicles, a customized bi-directional conv-RNN module extracts the temporal information from consecutive input frames by aligning the disturbed background. The experimental results on Jilin-1 and SkySat satellite videos demonstrate the effectiveness of SDANet, especially for dense objects.

Abstract:
To obtain high-resolution multi-spectral (HRMS) images by fusing low-resolution multi-spectral (LRMS) and panchromatic (PAN) images, an effective pansharpening model with spatial Hessian non-convex sparse and spectral gradient low rank priors (PSHNSSGLR) is proposed in this paper. In particular, from a statistical point of view, the spatial Hessian hyper-Laplacian non-convex sparse prior is developed to model the spatial Hessian consistency between HRMS and PAN. More importantly, this is the first work on pansharpening modeling with the spatial Hessian hyper-Laplacian non-convex sparse prior. Meanwhile, the spectral gradient low rank prior on HRMS is further developed for spectral feature preservation. Then, the alternating direction method of multipliers (ADMM) approach is applied to optimize the proposed PSHNSSGLR model. Finally, many fusion experiments demonstrate the capability and superiority of PSHNSSGLR.

Abstract:
Owing to the material identification ability provided by a large number of spectral bands, hyperspectral videos (HSVs) have great potential for object tracking. Most hyperspectral trackers employ manually designed features rather than deeply learned features to describe objects because few HSVs are available for training, leaving considerable room to improve the tracking performance. In this paper, we propose an end-to-end deep ensemble network (SEE-Net) to address this challenge. Specifically, we first establish a spectral self-expressive model to learn the band correlation, indicating the importance of a single band in forming hyperspectral data. We parameterize the optimization of the model with a spectral self-expressive module to learn the nonlinear mapping from input hyperspectral frames to band importance. In this way, the prior knowledge of bands is transformed into a learnable network architecture, which has high computational efficiency and can quickly adapt to changes in target appearance because no iterative optimization is needed. The band importance is further exploited in two ways. On the one hand, according to the band importance, each frame of the HSVs is divided into several three-channel false-color images, which are then used for deep feature extraction and localization. On the other hand, based on the band importance, the importance of each false-color image is computed, which is then used to assemble the tracking results from individual false-color images. In this way, the unreliable tracking caused by false-color images of low importance can be suppressed to a large extent. Extensive experimental results show that SEE-Net performs favorably against state-of-the-art approaches. The source code will be available at https://github.com/hscv/SEE-Net.

Abstract:
We present the outcomes of a recent large-scale subjective study of Mobile Cloud Gaming Video Quality Assessment (MCG-VQA) on a diverse set of gaming videos. Rapid advancements in cloud services, faster video encoding technologies, and increased access to high-speed, low-latency wireless internet have all contributed to the exponential growth of the Mobile Cloud Gaming industry. Consequently, the development of methods to assess the quality of real-time video feeds to end-users of cloud gaming platforms has become increasingly important. However, due to the lack of a large-scale public Mobile Cloud Gaming Video dataset containing a diverse set of distorted videos with corresponding subjective scores, there has been limited work on the development of MCG-VQA models. To accelerate progress toward these goals, we created a new dataset, named the LIVE-Meta Mobile Cloud Gaming (LIVE-Meta-MCG) video quality database, composed of 600 landscape and portrait gaming videos, on which we collected 14,400 subjective quality ratings from an in-lab subjective study. Additionally, to demonstrate the usefulness of the new resource, we benchmarked multiple state-of-the-art VQA algorithms on the database. The new database will be made publicly available on our website: https://live.ece.utexas.edu/research/LIVE-Meta-Mobile-Cloud-Gaming/index.html

Abstract:
Self-supervised learning enables networks to learn discriminative features from massive data itself. Most state-of-the-art methods maximize the similarity between two augmentations of one image based on contrastive learning. By utilizing the consistency of two augmentations, the burden of manual annotation can be removed. Contrastive learning exploits instance-level information to learn robust features. However, the learned information is probably confined to different views of the same instance. In this paper, we attempt to leverage the similarity between two distinct images to boost representation in self-supervised learning. In contrast to instance-level information, the similarity between two distinct images may provide more useful information. In addition, we analyze the relation between similarity loss and feature-level cross-entropy loss. These two losses are essential for most deep learning methods, but the relation between them is not clear. Similarity loss helps obtain instance-level representation, while feature-level cross-entropy loss helps mine the similarity between two distinct images. We provide theoretical analyses and experiments to show that a suitable combination of these two losses can achieve state-of-the-art results. Code is available at https://github.com/guijiejie/ICCL.

Abstract:
Imagery collected from outdoor visual environments is often degraded due to the presence of dense smoke or haze. A key challenge for research in scene understanding in these degraded visual environments (DVE) is the lack of representative benchmark datasets. These datasets are required to evaluate state-of-the-art object recognition and other computer vision algorithms in degraded settings. In this paper, we address some of these limitations by introducing the first realistic haze image benchmark, from both aerial and ground view, with paired haze-free images, and in-situ haze density measurements. This dataset was produced in a controlled environment with professional smoke generating machines that covered the entire scene, and consists of images captured from the perspective of both an unmanned aerial vehicle (UAV) and an unmanned ground vehicle (UGV). We also evaluate a set of representative state-of-the-art dehazing approaches as well as object detectors on the dataset. The full dataset presented in this paper, including the ground truth object classification bounding boxes and haze density measurements, is provided for the community to evaluate their algorithms at: https://a2i2-archangel.vision. A subset of this dataset has been used for the “Object Detection in Haze” Track of CVPR UG2 2022 challenge at https://cvpr2022.ug2challenge.org/track1.html.

Abstract:
Inspired by Active Learning and 2D-3D semantic fusion, we propose a novel framework for 3D scene semantic segmentation based on rendered 2D images, which can efficiently achieve semantic segmentation of any large-scale 3D scene with only a few 2D image annotations. In our framework, we first render perspective images at certain positions in the 3D scene. Then we continuously fine-tune a pre-trained network for image semantic segmentation and project all dense predictions to the 3D model for fusion. In each iteration, we evaluate the 3D semantic model, re-render images in several representative areas where the 3D segmentation is not stable, and send them to the network for training after annotation. This iterative process of rendering, segmentation, and fusion effectively generates difficult-to-segment image samples in the scene while avoiding complex 3D annotations, thereby achieving label-efficient 3D scene segmentation. Experiments on three large-scale indoor and outdoor 3D datasets demonstrate the effectiveness of the proposed method compared with other state-of-the-art methods.

Affiliations: College of Information and Communication Engineering and the Key Laboratory of Advanced Marine Communication and Information Technology, Ministry of Industry and Information Technology, Harbin Engineering University, Harbin, China; College of Architectural Engineering, Civil Engineering and Environment, Ningbo University, Ningbo, China; School of Information and Electronics, Beijing Institute of Technology, Beijing, China; School of Engineering and Information Technology, University of New South Wales, Canberra, ACT, Australia

Abstract:
Deep learning (DL) based methods represented by convolutional neural networks (CNNs) are widely used in hyperspectral image classification (HSIC). Some of these methods have a strong ability to extract local information, but their extraction of long-range features is slightly inefficient, while others are just the opposite. For example, limited by their receptive fields, CNNs have difficulty capturing contextual spectral-spatial features from a long-range spectral-spatial relationship. Besides, the success of DL-based methods is greatly attributed to numerous labeled samples, whose acquisition is time-consuming and costly. To resolve these problems, a hyperspectral classification framework based on multi-attention Transformer (MAT) and adaptive superpixel segmentation-based active learning (MAT-ASSAL) is proposed, which achieves excellent classification performance, especially under the condition of small-size samples. Firstly, a multi-attention Transformer network is built for HSIC. Specifically, the self-attention module of the Transformer is applied to model long-range contextual dependency between spectral-spatial embeddings. Moreover, in order to capture local features, an outlook-attention module, which can efficiently encode fine-level features and contexts into tokens, is utilized to improve the correlation between the center spectral-spatial embedding and its surroundings. Secondly, aiming to train an excellent MAT model with limited labeled samples, a novel active learning (AL) method based on superpixel segmentation is proposed to select important samples for MAT. Finally, to better integrate local spatial similarity into active learning, an adaptive superpixel (SP) segmentation algorithm, which can save SPs in uninformative regions and preserve edge details in complex regions, is employed to generate better local spatial constraints for AL. Quantitative and qualitative results indicate that MAT-ASSAL outperforms seven state-of-the-art methods on three HSI datasets.

Abstract:
We present Skeleton-CutMix, a simple and effective skeleton augmentation framework for supervised domain adaptation and show its advantage in skeleton-based action recognition tasks. Existing approaches usually perform domain adaptation for action recognition with elaborate loss functions that aim to achieve domain alignment. However, they fail to capture the intrinsic characteristics of skeleton representation. Benefiting from the well-defined correspondence between bones of a pair of skeletons, we instead mitigate domain shift by fabricating skeleton data in a mixed domain, which mixes up bones from the source domain and the target domain. The fabricated skeletons in the mixed domain can be used to augment training data and train a more general and robust model for action recognition. Specifically, we hallucinate new skeletons by using pairs of skeletons from the source and target domains; a new skeleton is generated by exchanging some bones from the skeleton in the source domain with corresponding bones from the skeleton in the target domain, which resembles a cut-and-mix operation. When exchanging bones from different domains, we introduce a class-specific bone sampling strategy so that bones that are more important for an action class are exchanged with higher probability when generating augmentation samples for that class. We show experimentally that the simple bone exchange strategy for augmentation is efficient and effective and that distinctive motion features are preserved while mixing both action and style across domains. We validate our method in cross-dataset and cross-age settings on NTU-60 and ETRI-Activity3D datasets with an average gain of over 3% in terms of action recognition accuracy, and demonstrate its superior performance over previous domain adaptation approaches as well as other skeleton augmentation strategies.
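
The bone exchange described in the abstract above can be sketched in a few lines. This is only an illustration under stated assumptions: each skeleton is a list of per-bone vectors in a shared bone order (the well-defined correspondence the abstract relies on), and `swap_probs` stands in for the class-specific bone sampling probabilities, whose actual computation is not given in the abstract. The function name and data layout are hypothetical, and temporal sequences are ignored.

```python
import random


def cutmix_bones(source_bones, target_bones, swap_probs, rng):
    """Build a mixed skeleton by replacing some source bones with the matching target bones."""
    if not (len(source_bones) == len(target_bones) == len(swap_probs)):
        raise ValueError("skeletons and probabilities must cover the same bones")
    mixed = []
    for src, tgt, p in zip(source_bones, target_bones, swap_probs):
        # Bones with a higher probability are exchanged more often.
        mixed.append(tgt if rng.random() < p else src)
    return mixed
```

Passing an explicit `random.Random(seed)` as `rng` keeps the augmentation reproducible; with probabilities of exactly 0.0 or 1.0 the outcome is deterministic, which is convenient for checking the behavior.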

Abstract:
End-to-end Long Short-Term Memory (LSTM) has been successfully applied to video summarization. However, the weakness of the LSTM model, poor generalization with inefficient representation learning for input nodes, limits its capability to efficiently carry out node classification within user-created videos. Given the power of Graph Neural Networks (GNNs) in representation learning, we adopted the Graph Information Bottleneck (GIB) to develop a Contextual Feature Transformation (CFT) mechanism that refines the temporal dual-feature, yielding a semantic representation with attention alignment. Furthermore, a novel Salient-Area-Size-based spatial attention model is presented to extract frame-wise visual features based on the observation that humans tend to focus on sizable and moving objects. Lastly, semantic representation is embedded within attention alignment under the end-to-end LSTM framework to differentiate indistinguishable images. Extensive experiments demonstrate that the proposed method outperforms State-Of-The-Art (SOTA) methods.

Abstract:
Salient object detection (SOD) aims to identify the most visually distinctive object(s) in each given image. Most recent progress focuses on either adding elaborate connections among different convolution blocks or introducing boundary-aware supervision to help achieve better segmentation, which actually moves away from the essence of SOD, i.e., distinctiveness/salience. This paper goes back to the roots of SOD and investigates the principles of how to identify distinctive object(s) in a more effective and efficient way. Intuitively, the salience of one object should largely depend on its global context within the input image. Based on this, we devise a clean yet effective architecture for SOD, named Collaborative Content-Dependent Networks (CCD-Net). In detail, we propose a collaborative content-dependent head whose parameters are conditioned on the input image’s global context information. Within the content-dependent head, a hand-crafted multi-scale (HMS) module and a self-induced (SI) module are carefully designed to collaboratively generate content-aware convolution kernels for prediction. Benefiting from the content-dependent head, CCD-Net is capable of leveraging global context to detect distinctive object(s) while keeping a simple encoder-decoder design. Extensive experimental results demonstrate that our CCD-Net achieves state-of-the-art results on various benchmarks. Our architecture is simple and intuitive compared to previous solutions, resulting in competitive characteristics with respect to model complexity, operating efficiency, and segmentation accuracy.

Abstract:
Human action recognition is a driving engine of many human-computer interaction applications. Most current research focuses on improving model generalization by integrating multiple homogeneous modalities, including RGB images, human poses, and optical flows. Furthermore, contextual interactions and out-of-context sign languages have been validated to depend on the scene category and the human per se. Attempts to integrate appearance features and human poses have shown positive results. However, because of the spatial errors and temporal ambiguities of human poses, existing methods are subject to poor scalability, limited robustness, and sub-optimal models. In this paper, inspired by the assumption that different modalities may maintain temporal consistency and spatial complementarity, we present a novel Bi-directional Co-temporal and Cross-spatial Attention Fusion Model (B2C-AFM). Our model is characterized by the asynchronous fusion strategy of multi-modal features along temporal and spatial dimensions. Besides, novel explicit motion-oriented pose representations called Limb Flow Fields (Lff) are explored to alleviate the temporal ambiguity regarding human poses. Experiments on publicly available datasets validate our contributions. Extensive ablation studies experimentally show that B2C-AFM achieves robust performance across seen and unseen human actions. The code is available at https://github.com/gftww/B2C.git.

Abstract:
Face recognition has achieved remarkable success owing to the development of deep learning. However, most existing face recognition models perform poorly against pose variations. We argue that this is primarily caused by pose-based long-tailed data, i.e., the imbalanced distribution of training samples between profile faces and near-frontal faces. Additionally, self-occlusion and nonlinear warping of facial textures caused by large pose variations also increase the difficulty of learning discriminative features of profile faces. In this study, we propose a novel framework called Symmetrical Siamese Network (SSN), which can simultaneously overcome the limitation of pose-based long-tailed data and learn pose-invariant features. Specifically, two sub-modules are proposed in the SSN, i.e., the Feature-Consistence Learning sub-Net (FCLN) and the Identity-Consistence Learning sub-Net (ICLN). For FCLN, the inputs are all face images in the training dataset. Inspired by contrastive learning, we simulate pose variations of faces and constrain the model to focus on the consistent areas between the original face image and its corresponding virtual pose face images. For ICLN, only profile images are used as inputs, and we propose to adopt an Identity Consistence Loss to minimize the intra-class feature variation across different poses. The collaborative learning of the two sub-modules guarantees that the parameters of the network are updated with relatively equal probability between near-frontal face images and profile images, so that the pose-based long-tailed problem can be effectively addressed. The proposed SSN achieves results comparable to the state-of-the-art methods on several public datasets. In this study, LightCNN is selected as the backbone of SSN, and other popular networks can also be used in our framework for pose-robust face recognition.

Abstract:
Counting objects in crowded scenes remains a challenge in computer vision. Current deep-learning-based approaches often formulate it as a Gaussian density regression problem. Such a brute-force regression, though effective, may not properly account for the annotation displacement that arises from the human annotation process and may lead to different distributions. We conjecture that it would be beneficial to consider the annotation displacement in the dense object counting task. To obtain strong robustness against annotation displacement, a generalized Gaussian distribution (GGD) function with tunable bandwidth and shape parameters is exploited to form the learning target, the point annotation probability map (PAPM). Specifically, we first present a hand-designed PAPM method (HD-PAPM), in which we design a function based on GGD to tolerate the annotation displacement. For end-to-end training, the hand-designed PAPM may not be optimal for the particular network and dataset, so an adaptively learned PAPM method (AL-PAPM) is also proposed. To improve the robustness to annotation displacement, we design an effective transport cost function based on GGD. The proposed PAPM can be integrated with other methods. We also combine PAPM with P2PNet by modifying the matching cost matrix, forming P2P-PAPM. This could also improve the robustness of P2PNet to annotation displacement. Extensive experiments show the superiority of our proposed methods.
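
To make the role of the bandwidth and shape parameters concrete, the following is a minimal sketch of a radial generalized Gaussian kernel around a point annotation. The exact functional form used by HD-PAPM is not given in the abstract; the unnormalized profile exp(-(r/alpha)^beta), the sum-to-one normalization, and the name `ggd_kernel` are assumptions.

```python
import numpy as np

def ggd_kernel(size, alpha, beta):
    """Radial generalized Gaussian kernel on a size x size grid (size must be odd).

    alpha : bandwidth (scale) parameter, in pixels.
    beta  : shape parameter; beta = 2 gives an ordinary Gaussian with
            sigma = alpha / sqrt(2), larger beta gives a flatter top.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center pixel")
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    r = np.hypot(xs, ys)                  # distance from the annotated point
    kernel = np.exp(-(r / alpha) ** beta)
    return kernel / kernel.sum()          # normalize so one point contributes count 1
```

With a larger shape parameter the kernel stays nearly constant close to the center, so a small shift of the annotated point changes the target map less, which is one plausible way such a function can tolerate annotation displacement.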

Abstract:
This paper presents a Semantic Positioning System (SPS) to enhance the accuracy of mobile device geo-localization in outdoor urban environments. Although the traditional Global Positioning System (GPS) can offer a rough localization, it lacks the necessary accuracy for applications such as Augmented Reality (AR). Our SPS integrates Geographic Information System (GIS) data, GPS signals, and visual image information to estimate the 6 Degree-of-Freedom (DoF) pose through cross-view semantic matching. This approach has excellent scalability to support GIS context with Levels of Detail (LOD). The map data representation is a Digital Elevation Model (DEM), a cost-effective aerial map that allows for fast deployment over large-scale areas. However, the DEM lacks geometric and texture details, making it challenging for traditional visual feature extraction to establish pixel/voxel level cross-view correspondences. To address this, we sample observation pixels from the query ground-view image using predicted semantic labels. We then propose an iterative homography estimation method with semantic correspondences. To improve the efficiency of the overall system, we further employ a heuristic search to speed up the matching process. The proposed method is robust, real-time, and automatic. Quantitative experiments on the challenging Bund dataset show that we achieve a positioning accuracy of 73.24%, surpassing the baseline skyline-based method by 20%. Compared with the state-of-the-art semantic-based approach on the KITTI dataset, we improve the positioning accuracy by an average of 5%.

Affiliations: Institute of Artificial Intelligence and Blockchain, Guangzhou University, Guangdong, China; Engineering Research Center of Digital Forensics, Ministry of Education, School of Computer and Software, Nanjing University of Information Science and Technology, Nanjing, China; Department of Electrical and Computer Engineering, University of Windsor, Windsor, Canada; School of Automation, Southeast University, Nanjing, China; Department of Electrical and Computer Engineering, Western University, London, Canada

Abstract:
Human action recognition (HAR) is one of the most important tasks in video analysis. Since video clips distributed on networks are usually untrimmed, an untrimmed video must be accurately segmented into a set of action segments for HAR. As an unsupervised temporal segmentation technology, subspace clustering learns the codes from each video to construct an affinity graph, and then cuts the affinity graph to cluster the video into a set of action segments. However, most existing subspace clustering schemes ignore not only the sequential information of frames in code learning, but also the negative effects of noise when cutting the affinity graph, which leads to inferior performance. To address these issues, we propose a sequential order-aware coding-based robust subspace clustering (SOAC-RSC) scheme for HAR. By feeding the motion features of video frames into multi-layer neural networks, two expressive code matrices are learned in a sequential order-aware manner from unconstrained and constrained videos, respectively, to construct the corresponding affinity graphs. Then, taking the effects of noise into account, a simple yet robust cutting algorithm is proposed to cut the constructed affinity graphs and accurately obtain the action segments for HAR. Extensive experiments demonstrate that the proposed SOAC-RSC scheme achieves state-of-the-art performance on the Keck Gesture and Weizmann datasets, and provides competitive performance on the other 6 public datasets, such as UCF101 and URADL, for the HAR task, compared to recent related approaches.

Affiliations: School of Computer Science, Jiangsu Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET), and Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology (NUIST), Nanjing, China; School of Computer Science and Engineering, Nanjing University of Science and Technology (NJUST), Nanjing, China; School of Electronic and Electrical Engineering, Sungkyunkwan University, Suwon, South Korea

Abstract:
Low-rank tensor representation has enjoyed a strong reputation in many hyperspectral image (HSI) low-level vision applications, but previous studies often failed to comprehensively exploit the low-rank nature of HSI along different modes in a low-dimensional subspace, and typically handled only one specific task. To address these challenges, in this paper, we find that, in addition to the spatial correlation, the spectral dependency of HSI also implicitly exists in the coefficient tensor of its subspace; this crucial dependency was not fully utilized by previous studies, yet it can be effectively exploited in a cascaded manner. This led us to propose a unified subspace low-rank learning regime with a new tensor cascaded rank minimization, named STCR, to fully couple the low-rankness of HSI in different domains for various low-level vision tasks. Technically, the high-dimensional HSI is first projected into a low-dimensional tensor subspace, and then a novel tensor low-cascaded-rank decomposition is designed to collapse the constructed tensor into three core tensors in succession to more thoroughly exploit the correlations in the spatial, nonlocal, and spectral modes of the coefficient tensor. Next, difference continuity-regularization is introduced to learn a basis that more closely approximates the HSI’s endmembers. The proposed regime realizes a comprehensive delineation of the self-portrait of the HSI tensor. Extensive evaluations conducted with dozens of state-of-the-art (SOTA) baselines on eight datasets verified that the proposed regime is highly effective and robust for typical HSI low-level vision tasks, including denoising, compressive sensing reconstruction, inpainting, and destriping. The source code of our method is released at https://github.com/CX-He/STCR.git.

Abstract:
Blind visual quality assessment (BVQA) on 360° video plays a key role in optimizing immersive multimedia systems. When assessing the quality of 360° video, humans tend to perceive its quality degradation progressively, from the viewport-based spatial distortion of each spherical frame to the motion artifacts across adjacent frames, ending with the video-level quality score, i.e., a progressive quality assessment paradigm. However, the existing BVQA approaches for 360° video neglect this paradigm. In this paper, we take into account the progressive paradigm of human perception towards spherical video quality, and thus propose a novel BVQA approach (namely ProVQA) for 360° video via progressively learning from pixels, frames and video. Corresponding to the progressive learning of pixels, frames and video, three sub-nets are designed in our ProVQA approach, i.e., the spherical perception aware quality prediction (SPAQ), motion perception aware quality prediction (MPAQ) and multi-frame temporal non-local (MFTN) sub-nets. The SPAQ sub-net first models the spatial quality degradation based on the spherical perception mechanism of humans. Then, by exploiting motion cues across adjacent frames, the MPAQ sub-net properly incorporates motion contextual information for quality assessment on 360° video. Finally, the MFTN sub-net aggregates multi-frame quality degradation to yield the final quality score, by exploring long-term quality correlation across multiple frames. The experiments validate that our approach significantly advances the state-of-the-art BVQA performance on 360° video over two datasets. The code is publicly available at https://github.com/yanglixiaoshen/ProVQA.

Abstract:
In this paper, a joint decision tree and visual feature optimization rate control scheme for ultrahigh-definition (UHD) versatile video coding (VVC) is proposed. First, we design a new rate-distortion (R-D) model for UHD videos, and we establish a decision-tree-based multiclass classification scheme to improve the prediction accuracy of the R-D model by fully considering visual features. Second, based on the proposed R-D model, the globally optimal solution is obtained through convex optimization. Finally, we embed our algorithm into the latest VVC reference software, VTM 10.2. According to our experimental results, compared with the latest algorithm in VTM 10.2 and other state-of-the-art algorithms, our method can achieve significant bit rate reductions while maintaining a given peak signal-to-noise ratio (PSNR) or structural similarity index measure (SSIM).

Abstract:
Face photo-sketch synthesis tasks have been dominated by convolutional neural networks (CNNs), especially CNN-based generative adversarial networks (GANs), because of their strong texture modeling capabilities and thus their ability to generate more realistic face photos/sketches than traditional methods. However, due to CNNs’ locality and spatial invariance properties, they have weaknesses in capturing the global and structural information that is extremely important for face images. Inspired by the recent phenomenal success of the Transformer in vision tasks, we propose replacing CNNs with Transformers that are able to model long-range dependencies to synthesize more structured and realistic face images. However, the existing vision Transformers are mainly designed for high-level vision tasks and lack the dense prediction ability to generate high-resolution images due to the quadratic computational complexity of their self-attention mechanism. In addition, the original Transformer is not capable of modeling local correlations, which is an important ability for image generation. To address these challenges, we propose two types of memory-friendly Transformer encoders, one for processing local correlations via local self-attention and another for modeling global information via global self-attention. By integrating the two proposed Transformer encoders, we present an efficient GL-Transformer for face photo-sketch synthesis, which can synthesize realistic face photo/sketch images from coarse to fine. Extensive experiments demonstrate that our model achieves performance comparable to or better than the state-of-the-art CNN-based methods both qualitatively and quantitatively.

Abstract:
Deep learning has demonstrated its power in image rectification by leveraging the representation capacity of deep neural networks via supervised training based on a large-scale synthetic dataset. However, the model may overfit the synthetic images and not generalize well to real-world fisheye images due to the limited universality of a specific distortion model and the lack of explicit modeling of the distortion and rectification process. In this paper, we propose a novel self-supervised image rectification (SIR) method based on an important insight that the rectified results of distorted images of the same scene from different lenses should be the same. Specifically, we devise a new network architecture with a shared encoder and several prediction heads, each of which predicts the distortion parameter of a specific distortion model. We further leverage a differentiable warping module to generate the rectified images and re-distorted images from the distortion parameters and exploit the intra- and inter-model consistency between them during training, thereby leading to a self-supervised learning scheme without the need for ground-truth distortion parameters or normal images. Experiments on a synthetic dataset and real-world fisheye images demonstrate that our method achieves comparable or even better performance than the supervised baseline method and representative state-of-the-art (SOTA) methods. The proposed self-supervised method also provides a possible way to improve the universality of distortion models while keeping their self-consistency. Code and datasets will be available at https://github.com/loong8888/SIR.

Abstract:
Existing deraining methods focus mainly on a single input image. However, with just a single input image, it is extremely difficult to accurately detect and remove rain streaks, in order to restore a rain-free image. In contrast, a light field image (LFI) embeds abundant 3D structure and texture information of the target scene by recording the direction and position of each incident ray via a plenoptic camera. LFIs are becoming popular in the computer vision and graphics communities. However, making full use of the abundant information available from LFIs, such as 2D array of sub-views and the disparity map of each sub-view, for effective rain removal is still a challenging problem. In this paper, we propose a novel method, 4D-MGP-SRRNet, for rain streak removal from LFIs. Our method takes as input all sub-views of a rainy LFI. To make full use of the LFI, it adopts 4D convolutional layers to simultaneously process all sub-views of the LFI. In the pipeline, the rain detection network, MGPDNet, with a novel Multi-scale Self-guided Gaussian Process (MSGP) module is proposed to detect high-resolution rain streaks from all sub-views of the input LFI at multi-scales. Semi-supervised learning is introduced for MSGP to accurately detect rain streaks by training on both virtual-world rainy LFIs and real-world rainy LFIs at multi-scales via computing pseudo ground truths for real-world rain streaks. We then feed all sub-views subtracting the predicted rain streaks into a 4D convolution-based Depth Estimation Residual Network (DERNet) to estimate the depth maps, which are later converted into fog maps. Finally, all sub-views concatenated with the corresponding rain streaks and fog maps are fed into a powerful rainy LFI restoring model based on the adversarial recurrent neural network to progressively eliminate rain streaks and recover the rain-free LFI. 
Extensive quantitative and qualitative evaluations conducted on both synthetic LFIs and real-world LFIs demonstrate the effectiveness of our proposed method.

Abstract:
Digital images often suffer from the common problem of stripe noise due to the inconsistent bias of each column. The existence of the stripe makes image denoising much more difficult, since it requires another n parameters, where n is the width of the image, to characterize the total interference of the observed image. This paper proposes a novel EM-based framework for simultaneous stripe estimation and image denoising. The great benefit of the proposed framework is that it splits the overall destriping and denoising problem into two independent sub-problems, i.e., calculating the conditional expectation of the true image given the observation and the estimated stripe from the last round of iteration, and estimating the column means of the residual image, such that a Maximum Likelihood Estimation (MLE) is guaranteed and no explicit parametric modeling of image priors is required. The calculation of the conditional expectation is the key; here we choose a modified Non-Local Means algorithm to calculate it because it has been proven to be a consistent estimator under some conditions. Besides, if we relax the consistency requirement, the conditional expectation could be interpreted as a general image denoiser. Therefore, other state-of-the-art image denoising algorithms have the potential to be incorporated into the proposed framework. Extensive experiments have demonstrated the superior performance of the proposed algorithm and provide some promising results that motivate future research on the EM-based destriping and denoising framework.
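
The alternation between the two sub-problems can be sketched as follows. This is only an illustration under stated assumptions: the observation is modeled as the true image plus one additive bias per column, the denoiser is passed in as a callable, and `row_box_filter` is a crude horizontal moving average standing in for the modified Non-Local Means estimator named in the abstract. The function names are hypothetical.

```python
import numpy as np

def row_box_filter(img, width=5):
    """Horizontal moving average (odd width); a stand-in denoiser, not Non-Local Means."""
    pad = width // 2
    padded = np.pad(img, ((0, 0), (pad, pad)), mode="edge")
    return sum(padded[:, i:i + img.shape[1]] for i in range(width)) / width

def destripe(observed, denoise=row_box_filter, n_iter=10):
    """Alternate between denoising and column-mean stripe estimation.

    observed : 2D array, assumed to equal true image + per-column bias + noise.
    denoise  : callable mapping a 2D array to its denoised estimate.
    Returns the final image estimate and the estimated per-column stripe.
    """
    stripe = np.zeros(observed.shape[1])
    for _ in range(n_iter):
        # Step 1: estimate the true image given the current stripe estimate.
        clean = denoise(observed - stripe)
        # Step 2: the stripe is the column mean of the residual image.
        stripe = (observed - clean).mean(axis=0)
    return denoise(observed - stripe), stripe
```

Because the denoiser is a parameter, any other image denoiser can be swapped in, which mirrors the abstract's remark that the conditional expectation can be read as a general denoising step. Note that a constant offset shared by all columns cannot be separated from image brightness in this sketch.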

Abstract:
In this work, we propose a new deep image compression framework called Complexity and Bitrate Adaptive Network (CBANet) that aims to learn one single network to support variable bitrate coding under various computational complexity levels. In contrast to the existing state-of-the-art learning-based image compression frameworks that only consider the rate-distortion trade-off without introducing any constraint related to the computational complexity, our CBANet considers the complex rate-distortion-complexity trade-off when learning a single network to support multiple computational complexity levels and variable bitrates. Since it is a non-trivial task to solve such a rate-distortion-complexity related optimization problem, we propose a two-step approach to decouple this complex optimization task into a complexity-distortion optimization sub-task and a rate-distortion optimization sub-task, and additionally propose a new network design strategy by introducing a Complexity Adaptive Module (CAM) and a Bitrate Adaptive Module (BAM) to respectively achieve the complexity-distortion and rate-distortion trade-offs. As a general approach, our network design strategy can be readily incorporated into different deep image compression methods to achieve complexity and bitrate adaptive image compression by using a single network. Comprehensive experiments on two benchmark datasets demonstrate the effectiveness of our CBANet for deep image compression. Code is released at https://github.com/JinyangGuo/CBANet-release.

Abstract:
One-class classification aims to learn one-class models from only in-class training samples. Because out-of-class samples are unavailable during training, most conventional deep learning based methods suffer from the feature collapse problem. In contrast, contrastive learning based methods can learn features from only in-class samples but are hard to train end-to-end with one-class models. To address the aforementioned problems, we propose an alternating direction method of multipliers based sparse representation network (ADMM-SRNet). ADMM-SRNet contains the heterogeneous contrastive feature (HCF) network and the sparse dictionary (SD) network. The HCF network learns in-class heterogeneous contrastive features by using contrastive learning with heterogeneous augmentations. Then, the SD network models the distributions of the in-class training samples by using dictionaries computed based on ADMM. By coupling the HCF network, the SD network and the proposed loss functions, our method can effectively learn discriminative features and one-class models of the in-class training samples in an end-to-end trainable manner. Experimental results show that the proposed method outperforms state-of-the-art methods on the CIFAR-10, CIFAR-100 and ImageNet-30 datasets under one-class classification settings. Code is available at https://github.com/nchucvml/ADMM-SRNet.

Abstract:
In this work, we address the challenging task of few-shot and zero-shot 3D point cloud semantic segmentation. The success of few-shot semantic segmentation in 2D computer vision is mainly driven by pre-training on large-scale datasets like ImageNet. The feature extractor pre-trained on large-scale 2D datasets greatly helps 2D few-shot learning. However, the development of 3D deep learning is hindered by the limited volume and instance modality of datasets due to the significant cost of 3D data collection and annotation. This results in less representative features and large intra-class feature variation for few-shot 3D point cloud segmentation. As a consequence, directly extending existing popular prototypical methods of 2D few-shot classification/segmentation to 3D point cloud segmentation does not work as well as in the 2D domain. To address this issue, we propose a Query-Guided Prototype Adaption (QGPA) module to adapt the prototype from the support point cloud feature space to the query point cloud feature space. With such prototype adaption, we greatly alleviate the issue of large intra-class feature variation in point clouds and significantly improve the performance of few-shot 3D segmentation. Besides, to enhance the representation of prototypes, we introduce a Self-Reconstruction (SR) module that enables the prototype to reconstruct the support mask as well as possible. Moreover, we further consider zero-shot 3D point cloud semantic segmentation, where there is no support sample. To this end, we introduce category words as semantic information and propose a semantic-visual projection model to bridge the semantic and visual spaces. Our proposed method surpasses state-of-the-art algorithms by a considerable 7.90% and 14.82% under the 2-way 1-shot setting on the S3DIS and ScanNet benchmarks, respectively.
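
For readers unfamiliar with the prototypical baseline that the abstract refers to, the following minimal sketch shows the generic idea: a class prototype is the mean feature of the labeled support points, and each query point takes the label of its nearest prototype. It does not implement QGPA or the SR module; the function names and the use of squared Euclidean distance are assumptions.

```python
import numpy as np

def class_prototypes(support_feats, support_labels, n_classes):
    """Mean support feature per class.

    support_feats  : (N, D) per-point features from the support point clouds.
    support_labels : (N,) integer labels in [0, n_classes); every class is
                     assumed to have at least one support point.
    """
    return np.stack([
        support_feats[support_labels == c].mean(axis=0)
        for c in range(n_classes)
    ])

def segment_query(query_feats, prototypes):
    """Label each query point with the index of its nearest prototype."""
    diff = query_feats[:, None, :] - prototypes[None, :, :]   # (M, C, D)
    sq_dist = (diff ** 2).sum(axis=-1)                         # (M, C)
    return sq_dist.argmin(axis=1)
```

When support and query features of the same class lie far apart, the nearest-prototype rule above degrades, which is the intra-class variation problem that adapting prototypes toward the query feature space is meant to reduce.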

Abstract:
Restricted by observation conditions, some scarce targets in synthetic aperture radar (SAR) images have only a few samples, making effective classification a challenging task. Although few-shot SAR target classification methods originating from meta-learning have made great breakthroughs recently, they only focus on object-level (global) feature extraction while ignoring part-level (local) features, resulting in degraded performance in fine-grained classification. To tackle this issue, a novel few-shot fine-grained classification framework, dubbed HENC, is proposed in this article. In HENC, the hierarchical embedding network (HEN) is designed for the extraction of multi-scale features at both the object level and the part level. In addition, scale-channels are constructed to realize joint inference of multi-scale features. Moreover, it is observed that existing meta-learning-based methods only implicitly utilize the information of multiple base categories to construct the feature space of novel categories, resulting in a scattered feature distribution and large deviation during novel center estimation. In view of this, the center calibration algorithm is proposed to explore the center information of base categories and explicitly calibrate the novel centers by dragging them closer to the real ones. Experimental results on two open benchmark datasets demonstrate that HENC significantly improves the classification accuracy for SAR targets.

Abstract:
Non-convex relaxation methods have been widely used in tensor recovery problems and can achieve better recovery results than convex relaxation methods. In this paper, a new non-convex function, the Minimax Logarithmic Concave Penalty (MLCP) function, is proposed, and some of its intrinsic properties are analyzed; notably, the Logarithmic function is found to be an upper bound of the MLCP function. The proposed function is generalized to tensor cases, yielding the tensor MLCP and the weighted tensor Lγ-norm. Because an explicit solution cannot be obtained when applying it directly to the tensor recovery problem, the corresponding equivalence theorems for solving such problems are given, namely, the tensor equivalent MLCP theorem and the equivalent weighted tensor Lγ-norm theorem. In addition, we propose two EMLCP-based models for classic tensor recovery problems, namely low-rank tensor completion (LRTC) and tensor robust principal component analysis (TRPCA), and design proximal alternate linearization minimization (PALM) algorithms to solve them individually. Furthermore, based on the Kurdyka-Łojasiewicz property, it is proved that the solution sequence of the proposed algorithm has a finite length and converges to the critical point globally. Finally, extensive experiments show that the proposed algorithm achieves good results, and it is confirmed that the MLCP function is indeed better than the Logarithmic function in the minimization problem, which is consistent with the analysis of theoretical properties.

Abstract:
As one of the major components of the neural network, the feature pyramid plays a vital part in perception tasks, such as object detection in autonomous driving. However, it is a challenge to fuse multi-level and multi-sensor feature pyramids for object detection. This paper proposes a simple yet effective framework named MuTrans (Multiple Transformers) to fuse feature pyramids in a single-stream 2D detector or a two-stream 3D detector. MuTrans, which is based on an encoder-decoder design, focuses on the significant features via multiple Transformers. The MuTrans encoder uses three innovative self-attention mechanisms: Spatial-wise BoxAlign attention (SB) for low-level spatial locations, Context-wise Affinity attention (CA) for high-level context information, and high-level attention for multi-level features. The MuTrans decoder then processes these significant proposals, including the RoI and context affinity. Besides, the Low and High-level Fusion (LHF) in the encoder reduces the number of computational parameters, and Pre-LN is utilized to accelerate the training convergence. LHF and Pre-LN are shown to reduce self-attention’s computational complexity and to mitigate slow training convergence. Our results demonstrate that MuTrans achieves higher detection accuracy than the baseline method, particularly in small object detection. MuTrans demonstrates a 2.1 higher detection accuracy on the AP_S index in small object detection on MS-COCO 2017 with a ResNeXt-101 backbone, a 2.18 higher 3D detection accuracy (moderate difficulty) for the small object class pedestrian on KITTI, and a 6.85 higher RC index (Town05 Long) on the CARLA urban driving simulator platform.

Abstract:
Volumetric (3D) ultrasound imaging using a 2D matrix array probe is increasingly developed for various clinical procedures. However, 3D ultrasound imaging suffers from motion artifacts due to tissue motions and a relatively low frame rate. Current Doppler-based motion compensation (MoCo) methods only allow 1D compensation in the in-range dimension. In this work, we propose a new 3D-MoCo framework that combines 3D velocity field estimation and a two-step compensation strategy for 3D diverging wave compounding imaging. Specifically, our framework exploits two constraints of a round-trip scan sequence of 3D diverging waves, i.e., Doppler and pair-wise optical flow, to formulate the estimation of the 3D velocity fields as a global optimization problem, which is further regularized by divergence-free and first-order smoothness constraints. The two-step compensation strategy first compensates for the 1D displacements in the in-range dimension and then for the 2D displacements in the two mutually orthogonal cross-range dimensions. Systematic in-silico experiments were conducted to validate the effectiveness of our proposed 3D-MoCo method. The results demonstrate that our 3D-MoCo method achieves higher image contrast, higher structural similarity, and better speckle patterns than the corresponding 1D-MoCo method. In particular, the 2D cross-range compensation is effective for fully recovering image quality.

Abstract:
Breast tumor segmentation of ultrasound images provides valuable information about tumors for early detection and diagnosis. Accurate segmentation is challenging due to low image contrast between areas of interest, speckle noise, and large inter-subject variations in tumor shape and size. This paper proposes a novel Multi-scale Dynamic Fusion Network (MDF-Net) for breast ultrasound tumor segmentation. It employs a two-stage end-to-end architecture with a trunk sub-network for multiscale feature selection and a structurally optimized refinement sub-network for mitigating impairments such as noise and inter-subject variation via better feature exploration and fusion. The trunk network is extended from UNet++ with a simplified skip pathway structure to connect the features between adjacent scales. Moreover, deep supervision at all scales, instead of only at the finest scale as in UNet++, is proposed to extract more discriminative features and mitigate errors from speckle noise via a hybrid loss function. Unlike previous works, the first stage is linked to a loss function of the second stage so that both the preliminary segmentation and refinement sub-networks can be refined together during training. The refinement sub-network utilizes a structurally optimized MDF mechanism to integrate preliminary segmentation information (capturing general tumor shape and size) at coarse scales and explores inter-subject variation information at finer scales. Experimental results on two public datasets show that the proposed method achieves better Dice and other scores than state-of-the-art methods. Qualitative analysis also indicates that our proposed network is more robust to tumor size and shape, speckle noise and heavy posterior shadows along tumor boundaries. An optional post-processing step is also proposed to help users mitigate segmentation artifacts. The efficiency of the proposed network is also illustrated on the “Electron Microscopy neural structures segmentation dataset”. It outperforms a state-of-the-art algorithm based on UNet-2022 with simpler settings. This indicates the advantages of our MDF-Nets in other challenging image segmentation tasks with small to medium data sizes.

Abstract:
Video coding algorithms attempt to minimize the significant commonality that exists within a video sequence. Each new video coding standard contains tools that can perform this task more efficiently than its predecessors. Modern video coding systems are block-based, wherein commonality modeling is carried out only from the perspective of the block that needs to be coded next. In this work, we argue for a commonality modeling approach that can provide a seamless blending between global and local homogeneity information in terms of motion. For this purpose, a prediction of the current frame, i.e., the frame that needs to be coded, is first generated by performing a two-step discrete cosine basis oriented (DCO) motion modeling. The DCO motion model is employed rather than the traditional translational or affine motion model since it can efficiently model complex motion fields by providing a smooth and sparse representation. Moreover, the proposed two-step motion modeling approach can yield better motion compensation at a reduced computational complexity since an informed guess is designed for initializing the motion search procedure. After that, the current frame is partitioned into rectangular regions and the conformance of these regions to the learned motion model is investigated. Depending on the non-conformance to the estimated global motion model, an additional DCO motion model is introduced to increase the local motion homogeneity. In this way, the proposed approach generates a motion compensated prediction of the current frame through the minimization of both global and local motion commonality. Experimental results show an improved rate-distortion performance of a reference high efficiency video coding (HEVC) encoder that employs the DCO prediction frame as a reference frame for encoding the current frame, specifically up to around 9% savings in bit rate. When compared to the more recent video coding standard, the versatile video coding (VVC) encoder, a bit rate savings of 2.37% is reported.

Abstract:
Cross-domain pedestrian detection aims to generalize pedestrian detectors from one label-rich domain to another label-scarce domain, which is crucial for various real-world applications. Most recent works focus on domain alignment to train domain-adaptive detectors either at the instance level or image level. From a practical point of view, one-stage detectors are faster. Therefore, we concentrate on designing a cross-domain algorithm for rapid one-stage detectors that lack instance-level proposals and can only perform image-level feature alignment. However, pure image-level feature alignment causes the foreground-background misalignment issue, i.e., the foreground features in the source domain image are falsely aligned with background features in the target domain image. To address this issue, we systematically analyze the importance of foreground and background in image-level cross-domain alignment, and learn that background plays a more critical role in image-level cross-domain alignment. Therefore, we focus on cross-domain background feature alignment while minimizing the influence of foreground features on the cross-domain alignment stage. This paper proposes a novel framework, namely, background-focused distribution alignment (BFDA), to train domain adaptive one-stage pedestrian detectors. Specifically, BFDA first decouples the background features from the whole image feature maps and then aligns them via a novel long-short-range discriminator. Extensive experiments demonstrate that compared to mainstream domain adaptation technologies, BFDA significantly enhances cross-domain pedestrian detection performance for either one-stage or two-stage detectors. Moreover, by employing the efficient one-stage detector (YOLOv5), BFDA can reach 217.4 FPS (640×480 pixels) on NVIDIA Tesla V100 (7 to 12 times the FPS of the existing frameworks), which is highly significant for practical applications. The code from this study will be made publicly available.

Abstract:
Near-infrared and visible face recognition (NIR-VIS) is attracting increasing attention because of the need to achieve face recognition in low-light conditions to enable 24-hour secure retrieval. However, annotating identity labels for a large number of heterogeneous face images is time-consuming and expensive, which limits the application of the NIR-VIS face recognition system to larger scale real-world scenarios. In this paper, we attempt to achieve NIR-VIS face recognition in an unsupervised domain adaptation manner. To get rid of the reliance on manual annotations, we propose a novel Robust cross-domain Pseudo-labeling and Contrastive learning (RPC) network which consists of three key components, i.e., NIR cluster-based Pseudo labels Sharing (NPS), Domain-specific cluster Contrastive Learning (DCL) and Inter-domain cluster Contrastive Learning (ICL). Firstly, NPS is presented to generate pseudo labels by exploring robust NIR clusters and sharing reliable label knowledge with VIS domain. Secondly, DCL is designed to learn intra-domain compact yet discriminative representations. Finally, ICL dynamically combines and refines intrinsic identity relationships to guide the instance-level features to learn robust and domain-independent representations. Extensive experiments are conducted to verify an accuracy of over 99% in pseudo label assignment and the advanced performance of RPC network on four mainstream NIR-VIS datasets.

Abstract:
Rate control plays an important role in video coding and has attracted considerable attention from researchers. However, the problems of human visual experience and buffer stability still remain. For scenes with drastic motion, part of the distortion can be masked due to the limitations of the Human Visual System (HVS), while buffers tend to suffer more overflow and underflow cases from the fluctuating bits. In this paper, we propose a novel joint rate control scheme for HEVC, composed of SUR-based perception modeling and SUR-based Perception-Buffer Rate Control (PBRC), to maximize human visual perception quality while preventing the underflow and overflow of buffers. First, to effectively model human visual quality, we introduce the perception-related Satisfied-User-Ratio (SUR) metric into the rate control process. Second, a time-efficient video quality prediction method called Fast Visual Multimethod Assessment Fusion (VMAF) Quality Prediction (FVQP) is designed for the generation of SUR curves within an affordable computational complexity. Third, a dual-objective optimization framework is established. By jointly conducting perception modeling and PBRC, we can flexibly adjust the optimization priority between human visual quality and buffer stability, and thus the quality of the reconstructed videos can be effectively improved because of the decrease in frame skipping. Experimental results demonstrate that the proposed joint rate control scheme improves the human visual experience when considering frame skipping and maintains buffer stability more effectively than existing methods.

Abstract:
Point-based 3D detection approaches usually suffer from the severe point sampling imbalance problem between foreground and background. We observe that prior works have attempted to alleviate this imbalance by emphasizing foreground sampling. However, even adequate foreground sampling may be extremely unbalanced between nearby and distant objects, yielding unsatisfactory performance in detecting distant objects. To tackle this issue, this paper first proposes a novel method named Distant Object Augmented Set Abstraction and Regression (DO-SA&R) to enhance distant object detection, which is vital for the timely response of decision-making systems like autonomous driving. Technically, our approach first designs DO-SA with novel distant object augmented farthest point sampling (DO-FPS) to emphasize sampling on distant objects by leveraging both object-dependent and depth-dependent information. Then, we propose distant object augmented regression to reweight all the instance boxes for strengthening regression training on distant objects. In practice, the proposed DO-SA&R can be easily embedded into the existing modules, yielding consistent performance improvements, especially on detecting distant objects. Extensive experiments are conducted on the popular KITTI, nuScenes and Waymo datasets, and DO-SA&R demonstrates superior performance, especially for distant object detection. Our code is available at https://github.com/mikasa3lili/DO-SAR.
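
The abstract above builds on farthest point sampling (FPS). As a point of reference, here is a minimal sketch of plain FPS in pure Python. It is not the authors' DO-FPS: the object-dependent and depth-dependent weighting is not specified in the abstract, so it is left out. The function name, the `start` parameter and the toy points are illustrative assumptions.

```python
def farthest_point_sampling(points, k, start=0):
    """Standard FPS: repeatedly pick the point farthest from the selected set.

    points: list of (x, y, z) tuples; k: number of samples to keep.
    Returns the indices of the selected points, in selection order.
    This is plain FPS, not DO-FPS from the abstract above.
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    n = len(points)
    selected = [start]
    # Squared distance from every point to its nearest selected point.
    min_dist = [sq_dist(p, points[start]) for p in points]
    while len(selected) < min(k, n):
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            d = sq_dist(points[i], points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Toy scene: three nearby points and two distant ones along the x axis.
points = [(0, 0, 0), (1, 0, 0), (0.1, 0, 0), (10, 0, 0), (5, 0, 0)]
indices = farthest_point_sampling(points, k=3)
print(indices)
```

Plain FPS spreads samples evenly in space, so objects covered by only a few points receive few samples; that is the imbalance the abstract sets out to reduce.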

Abstract:
As a key problem in autonomous-vehicle applications, Anomaly Obstacle Segmentation (AOS) aims to detect strange and unexpected obstacles (possibly never seen before) on the drivable area, thereby making the semantic perception model tolerant of unknown objects. Owing to its practicality, AOS has recently been drawing attention, and a long line of work has been proposed to tackle obstacles of almost infinite diversity. However, these methods usually pay little attention to the priors of driving scenarios and involve image re-generation or retraining of the perception model, which leads to a large computational cost or degraded perception performance. In this paper, we propose to pay more attention to the characteristics of driving scenarios, lowering the difficulty of this tricky task. A training-free, retrieval-based method is thereby proposed to distinguish road obstacles from the surrounding road texture by computing the cosine similarity of their appearance features; it significantly outperforms methods of the same category by around 20 percentage points. Besides, we find a deep relation between our method and the self-attention mechanism, and as a result a novel Transformer evolves from our retrieval-based method, further boosting performance.
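
The retrieval idea in the abstract above can be illustrated with a minimal sketch. It assumes that each pixel or patch has an appearance feature vector and that a small bank of road-texture features is available; a query that has no similar road feature receives a high anomaly score. The function names, the scoring rule (one minus the best cosine similarity) and the toy vectors are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def anomaly_scores(query_features, road_features):
    """Score each query by how poorly it matches its best road feature.

    Hypothetical scoring rule: 1 - max cosine similarity to the road bank.
    """
    scores = []
    for q in query_features:
        best = max(cosine_similarity(q, r) for r in road_features)
        scores.append(1.0 - best)
    return scores


# Toy features: the first query looks like road, the second does not.
road_bank = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]
queries = [[1.0, 0.05, 0.05], [0.0, 1.0, 0.0]]
scores = anomaly_scores(queries, road_bank)
print(scores)
```

No parameters are learned here, which matches the training-free setting described in the abstract; the quality of the result depends entirely on the appearance features that are fed in.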

Abstract:
Scene graph generation (SGG) and human-object interaction (HOI) detection are two important visual tasks aiming at localising and recognising relationships between objects, and interactions between humans and objects, respectively. Prevailing works treat these tasks as distinct tasks, leading to the development of task-specific models tailored to individual datasets. However, we posit that the presence of visual relationships can furnish crucial contextual and intricate relational cues that significantly augment the inference of human-object interactions. This motivates us to think if there is a natural intrinsic relationship between the two tasks, where scene graphs can serve as a source for inferring human-object interactions. In light of this, we introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection. Concretely, we initiate a relation Transformer tasked with generating relation triples from a suite of visual features. Subsequently, we employ another transformer-based decoder to predict human-object interactions based on the generated relation triples. A comprehensive series of experiments conducted across established benchmark datasets including Visual Genome, V-COCO, and HICO-DET demonstrates the compelling performance of our SG2HOI+ model in comparison to prevalent one-stage SGG models. Remarkably, our approach achieves competitive performance when compared to state-of-the-art HOI methods. Additionally, we observe that our SG2HOI+ jointly trained on both SGG and HOI tasks in an end-to-end manner yields substantial improvements for both tasks compared to individualized training paradigms.

Abstract:
Due to the imaging mechanism of time-of-flight (ToF) sensors, the captured depth images usually suffer from severe noise and degradation. Although many RGB-guided methods have been proposed for depth image enhancement in the past few years, the enhancement performance on real-world depth images is still largely unsatisfactory. Two main reasons are the complexity of realistic noise and degradation in depth images, and the difficulty in collecting noisy-clean pairs for supervised enhancement learning. This work aims to develop a self-supervised learning method for RGB-guided depth image enhancement, which does not require any noisy-clean pairs but can significantly boost the enhancement performance on real-world noisy depth images. To this end, we exploit the dependency between RGB and depth images to self-supervise the learning of the enhancement model. This is achieved by maximizing the cross-modal dependency between RGB and depth, so that the enhanced depth depends as much as possible on the RGB image of the same scene. Furthermore, we augment the cross-modal dependency maximization formulation based on optimal transport theory to achieve further performance improvement. Experimental results on both synthetic and real-world data demonstrate that our method can significantly outperform existing state-of-the-art methods on depth denoising, multi-path interference suppression, and hole filling. In particular, our method shows remarkable superiority over existing ones on real-world data in handling various realistic complex degradations. Code is available at https://github.com/wjcyt/SRDE.

Abstract:
Single image dehazing is a challenging and ill-posed problem due to severe information degeneration of images captured in hazy conditions. Remarkable progress has been achieved by deep-learning-based image dehazing methods, where residual learning is commonly used to separate the hazy image into clear and haze components. However, the low similarity between haze and clear components is commonly neglected, and the lack of a constraint on the contrastive peculiarity between the two components restricts the performance of these approaches. To deal with these problems, we propose an end-to-end self-regularized network (TUSR-Net) which exploits the contrastive peculiarity of different components of the hazy image, i.e., self-regularization (SR). Specifically, the hazy image is separated into clear and hazy components, and the constraint between different image components, i.e., self-regularization, is leveraged to pull the recovered clear image closer to the ground truth, which largely improves dehazing performance. Meanwhile, an effective triple unfolding framework combined with dual feature-to-pixel attention is proposed to intensify and fuse the intermediate information at the feature, channel and pixel levels, respectively, so that features with better representational ability can be obtained. Our TUSR-Net achieves a better trade-off between performance and parameter size with a weight-sharing strategy and is much more flexible. Experiments on various benchmarking datasets demonstrate the superiority of our TUSR-Net over state-of-the-art single image dehazing methods.

Abstract:
In this study, the problem of computing a sparse representation of multi-dimensional visual data is considered. In general, such data (e.g., hyperspectral images, color images or video data) consist of signals that exhibit strong local dependencies. A new computationally efficient sparse coding optimization problem is derived by employing regularization terms that are adapted to the properties of the signals of interest. Exploiting the merits of learnable regularization techniques, a neural network is employed to act as a structure prior and reveal the underlying signal dependencies. To solve the optimization problem, deep unrolling and deep equilibrium based algorithms are developed, forming highly interpretable and concise deep-learning-based architectures that process the input dataset in a block-by-block fashion. Extensive simulation results in the context of hyperspectral image denoising are provided, which demonstrate that the proposed algorithms significantly outperform other sparse coding approaches and exhibit superior performance against recent state-of-the-art deep-learning-based denoising models. In a wider perspective, our work provides a unique bridge between a classic approach, namely sparse representation theory, and modern representation tools that are based on deep learning modeling.

Abstract:
Advanced Siamese visual object tracking architectures are jointly trained using pair-wise input images to perform target classification and bounding box regression. They have achieved promising results in recent benchmarks and competitions. However, the existing methods suffer from two limitations: First, though the Siamese structure can estimate the target state in an instance frame, provided the target appearance does not deviate too much from the template, the detection of the target in an image cannot be guaranteed in the presence of severe appearance variations. Second, despite the classification and regression tasks sharing the same output from the backbone network, their specific modules and loss functions are invariably designed independently, without promoting any interaction. Yet, in a general tracking task, the centre classification and bounding box regression tasks are collaboratively working to estimate the final target location. To address the above issues, it is essential to perform target-agnostic detection so as to promote cross-task interactions in a Siamese-based tracking framework. In this work, we endow a novel network with a target-agnostic object detection module to complement the direct target inference, and to avoid or minimise the misalignment of the key cues of potential template-instance matches. To unify the multi-task learning formulation, we develop a cross-task interaction module to ensure consistent supervision of the classification and regression branches, improving the synergy of different branches. To eliminate potential inconsistencies that may arise within a multi-task architecture, we assign adaptive labels, rather than fixed hard labels, to supervise the network training more effectively. The experimental results obtained on several benchmarks, i.e., OTB100, UAV123, VOT2018, VOT2019, and LaSOT, demonstrate the effectiveness of the advanced target detection module, as well as the cross-task interaction, exhibiting superior tracking performance as compared with the state-of-the-art tracking methods.

Abstract:
In 3D face reconstruction, orthogonal projection has been widely employed as a substitute for perspective projection to simplify the fitting process. This approximation performs well when the camera is far enough from the face. However, in scenarios where the face is very close to the camera or moving along the camera axis, such methods suffer from inaccurate reconstruction and unstable temporal fitting due to the distortion under perspective projection. In this paper, we aim to address the problem of single-image 3D face reconstruction under perspective projection. Specifically, a deep neural network, Perspective Network (PerspNet), is proposed to simultaneously reconstruct the 3D face shape in canonical space and learn the correspondence between 2D pixels and 3D points, by which the 6DoF (6 Degrees of Freedom) face pose can be estimated to represent perspective projection. Besides, we contribute a large ARKitFace dataset to enable the training and evaluation of 3D face reconstruction solutions under perspective projection scenarios; it contains 902,724 2D facial images with ground-truth 3D face meshes and annotated 6DoF pose parameters. Experimental results show that our approach outperforms current state-of-the-art methods by a significant margin. The code and data are available at https://github.com/cbsropenproject/6dof_face.

Abstract:
Recently, memory-based methods have achieved remarkable progress in video object segmentation. However, the segmentation performance is still limited by error accumulation and redundant memory, primarily because of 1) the semantic gap caused by similarity matching and memory reading via heterogeneous key-value encoding; 2) the continuously growing and inaccurate memory through directly storing unreliable predictions of all previous frames. To address these issues, we propose an efficient, effective, and robust segmentation method based on Isogenous Memory Sampling and Frame-Relation mining (IMSFR). Specifically, by utilizing an isogenous memory sampling module, IMSFR consistently conducts memory matching and reading between sampled historical frames and the current frame in an isogenous space, minimizing the semantic gap while speeding up the model through an efficient random sampling. Furthermore, to avoid key information loss during the sampling process, we further design a frame-relation temporal memory module to mine inter-frame relations, thereby effectively preserving contextual information from the video sequence and alleviating error accumulation. Extensive experiments demonstrate the effectiveness and efficiency of the proposed IMSFR method. In particular, our IMSFR achieves state-of-the-art performance on six commonly used benchmarks in terms of region similarity & contour accuracy and speed. Our model also exhibits strong robustness against frame sampling due to its large receptive field.

Abstract:
Hyperspectral (HS) imaging has been widely used in various real application problems. However, due to hardware limitations, the obtained HS images usually have low spatial resolution, which can noticeably degrade their performance. By fusing a low spatial resolution HS image with a high spatial resolution auxiliary image (e.g., a multispectral, RGB or panchromatic image), so-called HS image fusion has underpinned much of the recent progress in enhancing the spatial resolution of HS images. Nonetheless, a well-registered auxiliary image is not always available in real situations. To remedy this issue, we propose in this paper a new single HS image super-resolution method based on a novel knowledge-driven deep unrolling technique. Precisely, we first propose a maximum a posteriori based energy model with implicit priors, which can be solved by alternating optimization to determine an elementary iteration mechanism. We then unroll this iteration mechanism with a Transformer-embedded convolutional recurrent neural network in which two structural designs are integrated. That is, the vision Transformer and 3D convolution learn the implicit spatial-spectral priors, and the recurrent hidden connections over iterations model the recurrence of the iterative reconstruction stages. Thus, an effective knowledge-driven, end-to-end and data-dependent HS image super-resolution framework is attained. Extensive experiments on three HS image datasets demonstrate the superiority of the proposed method over several state-of-the-art HS image super-resolution methods.

Abstract:
Binary neural networks (BNNs) provide a promising solution for deploying parameter-intensive deep single image super-resolution (SISR) models onto real devices with limited storage and computational resources. To achieve performance comparable to that of the full-precision counterpart, most existing BNNs for SISR focus on compensating for the information loss incurred by binarizing weights and activations in the network through better approximations to the binarized convolution. In this study, we revisit the difference between BNNs and their full-precision counterparts and argue that the key to good generalization performance of BNNs lies in preserving a complete full-precision information flow along with an accurate gradient flow passing through each binarized convolution layer. Inspired by this, we propose to introduce a full-precision skip connection, or a variant thereof, over each binarized convolution layer across the entire network, which can increase the forward expressive capability and the accuracy of the back-propagated gradient, thus enhancing the generalization performance. More importantly, such a scheme can be applied to any existing BNN backbone for SISR without introducing any additional computation cost. To validate the efficacy of the proposed approach, we evaluate it using four different backbones for SISR on four benchmark datasets and report clearly superior performance over existing BNNs and even some 4-bit competitors.

Abstract:
The efforts in compressive sensing (CS) literature can be divided into two groups: finding a measurement matrix that preserves the compressed information at its maximum level, and finding a robust reconstruction algorithm. In the traditional CS setup, the measurement matrices are selected as random matrices, and optimization-based iterative solutions are used to recover the signals. Using random matrices when handling large or multi-dimensional signals is cumbersome, especially when it comes to iterative optimizations. Recent deep learning-based solutions increase reconstruction accuracy while speeding up recovery, but jointly learning the whole measurement matrix remains challenging. For this reason, state-of-the-art deep learning CS solutions such as convolutional compressive sensing network (CSNET) use block-wise CS schemes to facilitate learning. In this work, we introduce a separable multi-linear learning of the CS matrix by representing the measurement signal as the summation of an arbitrary number of tensors. As compared to block-wise CS, tensorial learning eases blocking artifacts and improves performance, especially at low measurement rates (MRs), such as MRs < 0.1. The software implementation of the proposed network is publicly shared at https://github.com/mehmetyamac/GTSNET.

Abstract:
Color plays an important role in human visual perception, reflecting the spectrum of objects. However, the existing infrared and visible image fusion methods rarely explore how to handle multi-spectral/channel data directly and achieve high color fidelity. This paper addresses the above issue by proposing a novel method with diffusion models, termed Dif-Fusion, to generate the distribution of the multi-channel input data, which increases the ability of multi-source information aggregation and the fidelity of colors. Specifically, instead of converting multi-channel images into single-channel data as in existing fusion methods, we create the multi-channel data distribution with a denoising network in a latent space with forward and reverse diffusion processes. Then, we use the denoising network to extract the multi-channel diffusion features with both visible and infrared information. Finally, we feed the multi-channel diffusion features to the multi-channel fusion module to directly generate the three-channel fused image. To retain the texture and intensity information, we propose a multi-channel gradient loss and an intensity loss. Along with the current evaluation metrics for measuring texture and intensity fidelity, we introduce Delta E as a new evaluation metric to quantify color fidelity. Extensive experiments indicate that our method is more effective than other state-of-the-art image fusion methods, especially in color fidelity. The source code is available at https://github.com/GeoVectorMatrix/Dif-Fusion.
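
Delta E measures the perceptual difference between two colors in the CIELAB space. The abstract above does not say which Delta E formula is used, so the sketch below assumes the simplest variant, CIE76, which is the Euclidean distance between two (L*, a*, b*) triplets; the paper may use a later formula such as CIEDE2000. The function names and sample values are illustrative, and the inputs are assumed to be already converted to CIELAB.

```python
import math


def delta_e_cie76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in CIELAB space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lab1, lab2)))


def mean_delta_e(image1, image2):
    """Average CIE76 Delta E over two equally sized lists of Lab pixels."""
    diffs = [delta_e_cie76(p, q) for p, q in zip(image1, image2)]
    return sum(diffs) / len(diffs)


# Two Lab colors with the same lightness and slightly different chroma.
reference = (50.0, 10.0, 10.0)
fused = (50.0, 13.0, 14.0)
d = delta_e_cie76(reference, fused)
print(d)
```

A lower mean Delta E between the fused image and the visible image means the colors are preserved more faithfully.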

Abstract:
Denoising and demosaicking long-wave infrared (LWIR) division-of-focal-plane (DoFP) polarization images are crucial for various vision applications. However, existing methods rely on the sequential application of individual denoising and demosaicking processes, which may result in the accumulation of errors produced by each process. To address this issue, we propose a joint denoising and demosaicking method for LWIR DoFP images based on a three-stage progressive deep convolutional neural network. To ensure the generalization ability of this network, it is essential to have adequate training data that closely resembles real data. Therefore, we model the complex noise sources that affect LWIR DoFP images as mixed Poisson-Additive-Stripe noise and construct a least-squares problem based on the polarization measurement redundancy error to estimate the parameters of this model on real images. Subsequently, the estimated noise parameters are used to generate training data that enables the network to learn accurate polarization image statistics and improve its generalization ability. The experimental results demonstrate the effectiveness of the proposed method in enhancing the image restoration performance on real LWIR DoFP polarization data.

Abstract:
Facial age estimation has received a lot of attention for its diverse application scenarios. Most existing studies treat each sample equally and aim to reduce the average estimation error for the entire dataset, which can be summarized as General Age Estimation. However, due to the long-tailed distribution prevalent in the dataset, treating all samples equally will inevitably bias the model toward the head classes (usually adults, who account for the majority of samples). Driven by this, some works suggest that each class should be treated equally to improve performance in tail classes (with a minority of samples), which can be summarized as Long-tailed Age Estimation. However, Long-tailed Age Estimation usually faces a performance trade-off, i.e., achieving improvement in tail classes by sacrificing the head classes. In this paper, our goal is to design a unified framework that performs well on both tasks, killing two birds with one stone. To this end, we propose a simple, effective, and flexible training paradigm named GLAE, which is two-fold. First, we propose Feature Rearrangement (FR) and Pixel-level Auxiliary learning (PA) for better feature utilization to improve the overall age estimation performance. Second, we propose Adaptive Routing (AR) for selecting the appropriate classifier to improve performance in the tail classes while maintaining the head classes. Moreover, we introduce a new metric, named Class-wise Mean Absolute Error (CMAE), to equally evaluate the performance of all classes. Our GLAE provides a surprising improvement on Morph II, reaching the lowest MAE and CMAE of 1.14 and 1.27 years, respectively. Compared to the previous best method, MAE dropped by up to 34%, which is an unprecedented improvement, and for the first time, MAE is close to 1 year. Extensive experiments on other age benchmark datasets, including CACD, MIVIA, and Chalearn LAP 2015, also indicate that GLAE outperforms the state-of-the-art approaches significantly.
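
To make the difference between MAE and CMAE concrete, here is a minimal sketch. It assumes that each distinct ground-truth age is one class and that CMAE is the unweighted mean of the per-class MAEs; the abstract only states that CMAE evaluates all classes equally, so the exact definition in the paper may differ.

```python
from collections import defaultdict


def mae(y_true, y_pred):
    """Mean absolute error over all samples (every sample weighted equally)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cmae(y_true, y_pred):
    """Mean of per-class MAEs (every age class weighted equally)."""
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        errors[t].append(abs(t - p))
    per_class = [sum(v) / len(v) for v in errors.values()]
    return sum(per_class) / len(per_class)


# Four samples from a head class (age 30) and one from a tail class (age 70).
ages = [30, 30, 30, 30, 70]
preds = [31, 29, 30, 30, 60]
overall = mae(ages, preds)      # dominated by the head class
class_wise = cmae(ages, preds)  # the tail-class error counts as much as the head
```

In this toy case the single badly predicted tail sample raises CMAE far more than MAE, which is the imbalance the metric is meant to expose.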

Abstract:
The increasing demand for immersive experience has greatly promoted the quality assessment research of Light Field Image (LFI). In this paper, we propose an efficient deep discrepancy measuring framework for full-reference light field image quality assessment. The main idea of the proposed framework is to efficiently evaluate the quality degradation of distorted LFIs by measuring the discrepancy between reference and distorted LFI patches. Firstly, a patch generation module is proposed to extract spatio-angular patches and sub-aperture patches from LFIs, which greatly reduces the computational cost. Then, we design a hierarchical discrepancy network based on convolutional neural networks to extract the hierarchical discrepancy features between reference and distorted spatio-angular patches. Besides, the local discrepancy features between reference and distorted sub-aperture patches are extracted as complementary features. After that, the angular-dominant hierarchical discrepancy features and the spatial-dominant local discrepancy features are combined to evaluate the patch quality. Finally, the quality of all patches is pooled to obtain the overall quality of distorted LFIs. To the best of our knowledge, the proposed framework is the first patch-based full-reference light field image quality assessment metric based on deep-learning technology. Experimental results on four representative LFI datasets show that our proposed framework achieves superior performance as well as lower computational complexity compared to other state-of-the-art metrics.

Abstract:
As an important yet challenging task in Earth observation, change detection (CD) is undergoing a technological revolution, given the broadening application of deep learning. Nevertheless, existing deep learning-based CD methods still suffer from two salient issues: 1) incomplete temporal modeling, and 2) space-time coupling. In view of these issues, we propose a more explicit and sophisticated modeling of time and accordingly establish a pair-to-video change detection (P2V-CD) framework. First, a pseudo transition video that carries rich temporal information is constructed from the input image pair, interpreting CD as a problem of video understanding. Then, two decoupled encoders are utilized to spatially and temporally recognize the type of transition, and the encoders are laterally connected for mutual promotion. Furthermore, the deep supervision technique is applied to accelerate the model training. We illustrate experimentally that the P2V-CD method compares favorably to other state-of-the-art CD approaches in terms of both the visual effect and the evaluation metrics, with a moderate model size and relatively lower computational overhead. Extensive feature map visualization experiments demonstrate how our method works beyond making contrasts between bi-temporal images. Source code is available at https://github.com/Bobholamovic/CDLab.

Abstract:
Unifying object detection and re-identification (ReID) into a single network enables faster multi-object tracking (MOT), while this multi-task setting poses challenges for training. In this work, we dissect the joint training of detection and ReID from two dimensions: label assignment and loss function. We find previous works generally overlook them and directly borrow the practices from object detection, inevitably causing inferior performance. Specifically, we identify a qualified label assignment for MOT should: 1) have the assignment cost aware of ReID cost, not just detection cost; 2) provide sufficient positive samples for robust feature learning while avoiding ambiguous positives (i.e., the positives shared by different ground-truth objects). To achieve the above goals, we first propose Identity-aware Label Assignment, which jointly considers the assignment cost of detection and ReID to select positive samples for each instance without ambiguities. Moreover, we advance a novel Discriminative Focal Loss that integrates ReID predictions with Focal Loss to focus the training on the discriminative samples. Finally, we upgrade the strong baseline FairMOT with our techniques and achieve up to 7.0 MOTA / 54.1% IDs improvements on MOT16/17/20 benchmarks under favorable inference speed, which verifies our tailored label assignment and loss function for MOT are superior to those inherited from object detection.
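
For reference, the standard binary Focal Loss that the Discriminative Focal Loss builds on can be sketched as below. This is only the baseline formulation, FL(p_t) = -α_t (1 - p_t)^γ log(p_t); how ReID predictions are integrated is not described in the abstract and is therefore not shown. The default values of `alpha` and `gamma` are common choices, not values taken from this work.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Standard binary focal loss for one prediction.

    p is the predicted probability of the positive class, in (0, 1);
    target is 1 for a positive sample and 0 for a negative one.
    """
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights samples that are already well classified.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident, correct positive: small loss
hard = focal_loss(0.6, 1)  # uncertain positive: larger loss
```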

Abstract:
Weakly supervised object detection (WSOD) has received widespread attention since it requires only image-category annotations for detector training. Many advanced approaches solve this problem by a two-phase learning framework, that is, instance mining that classifies generated proposals via multiple instance learning, and instance refinement that iteratively refines bounding boxes using the supervision produced by the preceding stage. In this paper, we observe that the detection performance is usually limited by imprecise supervision, including part domination and untight boxes. To mitigate their adverse effects, we focus on selecting high-quality proposals as the supervision for WSOD. To be specific, for the issue of part domination, we propose bottom-up aggregated attention which incorporates low-level features from shallow layers to improve location representation of top-level features. In this manner, the proposals corresponding to entire objects can get high scores. Its advantage is that it can be flexibly plugged into the WSOD framework since there is no need to attach learnable parameters or learning branches. As regards the problem of untight boxes, we propose a phase-aware loss, which is the first work to measure supervision quality by the loss in the instance mining phase, to highlight correct boxes and suppress untight ones. In this work, we unify the proposed two modules into the framework of online instance classifier refinement. Extensive experiments on the PASCAL VOC and the MS COCO demonstrate that our method can significantly improve the performance of WSOD and achieve the state-of-the-art results. The code is available at https://github.com/Horatio9702/BUAA_PALoss.

Abstract:
The recent success of learning-based image rain and noise removal can be attributed primarily to well-designed neural network architectures and large labeled datasets. However, we discover that current image rain and noise removal methods result in low utilization of images. To alleviate the reliance of deep models on large labeled datasets, we propose the task-driven image rain and noise removal (TRNR) based on a patch analysis strategy. The patch analysis strategy samples image patches with various spatial and statistical properties for training and can increase image utilization. Furthermore, the patch analysis strategy encourages us to introduce the N-frequency-K-shot learning task for the task-driven approach TRNR. TRNR allows neural networks to learn from numerous N-frequency-K-shot learning tasks, rather than from a large amount of data. To verify the effectiveness of TRNR, we build a Multi-Scale Residual Network (MSResNet) for both image rain removal and Gaussian noise removal. Specifically, we train MSResNet for image rain removal and noise removal with a few images (for example, 20.0% train-set of Rain100H). Experimental results demonstrate that TRNR enables MSResNet to learn more effectively when data is scarce. TRNR has also been shown in experiments to improve the performance of existing methods. Furthermore, MSResNet trained with a few images using TRNR outperforms most recent deep learning methods trained data-driven on large labeled datasets. These experimental results have confirmed the effectiveness and superiority of the proposed TRNR. The source code is available on https://github.com/Schizophreni/MSResNet-TRNR.

Abstract:
Defocus blur detection (DBD), which aims to detect out-of-focus or in-focus pixels from a single image, has been widely applied to many vision tasks. To remove the dependence on abundant pixel-level manual annotations, unsupervised DBD has attracted much attention in recent years. In this paper, a novel deep network named Multi-patch and Multi-scale Contrastive Similarity (M2CS) learning is proposed for unsupervised DBD. Specifically, the predicted DBD mask from a generator is first exploited to re-generate two composite images by transporting the estimated clear and unclear areas from the source image to realistic full-clear and full-blurred images, respectively. To encourage these two composite images to be completely in-focus or out-of-focus, a global similarity discriminator is exploited to measure the similarity of each pair in a contrastive way, through which every two positive samples (two clear images or two blurred images) are pulled close together while every two negative samples (a clear image and a blurred image) are pushed apart. Since the global similarity discriminator only focuses on the blur level of a whole image, and some misdetected pixels cover only small areas, a set of local similarity discriminators is further designed to measure the similarity of image patches at multiple scales. Thanks to this joint global and local strategy, as well as the contrastive similarity learning, the two composite images are more efficiently moved to be all-clear or all-blurred. Experimental results on real-world datasets substantiate the superiority of our proposed method both in quantification and visualization. The source code is released at: https://github.com/jerysaw/M2CS.

Abstract:
In low-light environments, handheld photography suffers from severe camera shake under long exposure settings. Although existing deblurring algorithms have shown promising performance on well-exposed blurry images, they still cannot cope with low-light snapshots. Sophisticated noise and saturation regions are two dominating challenges in practical low-light deblurring: the former violates the Gaussian or Poisson assumption widely used in most existing algorithms and thus degrades their performance badly, while the latter introduces non-linearity to the classical convolution-based blurring model and makes the deblurring task even more challenging. In this work, we propose a novel non-blind deblurring method dubbed image and feature space Wiener deconvolution network (INFWIDE) to tackle these problems systematically. In terms of algorithm design, INFWIDE adopts a two-branch architecture, which explicitly removes noise and hallucinates saturated regions in the image space and suppresses ringing artifacts in the feature space, and integrates the two complementary outputs with a subtle multi-scale fusion network for high quality night photograph deblurring. For effective network training, we design a set of loss functions integrating a forward imaging model and backward reconstruction to form a closed-loop regularization to secure good convergence of the deep neural network. Further, to optimize INFWIDE’s applicability in real low-light conditions, a physical-process-based low-light noise model is employed to synthesize realistic noisy night photographs for model training. Taking advantage of the traditional Wiener deconvolution algorithm’s physically driven characteristics and deep neural network’s representation ability, INFWIDE can recover fine details while suppressing the unpleasant artifacts during deblurring. Extensive experiments on synthetic data and real data demonstrate the superior performance of the proposed approach.
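
The classical Wiener deconvolution that INFWIDE builds on can be illustrated with a small 1D example. The sketch below is the textbook frequency-domain filter X = conj(H) · Y / (|H|² + NSR) under a circular blur model, not the network described above; the naive DFT, the function names, and the constant noise-to-signal ratio `nsr` are assumptions made to keep the example dependency-free.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circular_blur(signal, kernel):
    """Circular convolution of a signal with a known blur kernel."""
    n = len(signal)
    return [sum(kernel[j] * signal[(i - j) % n] for j in range(len(kernel)))
            for i in range(n)]


def wiener_deconvolve(blurred, kernel, nsr):
    """Non-blind Wiener deconvolution with a constant noise-to-signal ratio."""
    n = len(blurred)
    padded = list(kernel) + [0.0] * (n - len(kernel))
    h_freq = dft(padded)
    y_freq = dft(blurred)
    x_freq = [h.conjugate() * y / (abs(h) ** 2 + nsr)
              for h, y in zip(h_freq, y_freq)]
    return [v.real for v in dft(x_freq, inverse=True)]


sharp = [0.0, 0.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0]
kernel = [0.6, 0.3, 0.1]
restored = wiener_deconvolve(circular_blur(sharp, kernel), kernel, nsr=1e-9)
```

With noise-free input and a tiny `nsr`, the filter approaches exact inversion; a larger `nsr` trades sharpness for noise suppression, which is where ringing and noise amplification become the practical concern.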

Abstract:
Comprehensive understanding of video content requires both spatial and temporal localization. However, a unified video action localization framework is still lacking, which hinders the coordinated development of this field. Existing 3D CNN methods take fixed and limited input length at the cost of ignoring temporally long-range cross-modal interaction. On the other hand, despite having large temporal context, existing sequential methods often avoid dense cross-modal interactions for complexity reasons. To address this issue, we propose a unified framework that handles the whole video sequentially, with long-range and dense visual-linguistic interaction, in an end-to-end manner. Specifically, a lightweight relevance filtering based transformer (Ref-Transformer) is designed, which is composed of relevance filtering based attention and temporally expanded MLP. The text-relevant spatial regions and temporal clips in video can be efficiently highlighted through the relevance filtering and then propagated among the whole video sequence with the temporally expanded MLP. Extensive experiments on three sub-tasks of referring video action localization, i.e., referring video segmentation, temporal sentence grounding, and spatiotemporal video grounding, show that the proposed framework achieves state-of-the-art performance in all referring video action localization tasks. The code is available at https://github.com/TJUMMG/SAW.

Abstract:
Deformable image registration plays a critical role in various tasks of medical image analysis. A successful registration algorithm, either derived from conventional energy optimization or deep networks, requires tremendous efforts from computer experts to well design registration energy or to carefully tune network architectures with respect to medical data available for a given registration task/scenario. This paper proposes an automated learning registration algorithm (AutoReg) that cooperatively optimizes both architectures and their corresponding training objectives, enabling non-computer experts to conveniently find off-the-shelf registration algorithms for various registration scenarios. Specifically, we establish a triple-level framework to embrace the searching for both network architectures and objectives with a cooperating optimization. Extensive experiments on multiple volumetric datasets and various registration scenarios demonstrate that AutoReg can automatically learn an optimal deep registration network for given volumes and achieve state-of-the-art performance. The automatically learned network also improves computational efficiency over the mainstream UNet architecture from 0.558 to 0.270 seconds for a volume pair on the same configuration.

Abstract:
In this paper, we present the first attempt at determining where the achievable rate-distortion (R-D) performance bound in versatile video coding (VVC) intra coding is when considering the mutual dependency in the rate-distortion optimization (RDO) process. In particular, the abundant search space of encoding parameters in VVC intra coding is practically explored with a beam search-based joint rate-distortion optimization (BSJRDO) scheme. As such, the partitioning, prediction and transform decisions are jointly optimized across different coding units (CUs) with a customized search subset instead of the full space. To make the beam search process implementation-friendly for VVC, the dependencies among the CUs are truncated at different depths. To facilitate finer computational scalability, the beam size is flexibly adjusted based on the characteristics of the CUs, such that the operational points that satisfy different complexity demands for diverse applications can be practically obtained. The proposed BSJRDO approach, which fully conforms to the VVC decoding syntax, can serve as both the way toward the optimal RDO bound and a practical performance-boosting solution. BSJRDO is further implemented on a VVC coding platform (VVC Test model (VTM) 12.0), and extensive experiments show that BSJRDO can achieve 1.30% and 3.22% bit rate savings compared to the VTM anchor under the common test condition and low-bit-rate coding scenarios, respectively. Moreover, the performance gain can also be flexibly customized with different computational overheads.
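
The beam search idea behind joint optimization across coding units can be sketched generically. The code below keeps the `beam_size` cheapest partial decision sequences at each unit, where the cost of a choice may depend on earlier choices. It is a toy illustration only: the candidate subsets, dependency truncation, and adaptive beam size of BSJRDO are not modeled, and the cost function and names are hypothetical.

```python
def beam_search(num_units, options, cost, beam_size):
    """Return (total_cost, choices) for the best sequence kept by the beam.

    cost(unit, choice, chosen) gives the cost of picking `choice` for `unit`
    given the tuple `chosen` of decisions already made for earlier units.
    """
    beam = [(0.0, ())]
    for unit in range(num_units):
        candidates = [
            (total + cost(unit, choice, chosen), chosen + (choice,))
            for total, chosen in beam
            for choice in options
        ]
        # Keep only the cheapest partial sequences.
        candidates.sort(key=lambda item: item[0])
        beam = candidates[:beam_size]
    return beam[0]


def toy_cost(unit, choice, chosen):
    """Toy dependent cost: the locally cheapest first choice hurts the next unit."""
    if unit == 0:
        return 1.0 if choice == "A" else 2.0
    return 10.0 if chosen[-1] == "A" else 1.0


greedy_cost, greedy_path = beam_search(2, ["A", "B"], toy_cost, beam_size=1)
joint_cost, joint_path = beam_search(2, ["A", "B"], toy_cost, beam_size=2)
```

With a beam of one, the search degenerates to independent greedy decisions and commits to the locally cheaper first choice; widening the beam lets it recover the jointly cheaper sequence at the price of evaluating more candidates.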

Abstract:
The combination of different sensory information to predict upcoming situations is an innate capability of intelligent beings. Consequently, various studies in the Artificial Intelligence field are currently being conducted to transfer this ability to artificial systems. Autonomous vehicles can particularly benefit from the combination of multi-modal information from the different sensors of the agent. This paper proposes a method for video-frame prediction that leverages odometric data. It can then serve as a basis for anomaly detection. A Dynamic Bayesian Network framework is adopted, combined with the use of Deep Learning methods to learn an appropriate latent space. First, a Markov Jump Particle Filter is built over the odometric data. This odometry model comprises a set of clusters. As a second step, the video model is learned. It is composed of a Kalman Variational Autoencoder modified to leverage the odometry clusters for focusing its learning attention on features related to the dynamic tasks that the vehicle is performing. We call the obtained overall model Cluster-Guided Kalman Variational Autoencoder. Evaluation is conducted using data from a car moving in a closed environment and leveraging a part of the University of Alcalá DriveSet dataset, where several drivers move in a normal and drowsy way along a secondary road.

Abstract:
In the past few years, deep learning-based methods have shown commendable performance for hyperspectral image (HSI) classification. Many works focus on designing independent spectral and spatial branches and then fusing the output features from the two branches for category prediction. In this way, the correlation that exists between spectral and spatial information is not completely explored, and spectral information extracted from one branch is never sufficient. Some studies also try to directly extract spectral-spatial features using 3D convolutions, but these suffer from a severe over-smoothing phenomenon and poor representation ability of spectral signatures. Unlike the above-mentioned approaches, in this paper, we propose a novel online spectral information compensation network (OSICN) for HSI classification, which consists of a candidate spectral vector mechanism, a progressive filling process, and a multi-branch network. To the best of our knowledge, this paper is the first to supplement spectral information into the network online while spatial features are being extracted. The proposed OSICN makes the spectral information participate in network learning in advance to guide spatial information extraction, which truly processes spectral and spatial features in HSI as a whole. Accordingly, OSICN is more reasonable and more effective for complex HSI data. Experimental results on three benchmark datasets demonstrate that the proposed approach achieves better classification performance than the state-of-the-art methods, even with a limited number of training samples.

Abstract:
The limited depth of field of optical lenses makes multi-focus image fusion (MFIF) algorithms of vital importance. Lately, Convolutional Neural Networks (CNNs) have been widely adopted in MFIF methods; however, their predictions mostly lack structure and are limited by the size of the receptive field. Moreover, since images contain noise from various sources, MFIF methods that are robust to image noise are required. A novel noise-robust Convolutional Neural Network-based Conditional Random Field (mf-CNNCRF) model is introduced. The model takes advantage of the powerful mapping between input and output of CNNs and the long-range interactions of CRF models in order to reach structured inference. Rich priors for both unary and smoothness terms are learned by training CNNs. The α-expansion graph-cut algorithm is used to reach structured inference for MFIF. A new dataset, which includes clean and noisy image pairs, is introduced and is used to train the networks of both CRF terms. A low-light MFIF dataset is also developed to demonstrate real-life noise introduced by the camera sensor. Qualitative and quantitative evaluations show that mf-CNNCRF outperforms state-of-the-art MFIF methods for clean and noisy input images, while being more robust to different noise types without requiring prior knowledge of noise.

Abstract:
Due to the quality degradation caused by different social media platforms and the arbitrary languages found in natural scenes, detecting text from social media images and transferring its style is challenging. This paper presents a novel end-to-end model for text detection and text style transfer in social media images. The key notion of the proposed work is to find dominant information, such as fine details in the degraded images (social media images), and then restore the structure of character information. Therefore, we first introduce a novel idea of extracting gradients from the frequency domain of the input image to reduce the adverse effect of different social media, which outputs text candidate points. The text candidates are further connected into components and used for text detection via a UNet++ like network with an EfficientNet backbone (EffiUNet++). Then, to deal with the style transfer issue, we devise a generative model, which comprises a target encoder and style parameter networks (TESP-Net) to generate the target characters by leveraging the recognition results from the first stage. Specifically, a series of residual mappings and a position attention module are devised to improve the shape and structure of generated characters. The whole model is trained end-to-end so as to optimize the performance. Experiments on our social media dataset and on benchmark datasets of natural scene text detection and text style transfer show that the proposed model outperforms the existing text detection and style transfer methods in multilingual and cross-language scenarios.

Abstract:
Semantic segmentation of remote sensing images aims to achieve pixel-level semantic category assignment for input images. This task has achieved significant advances with the rapid development of deep neural network. Most current methods mainly focus on effectively fusing the low-level spatial details and high-level semantic cues. Other methods also propose to incorporate the boundary guidance to obtain boundary preserving segmentation. However, current methods treat the multi-level feature fusion and the boundary guidance as two separate tasks, resulting in sub-optimal solutions. Moreover, due to the large inter-class difference and small intra-class consistency within remote sensing images, current methods often fail to accurately aggregate the long-range contextual cues. These critical issues make current methods fail to achieve satisfactory segmentation predictions, which severely hinder downstream applications. To this end, we first propose a novel boundary guided multi-level feature fusion module to seamlessly incorporate the boundary guidance into the multi-level feature fusion operations. Meanwhile, in order to further enforce the boundary guidance effectively, we employ a geometric-similarity-based boundary loss function. In this way, under the explicit guidance of boundary constraint, the multi-level features are effectively combined. In addition, a channel-wise correlation guided spatial-semantic context aggregation module is presented to effectively aggregate the contextual cues. In this way, subtle but meaningful contextual cues about pixel-wise spatial context and channel-wise semantic correlation are effectively aggregated, leading to spatial-semantic context aggregation. Extensive qualitative and quantitative experimental results on ISPRS Vaihingen and GaoFen-2 datasets demonstrate the effectiveness of the proposed method.

Abstract:
As a crucial application in privacy protection, scene text removal (STR) has received considerable attention in recent years. However, existing approaches, which coarsely erase text from images, ignore two important properties: the background texture integrity (BI) and the text erasure exhaustivity (EE). These two properties directly determine the erasure performance, and how to maintain them in a single network is the core problem for the STR task. In this paper, we attribute the lack of BI and EE properties to the implicit erasure guidance and imbalanced multi-stage erasure, respectively. To improve these two properties, we propose a new ProgrEssively Region-based scene Text eraser (PERT). There are three key contributions in our study. First, a novel explicit erasure guidance is proposed to enhance the BI property. Different from implicit erasure guidance, which modifies all the pixels in the entire image, our explicit one accurately performs stroke-level modification with only bounding-box level annotations. Second, a new balanced multi-stage erasure is constructed to improve the EE property. By balancing the learning difficulty and network structure among progressive stages, each stage takes an equal step towards the text-erased image to ensure the erasure exhaustivity. Third, we propose two new evaluation metrics called BI-metric and EE-metric, which make up for the shortcomings of current evaluation tools in analyzing BI and EE properties. Compared with previous methods, PERT outperforms them by a large margin in both BI-metric (↑6.13%) and EE-metric (↑1.9%), obtaining SOTA results with high speed (71 FPS) and at least 25% lower parameter complexity. Code will be available at https://github.com/wangyuxin87/PERT.