TMM2023

Abstract:
Multi-Panda Tracking (MPT) is a video-based tracking task for individual pandas, which supports the observation and measurement of the distribution and status of pandas. Unlike tracking general objects such as pedestrians and vehicles, MPT is extremely challenging due to the indistinguishable appearances and diverse postures of pandas. Existing tracking methods cannot adequately handle the heavy occlusion between panda individuals, and hence suffer from identity switches and missing or inaccurate detections. To address these problems, we propose a simple yet effective MPT framework in the tracking-by-detection paradigm, which benefits from both a short-term prediction filtering module and a discriminative feature learning network. In particular, the short-term prediction filtering module introduces similarity learning to enhance the temporal consistency among detections, which allows it to supplement missing detections and discard false positive detections. In addition, the discriminative feature learning network leverages a two-branch network to learn both local and global discriminative features, so as to distinguish panda individuals whose appearances differ only subtly. To evaluate the proposed method, we annotate a large-scale MPT dataset, named PANDA2021, which is particularly challenging due to the similar appearance of and severe occlusion between panda individuals. Experiments on PANDA2021 demonstrate that the proposed MPT method significantly outperforms the competing methods. Moreover, experimental results on the pedestrian tracking dataset MOT16 further demonstrate that the proposed MPT method achieves performance comparable to that of competing methods.

Abstract:
In this paper, we propose a radio-assisted human detection framework that incorporates radio information into state-of-the-art detection methods, including anchor-based one-stage detectors and two-stage detectors. We extract radio localization and identifier information from the radio signals to assist human detection, so that the problems of false positives and false negatives can be greatly alleviated. For both types of detectors, we revise the confidence scores based on the radio localization to improve the detection performance. For two-stage detection methods, we propose to use region proposals generated from radio localization rather than relying on a region proposal network (RPN). Moreover, with the radio identifier information, we also propose a non-max suppression method with a radio localization constraint to further suppress false detections and reduce missed detections. Experiments on the simulated Microsoft COCO dataset and Caltech pedestrian datasets show that the mean average precision (mAP) and the miss rate of state-of-the-art detection methods can be improved with the aid of radio information. Finally, we conduct experiments in real-world scenarios to demonstrate the feasibility of our proposed method in practice.
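
The confidence score revision mentioned above can be illustrated with a minimal sketch. The abstract does not give the actual revision rule, so everything here is an assumption made for illustration: the function name `revise_scores`, the `boost` and `penalty` factors, and the rule itself (raise the score of a box that contains a radio-localized point, lower it otherwise).

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Point = Tuple[float, float]              # radio-localized position projected to the image


def revise_scores(boxes: List[Box], scores: List[float], radio_points: List[Point],
                  boost: float = 1.2, penalty: float = 0.8) -> List[float]:
    """Hypothetical revision rule (not taken from the paper).

    A detection whose box contains at least one radio-localized point has its
    confidence multiplied by `boost` (capped at 1.0); all other detections are
    multiplied by `penalty`.
    """
    revised = []
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        supported = any(x1 <= px <= x2 and y1 <= py <= y2 for px, py in radio_points)
        revised.append(min(1.0, score * boost) if supported else score * penalty)
    return revised


if __name__ == "__main__":
    boxes = [(0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
    print(revise_scores(boxes, [0.5, 0.5], [(5.0, 5.0)]))
```

A multiplicative rule is used only because it keeps the ranking among radio-supported boxes unchanged; the paper may well use a different form.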

Abstract:
In existing generative adversarial networks (the standard GAN and its variants), the discriminator is trained to recognize real data as positive and generated data as negative. This positive-negative classification criterion ignores the fact that the discriminator is a non-objective evaluator, which means that the image quality evaluated by the discriminator may fluctuate during the whole training process. Considering this fact, we propose a novel GAN framework called Discriminator-Quality Evaluation GAN (DQE-GAN), which uses the discriminator outputs to evaluate image quality. By dynamically classifying images into high discriminator-quality and low discriminator-quality samples, every adversarial iteration step can be made more reasonable and objective. The convergence of the DQE-GAN framework can be theoretically proved. Through extensive experiments, we demonstrate the ability of DQE-GANs to generate better images faster and more stably.

Abstract:
Deep clustering has attracted considerable attention in various domains owing to its superior performance. However, previous deep clustering methods are guided by pre-specified clustering strategies that lack sustained exploration of data structures, which degrades the recognition of intrinsic patterns hidden in data. To address this challenge, deep reinforcement clustering (DRC) is proposed to learn an adaptive partition policy for pattern mining, which can fully explore the structural knowledge of data in an adaptive manner. DRC is defined as a Markov decision process of data partitions, which chooses the optimal cluster prototype for data by maximizing the cumulative reward over the state transitions of the environment. To implement the definition, a Bernoulli action prototype is devised to capture decision distributions in the transition of states, where the heavy-tailed Cauchy distribution precisely measures the structural divergences of data. Furthermore, a reward-maximizing policy is designed to guide sustained exploration of data structures, which ensures intra-cluster compactness and inter-cluster separation of data partitions. Finally, extensive experiments are conducted on eight benchmark datasets, and the results demonstrate that DRC outperforms the state-of-the-art baseline methods.
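
As a rough illustration of how a heavy-tailed Cauchy kernel can turn distances into Bernoulli-style decision probabilities, consider the sketch below. The abstract does not state the exact formulation, so the kernel 1 / (1 + d^2), the function name `cauchy_similarity`, and the use of the similarity directly as a Bernoulli parameter are all assumptions.

```python
import numpy as np


def cauchy_similarity(x: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Similarity of one sample to each prototype under a Cauchy kernel.

    Assumed form (not from the paper): 1 / (1 + squared Euclidean distance).
    The result lies in (0, 1], so each entry can serve as the success
    probability of a Bernoulli decision "assign x to prototype k".
    """
    squared_dist = np.sum((prototypes - x) ** 2, axis=1)
    return 1.0 / (1.0 + squared_dist)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prototypes = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
    probs = cauchy_similarity(np.array([0.2, 0.1]), prototypes)
    # Sample one Bernoulli decision per prototype from the kernel values.
    decisions = rng.random(probs.shape) < probs
    print(probs, decisions)
```

The heavy tail means that far-away prototypes keep a small but non-negligible probability, which is one common motivation for preferring this kernel over a Gaussian.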

Abstract:
Intent perception is a novel task that aims to understand the intention behind images. Regular classification methods usually perform unsatisfactorily on intent perception due to the semantic ambiguity problem, which comprises the intra-class variety problem, in which images of the same intent class may contain objects of different semantic categories, and the inter-class confusion problem, in which images of different intent classes may contain objects of similar semantic categories. To address this problem, this paper introduces prototype learning into intent perception and proposes a unified framework named PIP-Net to reduce the influence of semantic ambiguity. Specifically, for each intent class, we first filter out semantically ambiguous samples that are far away from the cluster center. Then we use the features of the remaining samples to generate prototypes via a clustering algorithm. In addition, we enhance the diversity between prototypes of different classes to better handle the inter-class confusion problem. To update the prototypes during training, we introduce a global matching algorithm to holistically match each feature with class prototypes, and use a momentum update strategy to stably update the prototypes. Experimental results on the Intentonomy dataset demonstrate that our method consistently outperforms the traditional classification paradigm across multiple baseline models, and verify the effectiveness of our proposed prototype learning paradigm in addressing the intent perception problem. Our proposed PIP-Net achieves a new state-of-the-art performance on Intentonomy, with a Macro F1 score of 31.57% and an average F1 score of 41.85%.
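
The momentum update of prototypes described above can be sketched as an exponential moving average. This is a minimal illustration rather than the PIP-Net implementation: the global matching step is not reproduced, and the `assignments` array (one prototype index per feature), the function name `momentum_update`, and the default `momentum` value are assumptions.

```python
import numpy as np


def momentum_update(prototypes: np.ndarray, features: np.ndarray,
                    assignments: np.ndarray, momentum: float = 0.9) -> np.ndarray:
    """Move each prototype toward the mean of the features matched to it.

    prototypes:  (K, D) current prototypes
    features:    (N, D) features from the current batch
    assignments: (N,) index of the prototype each feature was matched to
                 (assumed to come from some matching step, not shown here)
    """
    updated = prototypes.copy()
    for k in range(prototypes.shape[0]):
        matched = features[assignments == k]
        if matched.shape[0] == 0:
            continue  # no feature matched this prototype; leave it unchanged
        updated[k] = momentum * prototypes[k] + (1.0 - momentum) * matched.mean(axis=0)
    return updated


if __name__ == "__main__":
    protos = np.zeros((2, 2))
    feats = np.array([[1.0, 1.0], [3.0, 3.0]])
    print(momentum_update(protos, feats, np.array([0, 0]), momentum=0.5))
```

A momentum close to 1 keeps the prototypes stable against noisy batches, at the cost of slower adaptation.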

Abstract:
Homogeneous regions are smooth areas that lack blur cues for discriminating whether they are focused or defocused. They therefore pose a great challenge to achieving highly accurate multi-focus image fusion (MFIF). Fortunately, we observe that depth maps are highly related to focus and defocus, and carry considerable discriminative power for locating homogeneous regions. This offers the potential to provide additional depth cues to assist the MFIF task. Taking depth cues into consideration, in this paper we propose a new depth-distilled multi-focus image fusion framework, namely D2MFIF. In D2MFIF, a depth-distilled model (DDM) is designed to adaptively transfer depth knowledge into the MFIF task, gradually improving MFIF performance. Moreover, a multi-level fusion mechanism is designed to integrate multi-level decision maps from intermediate outputs to improve the final prediction. Visual and quantitative experimental results demonstrate the superiority of our method over several state-of-the-art methods.

Abstract:
Visible-infrared person re-identification (VI-ReID) aims to match person images between the visible and near-infrared modalities. Previous VI-ReID methods are based on holistic pedestrian images and achieve excellent performance. However, in real-world scenarios, images captured by visible and near-infrared cameras usually contain occlusions. The performance of these methods degrades significantly because occlusion causes discriminative feature information to be lost. We define visible-infrared person re-identification in such occluded scenes as Occluded VI-ReID, where only partial content of pedestrian images can be used to match images of different modalities from different cameras. In this paper, we propose a matching framework for occlusion scenes, which contains a local feature enhance module (LFEM) and a modality information fusion module (MIFM). LFEM adopts a Transformer to learn the features of each modality, and adjusts the importance of patches to enhance the representation ability of local features in the non-occluded areas. MIFM utilizes a co-attention mechanism to infer the correlation between images in order to reduce the difference between modalities. We construct two occluded VI-ReID datasets, namely Occluded-SYSU-MM01 and Occluded-RegDB. Our approach outperforms existing state-of-the-art methods on the two occlusion datasets, while maintaining top performance on two holistic datasets.

Abstract:
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition. We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction, and we use self-supervised discriminating mechanisms. As Second-order Pooling (SoP) is popular in image recognition, we employ its basic element-wise variant in our pipeline. The goal of multi-level feature design is to extract feature representations at different layer-wise levels of CNN, realizing several levels of visual abstraction to achieve robust few-shot learning. As SoP can handle convolutional feature maps of varying spatial sizes, we also introduce image inputs at multiple spatial scales into MlSo. To exploit the discriminative information from multi-level and multi-scale features, we develop a Feature Matching (FM) module that reweights their respective branches. We also introduce a self-supervised step, which is a discriminator of the spatial level and the scale of abstraction. Our pipeline is trained in an end-to-end manner. With a simple architecture, we demonstrate respectable results on standard datasets such as Omniglot, mini–ImageNet, tiered–ImageNet, Open MIC, fine-grained datasets such as CUB Birds, Stanford Dogs and Cars, and action recognition datasets such as HMDB51, UCF101, and mini–MIT.

Abstract:
This paper demonstrates human synthesis based on Radio Frequency (RF) signals, which leverages the fact that RF signals can record human movements through the signal reflections off the human body. Different from existing RF sensing works that can only perceive humans roughly, this paper aims to generate fine-grained optical human images by introducing a novel cross-modal RFGAN model. Specifically, we first build a radio system equipped with horizontal and vertical antenna arrays to transceive RF signals. Since the reflected RF signals are processed as obscure signal projection heatmaps on the horizontal and vertical planes, we design an RF-Extractor with RNN in RFGAN for RF heatmap encoding and combining to obtain the human activity information. Then we inject the information extracted by the RF-Extractor and RNN as the condition into the GAN using the proposed RF-based adaptive normalizations. Finally, we train the whole model in an end-to-end manner. To evaluate our proposed model, we create two cross-modal datasets (RF-Walk & RF-Activity) that contain thousands of optical human activity frames and corresponding RF signals. Experimental results show that RFGAN can generate target human activity frames using RF signals. To the best of our knowledge, this is the first work to generate optical images based on RF signals.

Abstract:
Stereo cameras are now commonly used in more and more devices. Nevertheless, visually unpleasant images captured under low-light conditions hinder their practical application. As an initial attempt at low-light stereo image enhancement, we propose a novel Dual-View Enhancement Network (DVENet) based on the Retinex theory, which consists of two stages. The first stage estimates an illumination map to obtain a coarse enhancement result, which boosts the correlation of two views, while the second stage recovers details by integrating the information from two views to achieve fine image quality improvement with the guidance of the illumination map. To fully utilize the dual-view correlation, we further design a wavelet-based view transfer module to efficiently carry out multi-scale detail recovery. Then, we design an illumination-aware attention fusion module to exploit the complementarity between the fused features from two views and the single-view features. Experiments on both synthetic and real-world stereo datasets demonstrate the superiority of our proposed method over existing solutions. The code and model are publicly available at: https://github.com/KevinJ-Huang/Stereo-Low-Light.

Abstract:
Downsampled sparse point clouds are beneficial for data transmission and storage, but they are detrimental for semantic tasks due to information loss. In this paper, we examine an upsampling methodology that significantly reconstructs sparse clouds’ semantic representations. Specifically, we propose a novel semantic point cloud upsampling (SPU) framework for sparse point cloud classification. An SPU consists of two networks, i.e. an upsampling network and a classification network. They are skillfully unified to intensify semantic representations acting on the upsampling process. In the upsampling network, we first propose a novel graph aggregation convolution to construct hierarchical relations on sparse point clouds. To enhance stability and diversity during point upsampling, we then combine point shuffling and pre-interpolation technologies to build an enhanced upsampling module. Furthermore, we adopt the semantic prior information provided by a sparse point cloud to enhance its upsampling quality. The prior information is applied to an attention mechanism that can highlight key positions of the point cloud. We investigate different loss functions and conduct experiments on classical deep point networks, which effectively demonstrate the promising performance of our framework.

Affiliations: School of Computer Science, Guangdong University of Petrochemical Technology, Maoming, China; Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland; School of Software Engineering, Jinling Institute of Technology, Nanjing, China; School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; Department of Information Engineering and Computer Science (DISI), University of Trento, Trento, Italy; Department of Computer Science, Illinois Institute of Technology, Chicago, IL, USA

Abstract:
In this paper, we tackle the problem of synthesizing a ground-view panorama image conditioned on a top-view aerial image, which is a challenging problem due to the large gap between the two image domains with different viewpoints. Instead of learning cross-view mapping in a feedforward pass, we propose a novel adversarial feedback GAN framework named PanoGAN with two key components: an adversarial feedback module and a dual branch discrimination strategy. First, the aerial image is fed into the generator to produce a target panorama image and its associated segmentation map, which benefits model training with layout semantics. Second, the feature responses of the discriminator encoded by our adversarial feedback module are fed back to the generator to refine the intermediate representations, so that the generation performance is continually improved through an iterative generation process. Third, to pursue high fidelity and semantic consistency in the generated panorama image, we propose a pixel-segmentation alignment mechanism under the dual branch discrimination strategy to facilitate cooperation between the generator and the discriminator. Extensive experimental results on two challenging cross-view image datasets show that PanoGAN enables high-quality panorama image generation with more convincing details than state-of-the-art approaches. The source code and trained models are available at https://github.com/sswuai/PanoGAN.

Abstract:
We focus on music generation conditioned on human emotions, specifically positive and negative emotions. There are no existing large-scale music datasets annotated with human emotion labels. It is thus not obvious how to generate music conditioned on emotion labels. In this paper, we propose an annotation-free method to build a new dataset where each sample is a triplet of lyric, melody and emotion label (without requiring any manual labor). Specifically, we first train an automated emotion recognition model using BERT (pre-trained on the GoEmotions dataset) on the Edmonds Dance dataset. We use it to automatically “label” the music with the emotion labels recognized from the lyrics. We then train an encoder-decoder based model to generate emotional music on that dataset, and call our overall method the Emotional Lyric and Melody Generator (ELMG). The ELMG framework consists of three modules: 1) an encoder-decoder model trained end-to-end to generate lyric and melody; 2) a music emotion classifier trained on labeled data (our proposed dataset); and 3) a modified beam search algorithm that guides the music generation process by incorporating the music emotion classifier. We conduct objective and subjective evaluations on the generated music pieces, and our results show that ELMG is capable of generating tuneful lyric and melody with specified human emotions.

Abstract:
Convolutional neural networks (CNNs) have shown attractive performance for stereo matching. However, the spatially shared convolution weights of CNN-based methods usually face a dilemma: the convolution weights suitable for aggregating contextual information in smooth regions often blur local matching details of textured regions, and vice versa. This paper tries to find a way out of the dilemma via a novel region separable stereo matching (RSSM) method, which is universally applicable to CNN stereo models based on 4D cost volumes and can greatly improve the accuracy and efficiency of existing models. The key idea of our method is to automatically group image pixels into regions according to the gradients, and then construct and process the respective cost volume of each region separately. To perform cost aggregation, we propose a two-stage network consisting of regional grouping aggregation (RGA) and regional fusion aggregation (RFA). In RGA, convolutions are grouped channel-wise, and each group of convolutions learns dedicated weights for the corresponding region via regional supervision. Through RGA, each group of convolutions can extract the most representative features from the corresponding region. In RFA, we combine the matching clues of all convolution groups from RGA to output the final prediction map. We further extend the idea of regional grouping to feature extraction and modify the skip connection in aggregation networks to better adapt our method to stereo matching models. Experimental results on five public datasets show that our method can significantly improve several state-of-the-art 3D CNN based stereo models.
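
The idea of grouping pixels into regions according to image gradients can be sketched as follows. This is only an illustration of the general idea: the abstract does not specify how many regions are used or how they are delimited, so the two-way split (smooth versus textured), the fixed `threshold`, and the function name `group_by_gradient` are assumptions.

```python
import numpy as np


def group_by_gradient(image: np.ndarray, threshold: float = 0.1) -> np.ndarray:
    """Label each pixel of a grayscale image as smooth (0) or textured (1).

    The gradient is estimated with finite differences (np.gradient) and the
    pixel is called "textured" when the gradient magnitude exceeds `threshold`.
    """
    grad_rows, grad_cols = np.gradient(image.astype(np.float64))
    magnitude = np.hypot(grad_rows, grad_cols)
    return (magnitude > threshold).astype(np.int64)


if __name__ == "__main__":
    # A synthetic image with a vertical step edge between columns 2 and 3.
    img = np.zeros((4, 6))
    img[:, 3:] = 1.0
    labels = group_by_gradient(img)
    print(labels)
    # A per-region mask like this could then be used to route pixels to
    # separate cost volumes, as the abstract describes at a high level.
```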

Abstract:
Reflections often degrade the quality of images by obstructing the background scenes. This is not desirable for everyday users, and it negatively impacts the performance of multimedia applications that process images with reflections. Most current methods for removing reflections utilize supervised learning models. These models require an extensive number of image pairs of the same scenes with and without reflections to perform well. However, collecting such image pairs is challenging and costly. Thus, most current supervised models are trained on small datasets that cannot cover the numerous possibilities of real-life images with reflections. In this paper, we propose an unsupervised method for single-image reflection removal. Instead of learning from a large dataset, we optimize the parameters of two cross-coupled deep convolutional neural networks on a target image to generate two exclusive background and reflection layers. In particular, we design a network model that embeds semantic features extracted from the input image and utilizes these features in the separation of the background layer from the reflection layer. We show through objective and subjective studies on benchmark datasets that the proposed method substantially outperforms current methods in the literature. The proposed method does not require large datasets for training, removes reflections from single images, and does not impose impractical constraints on the input images.

Abstract:
Camouflage is a common visual phenomenon, which refers to hiding foreground objects in background images, making them briefly invisible to the human eye. Previous work has typically been implemented by an iterative optimization process. However, these methods struggle with 1) efficiently generating camouflage images using foregrounds and backgrounds with flexible structure; and 2) camouflaging foreground objects into regions with multiple appearances (e.g. the junction of the vegetation and the mountains), which limits their practical application. To address these problems, this paper proposes a novel Location-free Camouflage Generation Network (LCG-Net) that fuses high-level features of the foreground and background images and generates the result in a single inference. Specifically, a Position-aligned Structure Fusion (PSF) module is devised to guide structure feature fusion based on the point-to-point structure similarity of foreground and background, and to introduce local appearance features point by point. To retain the necessary identifiable features, a new immerse loss is adopted in our pipeline, while a background patch appearance loss is utilized to ensure that the hidden objects look continuous and natural in regions with multiple appearances. Experiments show that, in single-appearance regions, our results are as satisfactory as those of the state-of-the-art while being less likely to be completely invisible, and that in multi-appearance regions they far exceed the quality of the state-of-the-art. Moreover, our method is hundreds of times faster than previous methods. Benefiting from the unique advantages of our method, we provide some downstream applications for camouflage generation, which show its potential. The related code and dataset will be released at https://github.com/Tale17/LCG-Net.

Abstract:
Given an untrimmed video and a language query, Video Temporal Grounding (VTG) aims to locate the time interval in the video that is semantically relevant to the query. Existing fully-supervised VTG methods require accurate annotations of temporal boundaries, which are time-consuming and expensive to obtain. On the other hand, weakly-supervised VTG methods, where only paired videos and queries are available during training, lag far behind the fully-supervised ones. In this paper, we introduce point supervision to narrow the performance gap at an affordable annotation cost and propose a novel method dubbed Point-Supervised Video Temporal Grounding (PS-VTG). Specifically, an attention-based grounding network is first employed to obtain a language activation sequence (LAS). Then a pseudo segment-level label is generated based on the LAS and the given point supervision to assist the training process. In addition, multi-level distribution calibration and cross-modal contrast are designed to obtain discriminative feature representations and precisely highlight the language-relevant video segments. Experiments on three benchmarks demonstrate that our method trained with point supervision can significantly outperform weakly-supervised approaches and achieve performance comparable to that of fully-supervised ones.

Abstract:
Image captioning is a challenging task that generates a natural language description based on the visual understanding of the given image. Significant region representation is a milestone in image captioning. Despite the great success of existing region-based works, they only focus on the salient objects and encode these objects independently, and are thus still plagued by the lack of global contextual information and visual relationships. In fact, global contextual information and structured visual relationships are exactly the merits of traditional grid features and emerging scene graph features. In this paper, we present a Triple-Steam Feature Fusion Network (TSFNet) to leverage the complementary advantages of the grid, region, and scene graph triple-steam visual representations in image captioning. Concretely, in our TSFNet, a novel Dual-level Attention (DA) mechanism is proposed to explore, in a uniform way, both the visual intrinsic properties and the word-related attributes of the different features. Then the attention-enhanced features of different modalities are mapped into a joint representation to guide the caption generation. Moreover, we design a new global-aware decoder, which leverages the concatenated representation of triple-steam features and the joint attention representation to obtain global visual guidance information, further refining the complex multimodal reasoning. To verify the effectiveness of our feature fusion model, we perform extensive experiments on the highly competitive MSCOCO dataset to evaluate the model quantitatively and qualitatively. The results illustrate that the proposed framework outperforms many state-of-the-art image captioning approaches on various evaluation metrics, and generates more accurate and richer captions.

Abstract:
Fine-grained image retrieval has been extensively explored in a zero-shot manner: a deep model is trained on the seen part and its generalization performance is then evaluated on the unseen part. However, this setting is infeasible for many real-world applications since (1) the retrieval dataset can be non-fixed, with new data added constantly, and (2) data samples of the seen categories are also common in practice and are important for evaluation. In this paper, we explore lifelong fine-grained image retrieval (LFGIR), which learns continuously on a sequence of new tasks with data from different datasets. We first use knowledge distillation to minimize catastrophic forgetting on old tasks. Training continuously on different datasets causes large domain shifts between the old and new tasks, while image retrieval is sensitive to even small shifts in the features. This tends to weaken the effectiveness of knowledge distillation by the frozen teacher. To mitigate the impact of domain shifts, we use the network inversion method to generate images of the old tasks. In addition, we design an on-the-fly teacher which transfers knowledge captured on a new task to the student to improve generalization performance, thereby achieving a better balance between old and new tasks in the end. We name the whole framework Dual Knowledge Distillation (DKD), whose efficacy is demonstrated by extensive experimental results on sequential tasks including seven datasets.

Abstract:
Deep-learning-based watermarking is being extensively studied. Most existing approaches adopt a similar encoder-driven scheme, which we name the END (Encoder-NoiseLayer-Decoder) architecture. In this paper, we revamp the architecture and design a decoder-driven watermarking network dubbed De-END, which greatly outperforms the existing END-based methods. The motivation for designing De-END originates from a potential drawback we discovered in the END architecture: the encoder may embed redundant features that are not necessary for decoding, limiting the performance of the whole network. We conducted a detailed analysis and found that this limitation is caused by unsatisfactory coupling between the encoder and decoder in END. De-END addresses this drawback by adopting a Decoder-Encoder-NoiseLayer-Decoder architecture. In De-END, the host image is first processed by the decoder to generate a latent feature map instead of being directly fed into the encoder. This latent feature map is concatenated with the original watermark message and then processed by the encoder. This change in design is crucial, as it makes the encoder and decoder share features directly, so that the two are better coupled. We conducted extensive experiments, and the results show that this framework outperforms the existing state-of-the-art (SOTA) END-based deep learning watermarking methods in both visual quality and robustness. With the same decoder structure, the visual quality (measured by PSNR) of De-END improves by 1.6dB (45.16dB to 46.84dB), and the extraction accuracy after JPEG compression (QF=50) distortion improves by more than 4% (94.9% to 99.1%).

Abstract:
Automated surveillance is widely adopted for applications such as traffic monitoring, vehicle identification, etc. However, various weather degradation factors such as rain and snow streaks, along with the atmospheric veil, severely affect the perceptual quality of an image, eventually affecting the performance of these applications. There exist weather-specific (rain, haze, snow, etc.) methods, each focusing on its respective restoration task. As image restoration is a preprocessing step for high-level surveillance applications, it is impractical to maintain different architectures for different weather restoration tasks. In this paper, we propose a lightweight unified network, having 1.1 M parameters (1/40th and 1/6th of the existing rain with veil removal and snow with veil removal methods, respectively), for the removal of rain and snow along with the veiling effect present in the images. In this network, we propose two parallel streams to handle the degradations and restoration. First, the degradation removal stream (DRS) focuses mainly on removing randomly repeating degradations, i.e., rain and snow streaks, through the proposed adaptive multi-scale feature sharing block (AMFSB) and stage-wise subtractive block (SSB). Second, the feature corrector stream (FCS) mainly focuses on refining the partial outputs of the first stream and reducing the veiling effect, and acts as a supplement to the first stream. Finally, we leverage contrastive regularization for better convergence of the proposed network. Substantial experiments on synthetic as well as real-world images, along with extensive ablation studies, demonstrate that the proposed method performs competitively with the existing methods for multi-weather image restoration. The code is available at: https://github.com/AshutoshKulkarni4998/UVRNet.

Abstract:
Generative adversarial networks (GANs) have demonstrated superior performance in image generation. In recent years, numerous improvements have been made to the network structures and learning theory related to GANs. Among these improvement techniques, asymmetric training of the generator and discriminator networks has been widely adopted. For example, batch normalization is used in the generator while spectral normalization is used in the discriminator, or different learning rates are used for the generator and discriminator. However, asymmetric training on the real and generated samples has not been taken into consideration until now. In this paper, we propose a novel asymmetric training-based RealnessGAN (ATRGAN), which applies the idea of asymmetric training to both samples and networks. Specifically, asymmetric training on samples refers to performing differential learning on the real and generated samples by controlling the information entropies of the real and fake anchor distributions. Asymmetric training on networks is realized via the sampling transmission G2D, which abandons the commonly used independent random sampling. With the help of G2D, the discriminator obtains a more dominant training position than the generator, which ensures that the discriminator can guide the generator more effectively during training. In addition, we propose the floating anchor distribution technique and construct the objective function of the generator for ATRGAN. Through comparative experiments, we demonstrate that ATRGAN achieves better generation performance than various SOTA GANs on the CIFAR-10, CAT, and CelebA-HQ datasets.

Abstract:
Irregular-shaped texts bring challenges to Scene Text Detection (STD). Although existing regression-based approaches achieve comparable performances, they fail to cover some highly curved ribbon-like text lines. Inspired by morphology, we found that the leaf vein can easily cover various geometries. Specifically, lateral and thin veins are emitted to margin along main vein gradually with the leaf growth. This process can decompose a concave object into consecutive convex regions, which are easier to fit. Hence, the leaf vein is suitable for representing highly curved texts. Considering the aforementioned advantage, we design a leaf vein-based text representation method (LVT), where text contour is treated as leaf margin and represented through main, lateral, and thin veins. We further construct a detection framework based on LVT, namely LeafText. In the text reconstruction stage, LeafText simulates the leaf growth process to rebuild text contours. It grows main veins in Cartesian coordinates to locate texts roughly at first. Then, lateral and thin veins are generated along the main vein growth direction in polar coordinates. They are responsible for generating the coarse contour and refining it, respectively. Meanwhile, Multi-Oriented Smoother (MOS) is designed to smooth the main vein for ensuring reliable growth directions of lateral and thin veins. Additionally, a global incentive loss is proposed to enhance the predictions of lateral and thin veins. Ablation experiments demonstrate LVT can fit irregular-shaped texts precisely and verify the effectiveness of MOS and global incentive loss. Comparisons show that LeafText is superior to existing state-of-the-art (SOTA) methods on MSRA-TD500, CTW1500, Total-Text, and ICDAR2015 datasets.

Abstract:
Deep neural networks (DNNs) are vulnerable to adversarial attacks which can fool the classifiers by adding small perturbations to the original example. The added perturbations in most existing attacks are mainly determined by the gradient of the loss function with respect to the current example. In this paper, a new average gradient-based adversarial attack is proposed. In our proposed method, via utilizing the gradient of each iteration in the past, a dynamic set of adversarial examples is constructed first in each iteration. Then, according to the gradient of the loss function with respect to all the examples in the constructed dynamic set and the current adversarial example, the average gradient can be calculated, which is used to determine the added perturbations. Different from the existing adversarial attacks, the proposed average gradient-based attack optimizes the added perturbations through a dynamic set of adversarial examples, where the size of the dynamic set increases with the number of iterations. Our proposed method possesses good extensibility and can be integrated into most existing gradient-based attacks. Extensive experiments demonstrate that, compared with the state-of-the-art gradient-based adversarial attacks, the proposed attack can achieve higher attack success rates and exhibit better transferability, which is helpful to evaluate the robustness of the network and the effectiveness of the defense method.

Abstract:
Violence detection in videos is very promising in practical applications due to the emergence of massive numbers of videos in recent years. Most previous works define violence detection as a simple video classification task and use a single modality, e.g., the visual signal, from small-scale datasets. However, such solutions are insufficient. To mitigate this problem, we study weakly supervised violence detection on large-scale audio-visual violence data, and first introduce two complementary tasks, i.e., coarse-grained violent frame detection and fine-grained violent event detection, to advance simple violence video classification to frame-level violent event localization, which aims to accurately locate violent events in untrimmed videos. We then propose a novel network that takes audio-visual data as input and contains three parallel branches to capture different relationships among video snippets and further integrate features, where the similarity branch and the proximity branch capture long-range dependencies using a similarity prior and a proximity prior, respectively, and the score branch dynamically captures the closeness of predicted scores. In both coarse-grained and fine-grained tasks, our approach outperforms other state-of-the-art approaches on two public datasets. Moreover, experimental results also show the positive effect of audio-visual input and relationship modeling.

Abstract:
Multi-task pixel-level learning, which aims to exploit the inter-task interactions to improve the learning of each task, is an important but challenging issue in visual perception and multimedia applications. Measuring the inter-task correlation and intra-task specificity, we propose a tube-embedded transformer (TET) framework for robust multi-task pixel prediction. To facilitate inter-task interactions, we aggregate and project all tasks into a shared tube pool to generate the latent multi-task representation during the coarse-to-fine decoding stages. The resulting task-tube interactions replace the two-by-two task-task interactions to reduce the model complexity significantly. In addition, we introduce the transformer mechanism to adaptively transfer tube features to the target task. Concretely, on the one hand, multi-task features aggregate in the tube to generate the shared feature representation bases; on the other hand, based on the task-tube association and complementarity, the tube outputs the query entry and the weighting coefficients of the target task. Experimentally, on the joint learning of semantic segmentation, depth estimation, and surface normal estimation, the comparison experiments show the superiority of the TET multi-task learning method over other state-of-the-art approaches, and the ablation experiments verify the effectiveness of the TET mechanism.

Abstract:
Image stitching usually relies on spatial transformations to perform the overlap alignment and distortion mitigation. This paper presents a manifold optimization method to seek these transformations. The purpose is not to present a new formulation of image stitching, as the proposed method uses common transformations such as homography to align feature correspondences in the overlap and similarity transformations to preserve the shape. Instead, the proposed method is based on a new treatment of these transformations as elements of a prescribed matrix manifold. Its advantage lies in its more effective and efficient optimization in the manifold domain. Specifically, spatially varying homographies are computed by an efficient second-order minimization (ESM) of the geometric error of aligning feature correspondences, but with their intrinsic manifold parameterization. To mitigate the distortion, the interpolation between homography and similarity transformation is performed on a general matrix manifold. These on-manifold operations improve the stitching quality with fewer ghosting and distortion artifacts. The experiments show our manifold optimization for image stitching outperforms other methods.

Abstract:
Learning representations for multimedia content is critical for multimedia recommendation. Current representation learning methods roughly fall into two groups: (1) using the historical interactions to create ID embeddings of users and items, and (2) treating multi-modal data as the side information of items to enrich their ID embeddings. Each user-item interaction offers a supervisory signal to optimize the representation learning under the traditional supervised learning paradigm. Because they overlook the multi-modal patterns hidden in the data (e.g., the co-occurrence of visual, acoustic, and textual features in micro-videos a user has watched before, and her behavioral features), these methods are insufficient to create powerful representations and obtain satisfactory recommendation accuracy. To capture multi-modal patterns in the data itself, we go beyond the supervised learning paradigm and incorporate the idea of self-supervised learning (SSL) into multimedia recommendation. Specifically, SSL consists of two components: (1) data augmentation upon multi-modal contents, where we design three operators, namely feature dropout (FD), feature masking (FM), and feature fine and coarse spaces (FAC), to generate multiple views of individual items; and (2) contrastive learning, which differentiates the views of an item from those of other items to distill additional supervisory signals. Clearly, SSL enables us to explore and exhibit the underlying relations among modalities, thereby resulting in powerful representations. We denote the generic framework by Self-supervised Learning-guided Multimedia Recommendation (SLMRec). Extensive experiments are performed on three real-world datasets, showing that SLMRec achieves significant improvements over several state-of-the-art baselines such as LightGCN [1] and MMGCN [2]. Further analysis shows how SSL affects recommendation performance.
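
As a rough illustration of the augmentation step, the sketch below generates two views of one item feature vector. It assumes that feature dropout zeroes individual elements at random and that feature masking zeroes a randomly chosen set of dimensions; the abstract does not define the operators precisely, so both function bodies, the rates, and the example vector are assumptions.

```python
import random

def feature_dropout(vec, p, rng):
    """Zero each element independently with probability p (assumed FD behaviour)."""
    return [0.0 if rng.random() < p else x for x in vec]

def feature_masking(vec, num_masked, rng):
    """Zero a randomly chosen set of `num_masked` dimensions (assumed FM behaviour)."""
    masked = set(rng.sample(range(len(vec)), num_masked))
    return [0.0 if i in masked else x for i, x in enumerate(vec)]

rng = random.Random(0)  # fixed seed so the two views are reproducible
item_feature = [0.3, -1.2, 0.8, 0.5, -0.1, 2.0]  # hypothetical modality feature

view_a = feature_dropout(item_feature, p=0.3, rng=rng)
view_b = feature_masking(item_feature, num_masked=2, rng=rng)
```

A contrastive objective would then treat `view_a` and `view_b` as a positive pair and views of other items as negatives.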

Abstract:
Visual tracking is a visual task that tracks a specific target given only its location and size in the first frame. To penalize low-quality but high-scoring tracking results, researchers have resorted to foreground reinforcement learning to suppress the scores of positive samples near edges. However, for training with negative samples, all backgrounds are equally labeled as false. In this way, the interdependence and difference between the foreground and the background are not considered. We interpret the underlying reason for drift as the imbalance between the embedding of background and foreground information. Specifically, catastrophic tracking results and common tracking errors should not be treated equally; instead, the implicit connection between the foreground and background should be strengthened. In this paper, we propose a Mutual Attention (MA) module to strengthen the interdependence between positive and negative samples. It can aggregate the rich contextual interdependence between the target template and the search area, thereby providing an implicit way to update the target template accordingly. As for the difference, we design a background training enhancement (BTE) mechanism to distinguish negative samples with varying degrees of error, that is, to down-weight severely erroneous tracking results to improve the robustness of the tracker. Results on a large number of benchmarks, such as OTB-100, VOT-2018, VOT-2019, and LaSOT, indicate the validity of our method.

Abstract:
Incomplete multi-view clustering, which aims to solve the clustering problem on the incomplete multi-view data with partial view missing, has received more and more attention in recent years. Although numerous methods have been developed, most of the methods either cannot flexibly handle the incomplete multi-view data with arbitrary missing views or do not consider the negative factor of information imbalance among views. Moreover, some methods do not fully explore the local structure of all incomplete views. To tackle these problems, this paper proposes a simple but effective method, named localized sparse incomplete multi-view clustering (LSIMVC). Different from the existing methods, LSIMVC intends to learn a sparse and structured consensus latent representation from the incomplete multi-view data by optimizing a sparse regularized and novel graph embedded multi-view matrix factorization model. Specifically, in such a novel model based on the matrix factorization, a norm based sparse constraint is introduced to obtain the sparse low-dimensional individual representations and the sparse consensus representation. Moreover, a novel local graph embedding term is introduced to learn the structured consensus representation. Different from the existing works, our local graph embedding term aggregates the graph embedding task and consensus representation learning task into a concise term. Furthermore, to reduce the imbalance factor of incomplete multi-view learning, an adaptive weighted learning scheme is introduced to LSIMVC. Comprehensive experimental results performed on six incomplete multi-view databases verify that the performance of our LSIMVC is superior to the state-of-the-art IMC approaches.

Abstract:
Existing real-time text detectors reconstruct text contours by shrink-masks only. Although this simplifies the framework and makes the model run fast, the strong dependence on shrink-masks leads to unreliable detection results (e.g., missed detections and over-detections). Moreover, these methods ignore the information from surrounding pixels, which makes shrink-masks sensitive and accelerates the decline in reliability of detection results. Considering the above problems, we construct an effective and efficient text detection network, termed Reinforcement Shrink-Mask for Text Detection (RSMTD), which strengthens the model's ability to recognize texts while enjoying a high detection speed. Specifically, an effective text representation strategy (Reinforcement Shrink-Mask, RSM) is designed to decouple texts and shrink-masks. RSM builds texts through shrink-masks and reinforcement offsets to ensure stable detection results even when shrink-masks deviate from the ground-truth. It is worth noting that reinforcement offsets can force our method to focus on the foreground shapes to bring precise shrink-mask edges. To improve the robustness of shrink-masks, a Super-pixel Window (SPW) is proposed to encourage RSMTD to utilize the surroundings of each pixel to predict shrink-masks. In particular, SPW treats the interval regions between texts and shrink-masks as background, which helps to suppress interval regions and to avoid text adhesion. Moreover, a lightweight feature merging branch is constructed to further accelerate the inference process. As demonstrated in the experiments, our method is superior to existing state-of-the-art (SOTA) methods in both detection accuracy and speed on multiple benchmarks.
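
For background, shrink-masks in earlier detectors such as PSENet and DB are obtained by moving the edges of a text polygon inward by an offset d = A(1 - r^2)/L, where A is the polygon area, L its perimeter, and r the shrink ratio. The abstract does not say how RSMTD generates its shrink-masks, so the sketch below only illustrates that common convention; the rectangle and the ratio 0.4 are illustrative assumptions.

```python
import math

def polygon_area(points):
    """Absolute area of a simple polygon via the shoelace formula."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0

def polygon_perimeter(points):
    """Sum of edge lengths of a closed polygon."""
    n = len(points)
    total = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

def shrink_offset(points, ratio):
    """Inward offset d = A * (1 - r^2) / L used to shrink a text polygon."""
    return polygon_area(points) * (1.0 - ratio ** 2) / polygon_perimeter(points)

# Hypothetical 100 x 20 text box and an illustrative shrink ratio.
text_box = [(0.0, 0.0), (100.0, 0.0), (100.0, 20.0), (0.0, 20.0)]
offset = shrink_offset(text_box, ratio=0.4)
```

Because the offset depends only on area and perimeter, thin text lines are shrunk by only a few pixels, which is why small errors in the predicted shrink-mask can noticeably change the reconstructed contour.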

Abstract:
With the diversity of information acquisition, data is stored and transmitted in an increasing number of modalities. Nevertheless, it is not unusual for parts of the data to be lost in some views due to unavoidable acquisition, transmission or storage errors. In this paper, we propose an augmentation-free graph contrastive learning framework to solve the problem of partial multi-view clustering. Notably, we suppose that the representations of similar samples (i.e., belonging to the same cluster) should be similar. This is distinct from the general unsupervised contrastive learning that assumes an image and its augmentations share a similar representation. Specifically, relation graphs are constructed using the nearest neighbors to identify existing similar samples, then the constructed inter-instance relation graphs are transferred to the missing views to build graphs on the corresponding missing data. Subsequently, two main components, within-view graph contrastive learning and cross-view graph consistency learning, are devised to maximize the mutual information of different views within a cluster. The proposed approach elevates instance-level contrastive learning and missing data inference to the cluster-level, effectively mitigating the impact of individual missing data on clustering. Experiments on several challenging datasets demonstrate the superiority of our proposed methods.
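
The graph construction and transfer step can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: it uses Euclidean k-nearest neighbours, two views, and lets a sample that is missing in the second view inherit its neighbour list from the first view. The function names and toy data are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_graph(features, k):
    """Map each sample id to the ids of its k nearest neighbours in one view."""
    graph = {}
    for i, fi in features.items():
        ranked = sorted((euclidean(fi, fj), j) for j, fj in features.items() if j != i)
        graph[i] = [j for _, j in ranked[:k]]
    return graph

def transfer_graph(source_graph, target_features, k):
    """Build the target-view graph; samples missing there reuse source neighbours."""
    target_graph = knn_graph(target_features, k)
    for i, neighbours in source_graph.items():
        if i not in target_features:
            target_graph[i] = list(neighbours)
    return target_graph

# Toy data: sample 3 is observed in view 1 but missing in view 2.
view1 = {0: (0.0, 0.0), 1: (0.0, 1.0), 2: (5.0, 5.0), 3: (5.0, 6.0)}
view2 = {0: (1.0, 1.0), 1: (1.0, 2.0), 2: (9.0, 9.0)}

graph1 = knn_graph(view1, k=1)
graph2 = transfer_graph(graph1, view2, k=1)
```

Here sample 3 has no features in the second view, yet it still receives a neighbour there, which is what allows cluster-level relations to be defined on missing data.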

Abstract:
Despite the tremendous advances in denoising techniques, it's still challenging to restore a clean image with salient structures based on one noisy observation, especially at high noise levels. In this work, we propose a frequency-domain guided denoising algorithm to conduct denoising with the help of a well-aligned guidance image. Thanks to their structural correlations, the frequency characteristics of the guidance image can indicate whether the frequency coefficients of the noisy target image are contributed by noise or textures. Therefore, the explicit frequency decomposition enables our denoising model to avoid over-smoothing detailed contents. However, as two input images are usually captured in different fields, their structures are not always consistent. Therefore, we model guided denoising with an optimization problem which considers both the representation model of the guidance image and the fidelity to the noisy target. Further, we design a convolutional neural network, called as FGDNet, to explore the optimal solution. Due to the visual masking phenomenon, human eyes are sensitive to noise in the flat areas, but may not perceive noise around edges or textures. Therefore, we expect to remove as much noise as possible to guarantee the spatial smoothness of flat contents, while also preserving high-frequency structures. Through frequency decomposition, our model can process the low-frequency and high-frequency contents separately. We also adopt a frequency-relevant loss function to train the network. Experimental results show that, compared with state-of-the-art guided and non-guided denoisers, our FGDNet achieves higher denoising accuracy and better visual quality in both flat and texture-rich regions.

Abstract:
How to estimate the quality of the network output is an important issue, and currently there is no effective solution in the field of human parsing. To solve this problem, this work proposes a statistical method based on the output probability map to calculate the pixel classification quality, which is called the pixel score. In addition, the Quality-Aware Module (QAM) is proposed to fuse the different quality information, the purpose of which is to estimate the quality of human parsing results. We combine QAM with a concise and effective network design to propose the Quality-Aware Network (QANet) for human parsing. Benefiting from the superiority of QAM and QANet, we achieve the best performance on three multiple and one single human parsing benchmarks, including CIHP, MHP-v2, Pascal-Person-Part, ATR and LIP. Without increasing the training and inference time, QAM improves the AP^r criterion by more than 10 points in the multiple human parsing task. QAM can be extended to other tasks with good quality estimation, e.g., instance segmentation. Specifically, QAM improves Mask R-CNN by ~1% mAP on the COCO and LVISv1.0 datasets. Based on the proposed QAM and QANet, our overall system wins 1st place in the CVPR2021 L2ID High-resolution Human Parsing (HRHP) Challenge, and 2nd place in the CVPR2021 PIC Short-video Face Parsing (SFP) Challenge. Code and models are available at https://github.com/soeaver/QANet.
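
One plausible way to turn an output probability map into a single quality number is to average the probabilities of the pixels the network is reasonably confident about. The sketch below assumes exactly that; the threshold value, the function name, and the toy map are assumptions, and the abstract does not give the actual definition of the pixel score.

```python
def pixel_score(prob_map, threshold=0.2):
    """Mean probability over pixels whose probability exceeds `threshold`.

    Returns 0.0 when no pixel passes the threshold.
    """
    confident = [p for row in prob_map for p in row if p > threshold]
    return sum(confident) / len(confident) if confident else 0.0

# Hypothetical 2 x 2 probability map for one predicted part.
prob_map = [
    [0.9, 0.1],
    [0.7, 0.8],
]
score = pixel_score(prob_map)
```

A statistic of this kind needs no extra network branch, which is consistent with the abstract's claim that training and inference time do not increase.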

Abstract:
Referring expression comprehension (REC) aims to identify and locate a specific object in visual scenes referred to by a natural language expression. Existing studies of REC only focus on basic visual attributes and neglect scene text. Since scene text has the functions of object identification and disambiguation, it is naturally and frequently used to refer to objects. However, existing methods do not explicitly recognize text in images and fail to align scene text mentioned in expressions with the text shown in images, resulting in object localization errors. This article takes the first step toward addressing these limitations. First, we introduce a new task called scene-text oriented referring expression comprehension, which aims to align visual cues and textual semantics of scene text with referring expressions and visual contents. Second, we propose a scene text awareness network that can bridge the gap between texts from two modalities by grounding visual representations of expression-correlated scene texts. Specifically, we propose a correlated text extraction module to solve the problem of lacking semantic understanding, and a correlated region activation module to address the fixed alignment problem and absent alignment problem. These modules ensure that the proposed method focuses on local regions that are most relevant to scene text, thus mitigating the misalignment of scene text with irrelevant regions. Third, to conduct quantitative evaluations, we establish a new benchmark dataset called RefText. Experimental results demonstrate that the proposed method can effectively comprehend scene-text oriented referring expressions and achieves excellent performance.

Abstract:
Data-free knowledge distillation further broadens the applications of the distillation model. Nevertheless, the problem of providing diverse data with rich expression patterns needs to be further explored. In this paper, a novel dynastic data-free knowledge distillation (D^3K) model is proposed to alleviate this problem. In this model, a dynastic supernet generator (D-SG) with a flexible network structure is proposed to generate diverse data. The D-SG can adaptively alter architectural configurations and activate different subnet generators in different sequential iteration spaces. The variable network structure increases the complexity and capacity of the generator, and strengthens its ability to generate diversified data. In addition, a novel additive constraint based on the differentiable dhash (D-Dhash) is designed to guide the structure parameter selection of the D-SG. This constraint forces the D-SG to constantly jump out of the fixed generation mode and generate diverse data in semantics and instance. The effectiveness of the proposed model is verified on the experimental benchmark datasets (MNIST, CIFAR-10, CIFAR-100, and SVHN).

Abstract:
Recently, Multi-Object Tracking (MOT) has attracted rising attention, and accordingly, remarkable progress has been achieved. However, the existing methods tend to use various basic models (e.g., detector and embedding model) and different training or inference tricks. As a result, the construction of a good baseline for a fair comparison is essential. In this paper, a classic tracker, i.e., DeepSORT, is first revisited and then significantly improved from multiple perspectives such as object detection, feature embedding, and trajectory association. The proposed tracker, named StrongSORT, contributes a strong and fair baseline for the MOT community. Moreover, two lightweight and plug-and-play algorithms are proposed to address two inherent “missing” problems of MOT: missing association and missing detection. Specifically, unlike most methods, which associate short tracklets into complete trajectories at high computational complexity, we propose an appearance-free link model (AFLink) to perform global association without appearance information, and achieve a good balance between speed and accuracy. Furthermore, we propose a Gaussian-smoothed interpolation (GSI) based on Gaussian process regression to relieve the missing detection. AFLink and GSI can be easily plugged into various trackers with a negligible extra computational cost (1.7 ms and 7.1 ms per image, respectively, on MOT17). Finally, by fusing StrongSORT with AFLink and GSI, the final tracker (StrongSORT++) achieves state-of-the-art results on multiple public benchmarks, i.e., MOT17, MOT20, DanceTrack and KITTI. Codes are available at https://github.com/dyhBUPT/StrongSORT and https://github.com/open-mmlab/mmtracking.
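
The idea of smoothing an interpolated track with Gaussian process regression can be sketched in plain Python. This is a minimal illustration, not the released implementation: it assumes gaps are first filled linearly, uses an RBF kernel with a fixed length scale and noise level (both hypothetical values), and smooths a single box coordinate.

```python
import math

def linear_fill(frames, values):
    """Linearly interpolate a value for every integer frame between observations."""
    filled = {}
    pairs = list(zip(frames, values))
    for (f0, v0), (f1, v1) in zip(pairs, pairs[1:]):
        for f in range(f0, f1):
            filled[f] = v0 + (v1 - v0) * (f - f0) / (f1 - f0)
    filled[frames[-1]] = values[-1]
    return filled

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def gp_smooth(frames, values, length_scale=5.0, noise=1.0):
    """Posterior mean of a GP with RBF kernel, fitted to mean-centred values."""
    mean = sum(values) / len(values)

    def rbf(s, t):
        return math.exp(-((s - t) ** 2) / (2.0 * length_scale ** 2))

    # Kernel matrix plus observation noise on the diagonal.
    k = [[rbf(s, t) + (noise if i == j else 0.0) for j, t in enumerate(frames)]
         for i, s in enumerate(frames)]
    alpha = solve(k, [v - mean for v in values])
    return [mean + sum(rbf(s, t) * a for t, a in zip(frames, alpha)) for s in frames]

# Hypothetical track: box centre x observed at frames 1, 2, 5, 6 (3 and 4 missing).
observed_frames = [1, 2, 5, 6]
observed_x = [10.0, 12.0, 18.0, 20.0]

filled = linear_fill(observed_frames, observed_x)
frames = sorted(filled)
smoothed_x = gp_smooth(frames, [filled[f] for f in frames])
```

The linear fill supplies positions for the missing frames, and the regression step then replaces the piecewise-linear path with a smoother one. In practice each box coordinate would be smoothed in the same way.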

Abstract:
The basic goal of the Automatic Check-Out (ACO) task is to accurately predict the categories and quantities of products selected by customers in check-out images. However, there is a significant domain gap between the single-product exemplars used as training data and the check-out images used as testing data. To mitigate the domain gap, we propose a novel method termed Prototype Learning for Automatic Check-Out (PLACO). In PLACO, prototype learning is designed to reach the goal in two ways. Specifically, in the prototype-based classifier learning module, to fully exploit the invariance of category prototypes, the prototypes obtained from the single-product exemplars are employed to generate classifiers for classifying the proposals of the check-out image. On the other hand, in the prototype alignment module, prototypes for both the single-product exemplar and check-out image domains are entered simultaneously to ensure intra-category compactness and inter-category sparsity. Moreover, to further improve the performance of PLACO, we develop a discriminative re-ranking module that both adjusts the predicted scores of product proposals to bring more discriminative ability to classifier learning and provides a reasonable sorting possibility by considering the fine-grained nature. Experiments are conducted on the large-scale RPC dataset for evaluation. Our PLACO obtains the best results in both the traditional ACO task setting and the incremental task setting.
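
The general idea of classifying proposals with category prototypes can be sketched as below. It assumes that a prototype is the mean embedding of a category's exemplars and that a proposal is assigned to the prototype with the highest cosine similarity; PLACO's actual classifier generation is not specified in the abstract, and the category names and embeddings are hypothetical.

```python
import math

def class_prototypes(exemplars):
    """Average the exemplar embeddings of each category into one prototype."""
    prototypes = {}
    for label, vectors in exemplars.items():
        dim = len(vectors[0])
        prototypes[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    return prototypes

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify(embedding, prototypes):
    """Return the category whose prototype is most similar to the embedding."""
    return max(prototypes, key=lambda label: cosine(embedding, prototypes[label]))

# Hypothetical single-product exemplar embeddings for two categories.
exemplars = {
    "cola": [[1.0, 0.1], [0.9, 0.0]],
    "chips": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = class_prototypes(exemplars)
predicted = classify([0.8, 0.2], prototypes)  # embedding of one check-out proposal
```

Because the prototypes are computed from exemplars alone, the classifier can be built without labeled check-out images, which is where the domain gap arises.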

Abstract:
In the animation industry, automatically predicting the quality of cartoon images subject to general distortions and color change is an urgent task, while existing no-reference (NR) methods usually measure the perceptual quality of natural images. In this paper, based on the observation that structure and color are the main factors affecting cartoon image quality, we propose a new NR quality prediction metric for cartoon images, which fully takes gradient and color information into account. The experimental results on our newly constructed NBU-CIQAD dataset with color change and on other existing cartoon image data demonstrate that the proposed method significantly outperforms existing no-reference methods for the task of cartoon image quality assessment. The database and code will be released at https://github.com/1010075746/NBU-CIQAD.

Abstract:
Key semantics can come from everywhere on an image. Semantic alignment is a key part of few-shot learning but still remains challenging. In this paper, we design a Mixer-Based Semantic Spread (MBSS) algorithm that employs a mixer module to spread the key semantic on the whole image, so that one can directly compare the processed image pairs. We first adopt a convolutional neural network to extract features from both support and query images and separate each of them into multiple Local Descriptor-based Representations (LDRs). The LDRs are then fed into the mixer for semantic spread, where every LDR attracts complementary information from its peers. In this way, the objective semantic is made spread on the whole image in a data-driven manner. The overall pipeline is supervised by a voting-based loss, guaranteeing a good mixer. Visualization results validate the feasibility of our mixer. Comprehensive experiments on three benchmark datasets, miniImageNet, tieredImageNet, and CUB, show that our algorithm achieves the state-of-the-art performance in both 5-way 1-shot and 5-way 5-shot settings.

Abstract:
While recent progress has significantly boosted few-shot classification (FSC) performance, few-shot object detection (FSOD) remains challenging for modern learning systems. Existing FSOD systems follow FSC approaches, ignoring critical issues such as spatial variability and uncertain representations, and consequently result in low performance. Observing this, we propose a novel Dual-Awareness Attention (DAnA) mechanism that enables networks to adaptively interpret the given support images. DAnA transforms support images into query-position-aware (QPA) features, guiding detection networks precisely by assigning customized support information to each local region of the query. In addition, the proposed DAnA component is flexible and adaptable to multiple existing object detection frameworks. By adopting DAnA, conventional object detection networks, Faster R-CNN and RetinaNet, which are not designed explicitly for few-shot learning, reach state-of-the-art performance in FSOD tasks. In comparison with previous methods, our model significantly increases the performance by 47% (+6.9 AP), showing remarkable ability under various evaluation settings.

Abstract:
Top-k recommendation is a fundamental task in recommendation systems that is generally learned by comparing positive and negative pairs. The contrastive loss (CL) is the key in contrastive learning that has recently received more attention, and we find that it is well suited for top-k recommendations. However, CL is problematic because it treats the importance of the positive and negative samples the same. On the one hand, CL faces the imbalance problem of one positive sample and many negative samples. On the other hand, there are so few positive items in sparser datasets that their importance should be emphasized. Moreover, the other important issue is that the sparse positive items are still not sufficiently utilized in recommendations. Consequently, we propose a new data augmentation method by using multiple positive items (or samples) simultaneously with the CL loss function. Therefore, we propose a multisample-based contrastive loss (MSCL) function that solves the two problems by balancing the importance of positive and negative samples and data augmentation. Based on the graph convolution network (GCN) method, experimental results demonstrate the state-of-the-art performance of MSCL. The proposed MSCL is simple and can be applied in many methods. Our code is available at https://github.com/haotangxjtu/MSCL.
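
A generic way to use several positive items in a contrastive loss is to average a single-positive InfoNCE-style loss over all positives of a user. The sketch below shows only that generic form; it is not claimed to be the MSCL formulation, and the temperature and similarity scores are illustrative assumptions.

```python
import math

def info_nce(pos_score, neg_scores, temperature=0.1):
    """Contrastive loss for one positive score against a list of negative scores."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

def multi_positive_loss(pos_scores, neg_scores, temperature=0.1):
    """Average the single-positive loss over several positives of the same user."""
    losses = [info_nce(p, neg_scores, temperature) for p in pos_scores]
    return sum(losses) / len(losses)

# Hypothetical user-item similarity scores.
positives = [0.9, 0.3]
negatives = [0.1, 0.2]
loss = multi_positive_loss(positives, negatives)
```

With this form, every extra positive contributes its own gradient signal, which is one way sparse positive items could be used more fully.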

Abstract:
Deep hashing methods have achieved tremendous success in cross-modal retrieval, due to their low storage consumption and fast retrieval speed. Supervised cross-modal hashing methods have achieved substantial advancement by incorporating semantic information. However, supervised methods rely to a great extent on large-scale labeled cross-modal training data, which are laborious to obtain. Moreover, most cross-modal hashing methods only handle the two modalities of image and text, without taking scenarios with multiple modalities into consideration. In this paper, we propose a novel semi-supervised approach called semi-supervised knowledge distillation for cross-modal hashing (SKDCH) to overcome the above-mentioned challenges, which enables guiding a supervised method using outputs produced by a semi-supervised method for multimodality retrieval. Specifically, we utilize teacher-student optimization to propagate knowledge. Furthermore, we improve the triplet ranking loss to better mitigate the heterogeneity gap, which increases the discriminability of our proposed approach. Extensive experiments executed on two benchmark datasets validate that the proposed SKDCH surpasses the state-of-the-art methods.
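
For reference, the standard triplet ranking loss that such methods start from can be written in a few lines. This is the textbook form with squared Euclidean distance, not the improved variant proposed in the paper; the margin and the example codes are illustrative assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_ranking_loss(anchor, positive, negative, margin=1.0):
    """max(0, margin + d(anchor, positive) - d(anchor, negative))."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)

# Hypothetical relaxed hash codes: an image anchor with matching and non-matching text.
image_code = [1.0, 1.0]
matching_text = [1.0, 0.5]
far_text = [-1.0, -1.0]
near_text = [1.0, 0.0]

easy_loss = triplet_ranking_loss(image_code, matching_text, far_text)
hard_loss = triplet_ranking_loss(image_code, matching_text, near_text)
```

The loss is zero once the non-matching sample is farther from the anchor than the matching one by at least the margin, so only hard triplets contribute to training.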

Abstract:
Deep learning models often fit undesired dataset bias in training. In this paper, we formulate the bias using causal inference, which helps us uncover the ever-elusive causalities among the key factors in training, and thus pursue the desired causal effect without the bias. We start by revisiting the process of building a visual recognition system, and then propose a structural causal model (SCM) for the key variables involved in dataset collection and the recognition model: object, common sense, bias, context, and label prediction. Based on the SCM, one can observe that there are “good” and “bad” biases. Intuitively, in an image where a car is driving on a highway in a desert, the “good” bias denoting the common-sense context is the highway, and the “bad” bias accounting for the noisy context factor is the desert. We tackle this problem with a novel causal interventional training (CIT) approach, where we control the observed context in each object class. We offer theoretical justifications for CIT and validate it with extensive classification experiments on CIFAR-10, CIFAR-100 and ImageNet, e.g., surpassing the standard deep neural networks ResNet-34 and ResNet-50 by 0.95% and 0.70% accuracy, respectively, on ImageNet. Our code is open-sourced on GitHub at https://github.com/qinwei-hfut/CIT.

Abstract:
Due to their excellent performance in aggregating global features, Transformer structures have recently been widely employed in deep learning-based visual object tracking algorithms. Nevertheless, existing Transformer-based trackers still fail to handle occlusion problems due to drift in feature distributions. To address this issue, we introduce domain adaptation techniques into a novel object tracking framework, DATransT, which comprises feature extraction, a domain adaptive Transformer module and a prediction head. The domain adaptive Transformer module consists of three weight-sharing branches with self- and cross-attention mechanisms: the source, the target and the source-target branches. Specifically, the source-target branch employs cross-attention to effectively align the feature distributions of the source and target branches. Meanwhile, we present a pseudo-labeling strategy to generate high-quality training samples. Extensive experiments show that DATransT obtains promising results on several popular datasets, including LaSOT, TrackingNet, GOT-10k, NfS, OTB2015 and UAV123. Moreover, our method outperforms existing state-of-the-art trackers under full occlusions and partial occlusions.

Abstract:
Semi-supervised learning is an effective way to leverage massive unlabeled data. In this paper, we propose a novel training strategy, termed Semi-supervised Contrastive Learning (SsCL), which combines the well-known contrastive loss in self-supervised learning with the cross entropy loss in semi-supervised learning, and jointly optimizes the two objectives in an end-to-end way. The highlight is that, unlike self-training based semi-supervised learning that conducts prediction and retraining over the same model weights, SsCL interchanges the predictions over the unlabeled data between the two branches, and thus formulates a co-calibration procedure, which we find is beneficial for better prediction and avoids being trapped in local minima. Towards this goal, the contrastive loss branch models pairwise similarities among samples, using the pseudo labels generated from the cross entropy branch, and in turn calibrates the prediction distribution of the cross entropy branch with the contrastive similarity. We show that SsCL produces more discriminative representations and is beneficial to semi-supervised learning. Notably, on ImageNet with ResNet50 as the backbone, SsCL achieves 60.2% and 72.1% top-1 accuracy with 1% and 10% labeled samples, respectively, which significantly outperforms the baseline and is better than previous semi-supervised and self-supervised methods.

Abstract:
Semantic segmentation is a classic computer vision task with multiple applications, including medical and remote sensing image analysis. Despite recent advances with deep learning-based approaches, labeling samples (pixels) for training models is laborious and, in some cases, unfeasible. In this paper, we present two novel meta-learning methods, named WeaSeL and ProtoSeg, for the few-shot semantic segmentation task with sparse annotations. We conducted an extensive evaluation of the proposed methods in different applications (12 datasets) in medical imaging and agricultural remote sensing, which are very distinct fields of knowledge and usually subject to data scarcity. The results demonstrated the potential of our methods, achieving suitable results for segmenting both coffee/orange crops and anatomical parts of the human body in comparison with full dense annotation.

Abstract:
Large viewpoint variation and irrelevant content around the target always hinder accurate image retrieval and its subsequent tasks. In this paper, we investigate an extremely challenging task: given a ground-view image of a landmark, we aim to achieve cross-view geo-localization by searching out its corresponding satellite-view images. Specifically, the challenge comes from the gap between ground-view and satellite-view, which includes not only large viewpoint changes (some parts of the landmark may be invisible from front view to top view) but also highly irrelevant background (the target landmark tends to be hidden among surrounding buildings), making it difficult to learn a common representation or a suitable mapping. To address this issue, we take advantage of drone-view information as a bridge between ground-view and satellite-view domains. We propose a Peer Learning and Cross Diffusion (PLCD) framework. PLCD consists of three parts: 1) peer learning across ground-view and drone-view to find visible parts to benefit ground-drone cross-view representation learning; 2) a patch-based network for satellite-drone cross-view representation learning; and 3) cross diffusion between ground-drone space and satellite-drone space. Extensive experiments conducted on the University-Earth and University-Google datasets show that our method significantly outperforms state-of-the-art methods.

Abstract:
With the increasing popularity of virtual reality and augmented reality, the application of point clouds is in critical demand as it enables users to freely navigate in an immersive scene with six degrees of freedom. However, point clouds usually comprise large amounts of data, and are thus difficult to stream in bandwidth-constrained networks. It is therefore important, yet challenging, to efficiently stream the resource-intensive point clouds, such that the user’s quality of experience (QoE) is guaranteed at a high level but with low bandwidth consumption. To this end, we propose a QoE-driven adaptive streaming approach for tile-based point cloud transmission, to maximize the user’s QoE while reducing the transmission redundancy. By exploiting the perspective projection, we specifically model the QoE of a 3D tile as a function of the bitrate of its representation, the user’s view frustum and spatial position, occlusion between tiles, and the resolution of the rendering device. Based on this QoE model, we then formulate the QoE-optimized rate adaptation problem as a multiple-choice knapsack problem, which allocates bitrates for different tiles under a given transmission capacity. It is equivalently converted to a submodular function maximization problem subject to knapsack constraints, and solved by a practical greedy-based algorithm with a theoretical worst-case performance guarantee. The proposed algorithm is able to achieve near-optimal performance with very low computational complexity. Experimental results further demonstrate the superiority of the proposed rate adaptation algorithm over existing schemes, in terms of both the user’s visual quality and transmission efficiency.
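
The multiple-choice knapsack formulation above can be illustrated with a small sketch. This is not the paper's algorithm or its QoE model: it is a generic marginal-gain greedy heuristic, and the function name `greedy_rate_allocation`, the `(bitrate, qoe)` option format and the sample numbers are assumptions made purely for illustration.

```python
from typing import List, Tuple


def greedy_rate_allocation(
    tiles: List[List[Tuple[float, float]]], capacity: float
) -> List[int]:
    """Pick one representation per tile under a total bitrate budget.

    tiles[i] is a list of hypothetical (bitrate, qoe) options for tile i,
    sorted by increasing bitrate. Returns the chosen option index per tile.
    """
    # Start every tile at its cheapest representation.
    choice = [0] * len(tiles)
    used = sum(options[0][0] for options in tiles)
    if used > capacity:
        raise ValueError("capacity cannot fit the lowest representations")

    while True:
        best_tile, best_ratio = None, 0.0
        for i, options in enumerate(tiles):
            k = choice[i]
            if k + 1 >= len(options):
                continue  # tile is already at its highest representation
            d_rate = options[k + 1][0] - options[k][0]
            d_qoe = options[k + 1][1] - options[k][1]
            if d_rate <= 0 or used + d_rate > capacity:
                continue  # upgrade is invalid or does not fit the budget
            ratio = d_qoe / d_rate  # QoE gained per unit of extra bitrate
            if ratio > best_ratio:
                best_tile, best_ratio = i, ratio
        if best_tile is None:
            return choice  # no feasible upgrade improves QoE
        k = choice[best_tile]
        used += tiles[best_tile][k + 1][0] - tiles[best_tile][k][0]
        choice[best_tile] += 1


if __name__ == "__main__":
    demo_tiles = [
        [(1, 1.0), (2, 3.0), (4, 4.0)],  # tile 0 options
        [(1, 0.5), (3, 2.5)],            # tile 1 options
    ]
    print(greedy_rate_allocation(demo_tiles, capacity=5))
```

Each step upgrades the tile whose next representation yields the most QoE per extra unit of bitrate, which is the usual intuition behind greedy knapsack heuristics; the worst-case guarantee mentioned in the abstract belongs to the authors' algorithm, not to this sketch.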

Abstract:
In this paper, a Part-aware Relation Modeling (PRM) is developed to handle the task of human parsing. For pixel-level recognition, it is essential to generate features with adaptive context for various sizes and shapes of human parts. To address the issue, we adaptively capture contexts based on the part-aware relation mechanism. PRM mainly consists of three modules, including a part class module, a part-relation aggregation module, and a part-relation dispersion module. The part class module selectively enhances spatial details of the high-level features to obtain enhanced original features, and then extracts the high-level representations of every human part from a categorical perspective. The part-relation aggregation module is developed to extract the representative global context by exploring associated semantics of human parts, adaptively augmenting the context for human parts. The part-relation dispersion module is designed to generate the discriminative and effective local context and neglect the distracting one by making the affinity of human parts disperse. It ensures that features of the same class will be close to each other and away from those of different classes. By fusing the outputs of the two part-relation modules and the first outputs of the part class module, our PRM produces adaptive contextual features for diverse sizes of human parts, boosting the parsing accuracy. Extensive experiments are conducted to validate the effectiveness of our network, and a new state-of-the-art segmentation performance is achieved on three challenging human parsing datasets, i.e., PASCAL-Person-Part, LIP, and CIHP. PRM is also extended to other tasks like animal parsing, and exhibits its generality.

Abstract:
Semantic segmentation is a fundamental problem in multimedia which requires delicate per-pixel predictions of object categories. Recently, many researchers have strived to refine the pixel-wise feature with spatial-contextual information. However, many of them still neglect the invisible hand of cross-channel information, which provides inherent semantics to facilitate the segmentation performance. On the one hand, in the feature extraction stage, enhancing informative channels and suppressing trivial ones contributes to the acquisition of valuable semantic features, thus improving the segmentation accuracy. On the other hand, in the prediction stage, we can predict the complete objects more clearly by finding the connections and complements between different channels, which can also contribute to the pixel prediction. Based on this idea, we propose a novel Channel-Adaptive Network for semantic segmentation, which is capable of enhancing the features from the perspective of channels in both the feature extraction stage and the prediction stage. Specifically, we propose two modules: (i) the Comprehensive Information Channel Attention (CiCA) module, which addresses the shortcomings of existing channel attention by learning both low and high frequency components within each channel to emphasize the informative channels; (ii) the Inter-Channel Relationship Reasoning (iCRR) module, which is applied on top of the feature extractor to adaptively enhance the interdependent channels by mining the complementary associations between them. Besides, our Channel-Adaptive Network is highly flexible, with a plug-and-play design. Extensive experiments have demonstrated that our method achieves state-of-the-art segmentation performance on three challenging datasets, including Cityscapes (82.1%), ADE20K (46.51%) and PASCAL Context (55.0%).

Abstract:
Cross-modal retrieval aims to retrieve relevant data from another modality when given a query of one modality. Although most existing methods that rely on the label information of multimedia data have achieved promising results, the performance benefiting from labeled data comes at a high cost since labeling data often requires enormous labor resources, especially on large-scale multimedia datasets. Therefore, unsupervised cross-modal learning is of crucial importance in real-world applications. In this paper, we propose a novel unsupervised cross-modal retrieval method, named Self-supervised Correlation Learning (SCL), which takes full advantage of large amounts of unlabeled data to learn discriminative and modality-invariant representations. Since unsupervised learning lacks the supervision of category labels, we incorporate the knowledge from the input as a supervisory signal by maximizing the mutual information between the input and the output of different modality-specific projectors. Besides, for the purpose of learning discriminative representations, we exploit unsupervised contrastive learning to model the relationship among intra- and inter-modality instances, which makes similar samples closer and pushes dissimilar samples apart. Moreover, to further eliminate the modality gap, we use a weight-sharing scheme and minimize the modality-invariant loss in the joint representation space. Beyond that, we also extend the proposed method to the semi-supervised setting. Extensive experiments conducted on three widely-used benchmark datasets demonstrate that our method achieves competitive results compared with current state-of-the-art cross-modal retrieval approaches.

Abstract:
In this work, we focus on Interactive Human Parsing (IHP), which aims to segment a human image into multiple human body parts with guidance from users’ interactions. This new task inherits the class-aware property of human parsing, which cannot be well solved by traditional interactive image segmentation approaches that are generally class-agnostic. To tackle this new task, we first exploit user clicks to identify different human parts in the given image. These clicks are subsequently transformed into semantic-aware localization maps, which are concatenated with the RGB image to form the input of the segmentation network and generate the initial parsing result. To enable the network to better perceive the user’s intent during the correction process, we investigate several principal ways for the refinement, and reveal that random-sampling-based click augmentation is the best way to promote the correction effectiveness. Furthermore, we also propose a semantic-perceiving loss (SP-loss) to augment the training, which can effectively exploit the semantic relationships of clicks for better optimization. To the best of our knowledge, this work is the first attempt to tackle the human parsing task under the interactive setting. Our IHP solution achieves 85% mIoU on the benchmark LIP, 80% mIoU on PASCAL-Person-Part and CIHP, and 75% mIoU on Helen with only 1.95, 3.02, 2.84 and 1.09 clicks per class, respectively. These results demonstrate that high-quality human parsing masks can be acquired with only a little human effort. We hope this work can motivate more researchers to develop data-efficient solutions to IHP in the future.

Affiliations: College of Telecommunications and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing, China; College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing, China; Guangdong Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen, China; School of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu, China; School of Computer Science and Engineering, Nanyang Technological University, Singapore

Abstract:
In recent video-based point cloud compression (V-PCC), 3D point clouds are projected onto 2D images and compressed by High-Efficiency Video Coding (HEVC). However, HEVC was originally designed for natural visual signals, which makes it a suboptimal framework for point clouds. Therefore, there are still problems in geometry information compression in V-PCC: (1) the distortion based on the sum of squared error (SSE) in the existing rate-distortion optimization (RDO) is inconsistent with the geometric quality measurement; (2) the existing prediction cannot exploit the fixed relationship between the corresponding far layer and near layer depths, namely that the far layer depth is always no less than the corresponding near layer depth. In this paper, we present an efficient geometry surface coding (EGSC) method for V-PCC to address these problems. Firstly, an error projection (EP) model is designed to establish the relationship between the SSE-based distortion and the geometry quality metric. Secondly, an EP-based RDO is employed to improve the geometry information compression by estimating the point normals with gradients. Finally, an occupancy-map driven scheme is proposed to improve the prediction accuracy of merge modes. Experimental results show that the proposed method achieves an average bit-rate saving of over 10% compared with the V-PCC reference software.

Abstract:
Recent studies have achieved remarkable success using deep generative models for the image animation of an arbitrary object. However, previous methods synthesize animated results in a frame-by-frame manner, which is prone to producing flickering and temporally inconsistent results. In this paper, we propose a novel self-supervised framework leveraging temporal information for image animation. Our framework processes a video clip directly instead of processing each frame independently. To achieve coherence in the animated video, we design a spatial-temporal correspondence network (STCN) to maintain the consistency of the keypoints. Specifically, the STCN takes full advantage of temporal information to propagate the keypoints between adjacent frames, and it can be trained with consistent keypoints during the forward and backward process. Furthermore, we apply a 3D-CNN-based generator and discriminator in our framework to ensure coherence in the final output video. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method.

Abstract:
Color constancy is the ability to remove the effect of illumination on color. Since color constancy is an ill-posed problem, many methods have been proposed based on assumptions that constrain the solution space. However, most existing assumptions require specular pixels or abundant colors, and fail to produce satisfactory results for different scenarios. Based on extensive experiments, we observe that the chromaticity distribution of pixels within the main color under canonical illumination, which we call canonical pixels, is linear and can also locate the position of the illumination under non-canonical illumination. Therefore, this paper proposes a chromaticity-line prior (CLP) as an additional linear constraint on the ill-posed problem of color constancy. In the calculation of the CLP, simple linear iterative clustering (SLIC) is first employed to segment an image into several super-pixel blocks. Then random sample consensus (RANSAC) is utilized to remove non-primary color points and fit the chromaticity line. Based on the proposed CLP, a color constancy algorithm is implemented correspondingly. Since the main idea of the CLP is to extract the canonical pixels, which are an inherent property of the image, the proposed CLP is more general and adaptive in real scenes. Experiments on two public datasets demonstrate that the proposed algorithm not only outperforms state-of-the-art learning-free algorithms, but also achieves results that are competitive with those of learning-based algorithms.
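
The consensus-based line fitting step mentioned above can be sketched as follows. This is a generic random sample consensus line fit on 2D points, not the authors' CLP implementation; the function name `ransac_line`, the tolerance, the iteration count and the synthetic points are all assumptions for illustration.

```python
import random
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Line = Tuple[float, float, float]  # (a, b, c) with a*x + b*y + c = 0


def ransac_line(
    points: List[Point], n_iters: int = 200, tol: float = 0.01, seed: int = 0
) -> Tuple[Optional[Line], List[Point]]:
    """Fit a line to 2D points while ignoring outliers.

    Returns the line with the largest consensus set and its inliers.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    best_line: Optional[Line] = None
    best_inliers: List[Point] = []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        # Implicit line through the two sampled points, unit normal (a, b).
        a, b = y2 - y1, x1 - x2
        norm = (a * a + b * b) ** 0.5
        if norm == 0:
            continue  # the two samples coincide, no line is defined
        a, b = a / norm, b / norm
        c = -(a * x1 + b * y1)
        # Points within `tol` of the line form the consensus set.
        inliers = [p for p in points if abs(a * p[0] + b * p[1] + c) <= tol]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = (a, b, c), inliers
    return best_line, best_inliers


if __name__ == "__main__":
    # Synthetic chromaticity-like points on y = 0.5x + 0.1 plus three outliers.
    on_line = [(x / 10, 0.5 * (x / 10) + 0.1) for x in range(10)]
    outliers = [(0.2, 0.9), (0.7, 0.05), (0.5, 0.8)]
    line, inliers = ransac_line(on_line + outliers)
    print(len(inliers), "inliers; line coefficients:", line)
```

In the setting of the abstract, the 2D points would be chromaticity coordinates of pixels within a super-pixel block, and the rejected outliers would correspond to non-primary color points.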

Abstract:
Referring segmentation aims to generate a segmentation mask for the target instance indicated by a natural language expression. There are typically two kinds of existing methods: one-stage methods that directly perform segmentation on the fused vision and language features; and two-stage methods that first utilize an instance segmentation model for instance proposal and then select one of these instances by matching them with language features. In this work, we propose a novel framework that simultaneously detects the target-of-interest via feature propagation and generates a fine-grained segmentation mask. In our framework, each instance is represented by an Instance-Specific Feature (ISF), and the target-of-referring is identified by exchanging information among all ISFs using our proposed Feature Propagation Module (FPM). Our instance-aware approach learns the relationship among all objects, which helps to better locate the target-of-interest than one-stage methods. Compared to two-stage methods, our approach collaboratively and interactively utilizes both vision and language information for synchronous identification and segmentation. In the experimental tests, our method outperforms previous state-of-the-art methods on all three RefCOCO series datasets.

Abstract:
Smoothing images while preserving salient edges is a crucial task in computational photography. Existing edge-preserving filters suffer from various artifacts, such as halos, gradient reversals, and intensity shifts. Observing that these artifacts are strongly related to salient edges with large gradients, we propose a continuous mapping function to process the gradients. The proposed function is literally edge-preserving, i.e., it keeps large gradients intact while attenuating small gradients. We propose an L_1-regularized reconstruction model based on the processed gradients for edge-preserving image filtering. The L_1-regularization facilitates the edge-preserving property in the reconstructed results. To solve the proposed L_1-regularized model, we implement an efficient algorithm based on the alternating direction method of multipliers (ADMM) and Fourier domain optimization. We have conducted qualitative and quantitative experiments to evaluate the proposed filter. The results demonstrate that our filter better handles various artifacts and delivers superior image quality in various applications. The proposed filter is highly efficient: our GPU implementation takes 70 ms to process a 1-megapixel color image on an NVIDIA GTX 1070 GPU.

Abstract:
Successful image classification requires a discriminative representation learning model for images. To approach this idea, deep metric learning (DML), serving as building a basic feature space with a pre-defined metric, has demonstrated compelling performance over the years. DML is often implemented with a carefully crafted loss function, such as the representative triplet loss, which encourages a positive sample to be by a fixed margin closer to the anchor than the negative. Despite its efficacy, the negative samples are treated uniformly, rendering the feature space less informative since different negative samples can be largely different from the anchor. In this work, we, for the first time, propose to exploit the semantic information inherent in discrete class labels as an aid for the triplet loss. Specifically, we build a bi-level negative sampling strategy, i.e., strong negative and weak negative sampling, with the guidance of an external knowledge source, from which rich class semantics can be extracted. With several fine-grained and complementary triplet losses based on this strategy, our method is enhanced with semantic awareness for image classification. In addition, to coordinate with the complicated training dynamics, we devise an ad-hoc Semantic Relation Weighting module, which consistently inspects model states and dynamically adjusts the importance of each triplet loss. It is worth noting that our method is plug-and-play, and we thus test its validity over various backbones and knowledge sources. Both qualitative and quantitative experimental results on benchmark datasets demonstrate the effectiveness of employing semantics for image classification.
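
For reference, the standard margin-based triplet loss that the abstract builds on can be written in a few lines. This sketch shows only the conventional loss with a single fixed margin, not the proposed bi-level negative sampling or the Semantic Relation Weighting module; the function names and the default margin are assumptions.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,
) -> float:
    """Hinge loss that is zero once the positive is closer to the anchor
    than the negative by at least `margin`."""
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    anchor = (0.0, 0.0)
    positive = (0.0, 1.0)
    print(triplet_loss(anchor, positive, negative=(3.0, 4.0)))  # far negative
    print(triplet_loss(anchor, positive, negative=(1.0, 0.0)))  # near negative
```

Because the same margin applies to every negative, a semantically close class and a completely unrelated class are penalized identically, which is the uniform treatment the abstract identifies as a limitation.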

Abstract:
Real-world image super-resolution, a practical image restoration problem that aims to obtain high-quality images from in-the-wild input, has recently received considerable attention owing to its tremendous application potential. Although deep learning-based methods have achieved promising restoration quality on real-world image super-resolution datasets, they ignore the relationship between L1- and perceptual-minimization and roughly adopt auxiliary large-scale datasets for pre-training. In this paper, we discuss the image types within a corrupted image and the properties of perceptual- and Euclidean-based evaluation protocols. Then we propose a method, Real-World image Super-Resolution by Exclusionary Dual-Learning (RWSR-EDL), to address the feature diversity in perceptual- and L1-based cooperative learning. Moreover, a noise-guidance data collection strategy is developed to address the training time consumption in multiple-dataset optimization. When an auxiliary dataset is incorporated, RWSR-EDL achieves promising results and avoids any increase in training time by adopting the noise-guidance data collection strategy. Extensive experiments show that RWSR-EDL achieves competitive performance over state-of-the-art methods on four in-the-wild image super-resolution datasets.

Abstract:
Narrative videos usually illustrate the main content through multiple kinds of narrative information such as audio, video frames and subtitles. Existing video summarization approaches rarely consider the multi-dimensional narrative inputs, or ignore the impact of the artistic assembly of shots when directly applied to narrative videos. This paper introduces a multimodal-based and aesthetic-guided narrative video summarization method. Our method leverages multimodal information including visual content, subtitles and audio information through our specified key shot selection, subtitle summarization, and highlight extraction components. Furthermore, under the guidance of cinematographic aesthetics, we design a novel shot assembly module to ensure the completeness of shot content and then assemble the selected shots into a desired summary. Besides, our method also provides flexible specification for shot selection: it automatically selects semantically related shots according to user-designed text. Through a large number of quantitative experimental evaluations and user studies, we demonstrate that our method effectively preserves important narrative information of the original video, and that it is capable of rapidly producing high-quality and aesthetic-guided narrative video summaries.

Abstract:
Various image enhancement algorithms are adopted to improve underwater images that often suffer from visual distortions. It is critical to assess the output quality of underwater images undergoing enhancement algorithms, and use the results to optimise underwater imaging systems. In our previous study, we created a benchmark for quality assessment of underwater image enhancement via subjective experiments. Building on the benchmark, this paper proposes a new objective metric that can automatically assess the output quality of image enhancement, namely UWEQM. By characterising specific underwater physics and relevant properties of the human visual system, image quality attributes are computed and combined to yield an overall metric. Experimental results show that the proposed UWEQM metric yields good performance in predicting image quality as perceived by human subjects.

Abstract:
Personalized image aesthetics assessment (IAA) aims to estimate aesthetic experiences subject to the preferences of individual users, in contrast to generic IAA, which estimates aesthetic experiences subject to average preferences. Most existing personalized IAA methods treat personalized aesthetic experiences as deviations from a generic aesthetic experience, and therefore, personalized IAA models are designed to build upon prior knowledge of generic IAA. However, we propose that acquiring knowledge of generic IAA is not necessary for building a personalized IAA model. Instead of modeling personalized IAA on the basis of generic IAA, this work proposes to directly estimate personalized aesthetic experiences from the interactions between image contents and user preferences (i.e., preference-content interaction), where interaction-matrices representing preference-content interactions are constructed without the need for prior generic IAA knowledge. To this end, we construct interaction-matrices from content features constructed from pre-trained image classification features and latent preference features. To realize a robust interaction-matrix based personalized IAA model, we discuss in detail different strategies for constructing interaction-matrices and estimating personalized aesthetic scores from the interaction-matrices. Besides the personalized IAA scenario, we further propose strategies to adapt the proposed personalized IAA model to different scenarios of generic IAA. Extensive experiments show that: 1) our method significantly outperforms 5 previous relevant personalized IAA methods on the FLICKR-AES dataset, especially the methods that require generic IAA knowledge as the basis; 2) in terms of generic IAA, the proposed approach also outperforms 13 generic IAA methods on the AVA dataset.

Abstract:
Image distortions are complex and dynamically changing in real-world scenarios, due to the fast development of image processing systems. Blind image quality assessment (BIQA) models may encounter the challenge of processing images with distortion types never seen before deployment. However, existing BIQA models generally cannot evolve with unseen distortion types adaptively, which greatly limits the deployment and application of BIQA models in real-world scenarios. To address this problem, we propose a novel Lifelong blind Image Quality Assessment (LIQA) approach, aiming to achieve the lifelong learning of BIQA. Without access to previous training data, our proposed LIQA can not only learn new knowledge, but also mitigate the catastrophic forgetting of learned knowledge. Specifically, we adopt the Split-and-Merge distillation strategy to train a single-head network that makes task-agnostic predictions. In the split stage, we first employ a distortion-specific generator to generate pseudo features of each previously seen distortion. Then, we utilize an auxiliary multi-head regression network to keep the response of each distortion. In the merge stage, we replay the pseudo features and use the pseudo labels generated by the auxiliary multi-head network to distill the knowledge of the multiple heads, which builds the final single regression head. Extensive experiments demonstrate that LIQA performs well in handling both inner-dataset distortion shift and cross-dataset distortion shift. More importantly, our model can achieve stable performance even if the task sequences are long.

Abstract:
Feature representation learning is a key component in 3D point cloud analysis. However, powerful convolutional neural networks (CNNs) cannot be directly applied due to the irregular structure of point clouds. Therefore, following the tremendous success of transformers in natural language processing and image understanding tasks, in this paper we present a novel point cloud representation learning architecture, named Dual Transformer Network (DTNet), which mainly consists of the Dual Point Cloud Transformer (DPCT) module. Specifically, by aggregating the well-designed point-wise and channel-wise self-attention models simultaneously, the DPCT module can capture much richer contextual dependencies semantically from the perspective of position and channel. With the DPCT module as a fundamental component, we construct the DTNet for performing 3D point cloud analysis in an end-to-end manner. Extensive quantitative and qualitative experiments on publicly available benchmarks demonstrate the effectiveness of our transformer framework for the tasks of 3D point cloud classification, segmentation and visual object affordance understanding, achieving highly competitive performance in comparison with the state-of-the-art approaches.

Abstract:
With increasing demands for high-quality semantic segmentation in industry, hard-to-distinguish semantic boundaries have posed a significant challenge to existing solutions. Inspired by real-life experience, i.e., combining varied observations contributes to higher visual recognition confidence, we present the equipotential learning (EPL) method. This novel module transfers the predicted/ground-truth semantic labels to a self-defined potential domain to learn and infer decision boundaries along customized directions. The conversion to the potential domain is implemented via a lightweight differentiable anisotropic convolution without incurring any parameter overhead. Besides, the two designed loss functions, the point loss and the equipotential line loss, implement anisotropic field regression and category-level contour learning, respectively, enhancing prediction consistency in the inter/intra-class boundary areas. More importantly, EPL is agnostic to network architectures, and thus it can be plugged into most existing segmentation models. This paper is the first attempt to address the boundary segmentation problem with field regression and contour learning. Meaningful performance improvements on PASCAL VOC 2012 and Cityscapes demonstrate that the proposed EPL module can benefit off-the-shelf fully convolutional network models when recognizing semantic boundary areas. Besides, intensive comparisons and analysis show the favorable merits of EPL for distinguishing semantically similar and irregularly shaped categories.

Abstract:
Low bit-width quantization can effectively reduce the storage and computational costs of deep neural networks. Existing quantization methods are commonly designed for single model compression. In multi-model compression scenarios, multiple models for the same or similar tasks need to be compressed simultaneously, for example when compressing image super-resolution models for different scales or transferring different models in multimedia applications. However, single model quantization methods do not consider the correlations among the weights of different models, which limits further compression in such multi-model scenarios. To fully exploit the compression potential of multiple models, we propose a novel quantization scheme for multi-model compression, namely differential weight quantization (DWQ), which focuses on the weight increment between the target model and the reference model. Specifically, DWQ is achieved by increment computation, increment quantization and fine-tuning, which utilizes the reference model to guide the subsequent quantization of the target model. Due to the correlations between the weights of different models, the distribution of the weight increment is more concentrated than that of the original weights, so a higher compression ratio can be achieved by representing the weight increment with fewer bits. Moreover, a progressive training method is proposed to accelerate convergence and reduce quantization loss in the DWQ framework. Extensive experiments validate the effectiveness of DWQ based on weight-sharing and parameterized clipping activation (PACT) quantization technologies on multiple tasks. The proposed framework can achieve a 2× compression improvement and reduce computational complexity by 30% with comparable performance in popular multimedia tasks.
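
The idea of quantizing the weight increment rather than the weights themselves can be sketched with plain uniform symmetric quantization. This is only an illustration under stated assumptions: the abstract does not specify the quantizer, so the uniform scheme, the function names `quantize_increment` and `reconstruct`, and the toy weights below are hypothetical, and the fine-tuning and progressive training steps are omitted.

```python
from typing import List, Sequence, Tuple


def quantize_increment(
    target: Sequence[float], reference: Sequence[float], bits: int = 4
) -> Tuple[List[int], float]:
    """Quantize (target - reference) to signed integer codes.

    Returns the codes and the scale needed to dequantize them.
    """
    deltas = [t - r for t, r in zip(target, reference)]
    levels = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit signed codes
    max_abs = max(abs(d) for d in deltas)
    # The increment range sets the step size; a narrow range gives fine steps.
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [round(d / scale) for d in deltas]
    return codes, scale


def reconstruct(
    reference: Sequence[float], codes: Sequence[int], scale: float
) -> List[float]:
    """Rebuild approximate target weights from the reference and the codes."""
    return [r + c * scale for r, c in zip(reference, codes)]


if __name__ == "__main__":
    reference = [0.50, -0.20, 0.10]  # hypothetical reference model weights
    target = [0.52, -0.23, 0.10]     # hypothetical target model weights
    codes, scale = quantize_increment(target, reference, bits=4)
    approx = reconstruct(reference, codes, scale)
    print(codes, scale, approx)
```

Because the increments span a much narrower range than the raw weights, the same number of bits yields a smaller quantization step, which is the intuition behind the higher compression ratio described in the abstract.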

Abstract:
Effective fusion of different types of features is the key to salient object detection (SOD). The majority of the existing network structure designs are based on the subjective experience of scholars, and the process of feature fusion does not consider the relationship between the fused features and the highest-level features. In this paper, we focus on the feature relationship and propose a novel global attention unit, which we term the “perception-and-regulation” (PR) block, that adaptively regulates the feature fusion process by explicitly modelling the interdependencies between features. The perception part uses the structure of the fully connected layers in the classification networks to learn the size and shape of the objects. The regulation part selectively strengthens and weakens the features to be fused. An imitating eye observation module (IEO) is further employed to improve the global perception capabilities of the network. The imitation of foveal vision and peripheral vision enables the IEO to scrutinize highly detailed objects and to organize a broad spatial scene to better segment objects. Sufficient experiments conducted on the SOD datasets demonstrate that the proposed method performs favourably against the 29 state-of-the-art methods.

Abstract:
Shadow removal is a challenging computer vision and multimedia task that aims to restore image content in shadow regions. The state-of-the-art shadow removal methods introduce artifacts near shadow boundaries or inconsistencies between shadow and nonshadow areas, which can be easily noticed by the human eye at first glance. In this paper, we design a boundary-aware shadow removal network (BA-ShadowNet) that improves shadow removal accuracy by increasing the removal performance at shadow boundaries. In contrast with previously developed methods, which usually consider shadow boundary optimization to be a postprocessing technique, our method performs shadow removal and shadow boundary optimization simultaneously. For this purpose, the proposed BA-ShadowNet is designed as a multiscale encoder-decoder structure, where the decoder consists of a shadow removal branch and a shadow optimization branch. An interaction module is then introduced to fuse and exchange the features of the two branches. This module facilitates the removal branch in perceiving the locations and colors of shadow boundaries. Additionally, it optimizes the boundary branch according to the image context extracted from the removal branch. A three-term loss function is further developed to supervise the shadow removal results and to address the issue of imbalanced supervision between shadow boundary pixels and pixels inside shadows. Extensive experiments conducted on the ISTD+ and SRD datasets demonstrate that the proposed BA-ShadowNet greatly outperforms the state-of-the-art methods with respect to shadow removal.

Abstract:
Many multimodal recommender systems have been proposed to exploit the rich side information associated with users or items (e.g., user reviews and item images) for learning better user and item representations to improve the recommendation performance. Studies from psychology show that users have individual differences in the utilization of various modalities for organizing information. Therefore, for a certain factor of an item (such as appearance or quality), the features of different modalities are of varying importance to a user. However, existing methods ignore the fact that different modalities contribute differently to a user's preference on various factors of an item. In light of this, we propose a novel Disentangled Multimodal Representation Learning (DMRL) recommendation model, which can capture users' attention to different modalities on each factor in user preference modeling. In particular, we employ a disentangled representation technique to ensure that the features of different factors in each modality are independent of each other. A multimodal attention mechanism is then designed to capture users' modality preference for each factor. Based on the weights estimated by the attention mechanism, we make recommendations by combining a user's preference scores for each factor of the target item over different modalities. Extensive evaluations on five real-world datasets demonstrate the superiority of our method compared with existing methods.

Abstract:
Image-based virtual try-on focuses on replacing the model's garment with the target garment while preserving other visual features. To preserve the texture detail of the given in-shop garment, former methods use geometry-based methods (e.g., thin-plate-spline interpolation) to realize garment warping. However, due to their limited degrees of freedom, geometry-based methods perform poorly when garment self-occlusion occurs, which is common in daily life. To address this challenge, we propose a novel occlusion-focused virtual try-on system. Compared to previous ones, our system contains three critical submodules, namely, Garment Part Modeling (GPM), a group of Garment Part Generators (GPGs), and an Overlap Relation Estimator (ORE). GPM takes the pose landmarks as input and progressively models the masks of body parts and garments. Based on these masks, GPGs are introduced to generate each garment part. Finally, ORE is proposed to model the overlap relationships between garment parts, and we bind the generated garments under the guidance of the overlap relationships predicted by ORE. To make the most of the extracted overlap relationships, we propose an IoU-based hard example mining method for loss terms to handle the sparsity of self-occlusion samples in the dataset. Furthermore, we introduce the part affinity field as the pose representation instead of the landmarks widely used by previous methods, and achieve an accuracy improvement in the try-on layout estimation stage. We evaluate our model on the VITON dataset and find that it outperforms previous approaches, especially on samples with garment self-occlusion.

Abstract:
Imagine an interesting situation: when watching a movie, we can scan the screen with our smartphones to get extra information about the movie, such as the cast, the release date, and the movie's homepage. Our prospect is a world where each video contains invisible information that can be delivered to us through mobile devices with cameras. This paper proposes the first deep learning-based information hiding method for videos to achieve information transmission from screens to cameras. Compared with hiding information in single images, methods for videos need to maintain visual quality in both the spatial and temporal domains. Furthermore, training video models relies on a large video dataset, which requires far more computational resources than training models for images. To reduce the computational complexity, we propose to simulate data on-the-fly, generating simulated sequences from single images. Then, we use the simulated data to train a spatio-temporal generator that hides information in videos while maintaining visual quality. During training, a temporal loss function based on the simulated data is exploited to ensure the temporal consistency of the generated videos. After embedding, we use a decoder to recover the hidden information. To simulate the imaging pipeline from screens to cameras in the real world, we insert a distortion network between the generator and the decoder. The distortion network is based on differentiable 3D rendering to cover possible distortions introduced in the camera imaging procedure. Experimental results show that the hidden information in videos can be extracted by cameras without impacting the visual quality. Our work can be applied to many fields, such as advertisement, entertainment, and education.

Abstract:
Deep neural networks (DNNs) have greatly contributed to the performance gains in semantic segmentation. Nevertheless, training DNNs generally requires large amounts of pixel-level labeled data, which is expensive and time-consuming to collect in practice. To mitigate the annotation burden, this paper proposes a self-ensembling generative adversarial network (SE-GAN) exploiting cross-domain data for semantic segmentation. In SE-GAN, a teacher network and a student network constitute a self-ensembling model for generating semantic segmentation maps, which, together with a discriminator, forms a GAN. Despite its simplicity, we find SE-GAN can significantly boost the performance of adversarial training and enhance the stability of the model, the latter of which is a common barrier shared by most adversarial training-based methods. We theoretically analyze SE-GAN and provide an $\mathcal{O}(1/\sqrt{N})$ generalization bound ($N$ is the training sample size), which suggests controlling the discriminator's hypothesis complexity to enhance the generalizability. Accordingly, we choose a simple network as the discriminator. Extensive and systematic experiments in two standard settings demonstrate that the proposed method significantly outperforms current state-of-the-art approaches.

Abstract:
Zero-Shot Learning (ZSL) aims to transfer classification capability from seen to unseen classes. Recent methods have proved that generalization and specialization are two essential abilities to achieve good performance in ZSL. However, focusing on only one of the abilities may result in models that are either too general with degraded classification ability or too specialized to generalize to unseen classes. In this article, we propose an end-to-end network, termed as BGSNet, which equips and balances generalization and specialization abilities at the instance and dataset level. Specifically, BGSNet consists of two branches: the Generalization Network (GNet), which applies episodic meta-learning to learn generalized knowledge, and the Balanced Specialization Network (BSNet), which adopts multiple attentive extractors to extract discriminative features and achieve instance-level balance. A novel self-adjusted diversity loss is designed to optimize BSNet with redundancy reduced and diversity boosted. We further propose a differentiable dataset-level balance and update the weights in a linear annealing schedule to simulate network pruning and thus obtain the optimal structure for BSNet with dataset-level balance achieved. Experiments on four benchmark datasets demonstrate our model's effectiveness. Sufficient component ablations prove the necessity of integrating and balancing generalization and specialization abilities.

Abstract:
The recent advancement in vision-and-language pretraining (VLP) has significantly improved the performance of cross-modal image-text retrieval (ITR) systems. However, the increasing size of VLP models presents a challenge for real-world deployment due to their high latency, making them unsuitable for practical search scenarios. To alleviate this problem, we present a novel plug-in dynamic contrastive distillation (DCD) framework to compress large VLP models for the ITR task. Technically, we face the following two challenges: 1) the typical uni-modal metric learning approach is difficult to apply directly to cross-modal tasks, because limited GPU memory prevents optimizing over many negative samples when handling cross-modal fusion features; 2) it is inefficient to statically optimize the student network on hard samples of differing difficulty, which affects distillation learning and student network optimization. We propose a method for multi-modal contrastive learning that balances training cost and effect. Our approach uses a teacher network to identify hard samples for the student networks to learn from, allowing the students to leverage the knowledge of pre-trained teachers and learn effectively from hard samples. To learn from hard sample pairs, we propose dynamic distillation, which dynamically learns from samples of different difficulties to better balance the difficulty of the knowledge and the students' self-learning ability. We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e., ViLT and METER. Extensive experiments on the MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework. We further provide in-depth analyses and discussions that explain how the performance improves.

Abstract:
With recent advances in image-to-image translation tasks, remarkable progress has been witnessed in generating face images from sketches. However, existing methods frequently fail to generate images with details that are semantically and geometrically consistent with the input sketch, especially when various decoration strokes are drawn. To address this issue, we introduce a novel $\mathcal{W}$-$\mathcal{W}^+$ encoder architecture to take advantage of the high expressive power of the $\mathcal{W}^+$ space and the semantic controllability of the $\mathcal{W}$ space. We introduce an explicit intermediate representation for sketch semantic embedding. With a semantic feature matching loss for effective semantic supervision, our sketch embedding precisely conveys the semantics in the input sketches to the synthesized images. Moreover, a novel sketch semantic interpretation approach is designed to automatically extract semantics from vectorized sketches. We conduct extensive experiments on both synthesized sketches and hand-drawn sketches, and the results demonstrate the superiority of our method over existing approaches in both semantics preservation and generalization ability.

Abstract:
We address the challenging task of human reaction generation, which aims to generate a corresponding reaction based on an input action. Most of the existing works do not focus on generating and predicting the reaction and cannot generate the motion when only the action is given as input. To address this limitation, we propose a novel interaction Transformer (InterFormer) consisting of a Transformer network with both temporal and spatial attention. Specifically, temporal attention captures the temporal dependencies of the motion of both characters and of their interaction, while spatial attention learns the dependencies between the different body parts of each character and those which are part of the interaction. Moreover, we propose using graphs to increase the performance of spatial attention via an interaction distance module that helps focus on nearby joints from both characters. Extensive experiments on the SBU interaction, K3HI, and DuetDance datasets demonstrate the effectiveness of InterFormer. Our method is general and can be used to generate more complex and long-term interactions.

Abstract:
Shadow generation aims to generate a plausible shadow for the inserted foreground object in a composite image. Besides the composite image and the associated mask of the inserted foreground object, existing methods also require a mask of all background objects as well as their shadows as an auxiliary input, which is laborious to obtain in practical applications. Meanwhile, most existing methods use a linear illumination transformation to darken the shadow region, which is prone to producing unrealistic shadows, especially when the background illumination is complex. To address these problems, this paper proposes an automatic shadow generation method, which avoids the laborious acquisition of the background object masks while harmonizing the shadow region to achieve plausible shadow effects. Specifically, to implicitly exploit background illumination to infer the shadow shape of the inserted foreground object, we first propose a Hierarchy Attention U-Net (HAU-Net) to sequentially build global interactions between the foreground object and the background across spatial and channel dimensions. Because shadows are spatially variant, we formulate shadow harmonization as an exposure fusion problem and propose an Illumination-Aware Fusion Network (IFNet), which uses an improved illumination model with a double linear transformation to produce multiple under-exposure images of the shadow region. IFNet then learns pixel-wise fusion kernels that consider the local smoothness of the shadow to fuse the composite image with these under-exposure images and generate a realistic shadow of the foreground object. Extensive experiments on the DESOBA and Shadow-AR datasets demonstrate that our method achieves state-of-the-art performance for shadow generation on both the BOS and BOS-free test images.

Abstract:
Change captioning aims to describe the semantic change between a pair of similar images in natural language. It is more challenging than general image captioning, because it requires capturing fine-grained change information while being immune to irrelevant viewpoint changes, and resolving syntax ambiguity in change descriptions. In this paper, we propose a neighborhood contrastive transformer to improve the model's ability to perceive various changes under different scenes and its ability to understand complex syntax structure. Concretely, we first design a neighboring feature aggregating to integrate neighboring context into each feature, which helps quickly locate inconspicuous changes under the guidance of conspicuous referents. Then, we devise a common feature distilling to compare two images at the neighborhood level and extract common properties from each image, so as to learn effective contrastive information between them. Finally, we introduce explicit dependencies between words to calibrate the transformer decoder, which helps it better understand complex syntax structure during training. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on three public datasets with different change scenarios.

Abstract:
Accurate trajectory prediction of surrounding agents is important for building an intelligent transportation system. Frequent interactions among agents have a major impact on their movement patterns. Current research mainly relies on the agents' spatial structure in the last observed frame to model social interactions, while paying less attention to structure information from previous moments. In addition, existing methods merely consider temporal features of a single trajectory sequence, while neglecting temporal dependencies across multiple trajectories. In this work, we endeavor to comprehensively capture social interactions among agents with the proposed Spatio-Temporal Sequence Fusion Network (STSF-Net). Specifically, we construct a spatio-temporal sequence that encodes contextual information, explicitly taking the spatial distributions of agents during movement into account while capturing social temporal dependencies across multiple trajectory sequences. In addition, a social recurrent mechanism is introduced to explicitly capture temporal correlations between interactions by considering the spatial structure at each time step. Finally, our model is evaluated on datasets covering pedestrian, vehicle, and heterogeneous multi-agent trajectories. Experimental results show that our method achieves excellent performance.

Abstract:
Graph convolutional neural network (GCN) has effectively boosted the multi-label image recognition task by modeling correlation among labels. In previous methods, label correlation is computed based on statistical information through label diffusion, and therefore the same for all samples. This, however, makes graph inference on labels insufficient to handle huge variations among numerous image instances. In this paper, we propose an instance-aware graph convolutional neural network (IA_GCN) framework for the multi-label classification. As a whole, two fused branches of sub-networks are involved in the framework: a global branch modeling the whole image and a local branch exploring dependencies among regions of interests (ROIs). For both the branches, an image-dependent label correlation matrix (ID_LCM), fusing both the statistical label correlation matrix (LCM) and an individual one of each image instance, is constructed to inject adaptive information of label-awareness into the learned features of the model through graph convolution. Specifically, the individual LCM of each image is obtained by mining the label dependencies based on the predicted label scores of those detected ROIs. In this process, considering the contribution differences of ROIs to multi-label classification, variational inference is introduced to learn adaptive scaling factors for those ROIs by considering their complex distribution. Finally, extensive experiments on MS-COCO and VOC datasets show that our proposed approach outperforms existing state-of-the-art methods.

Abstract:
Neural network quantization has been shown to be an effective approach to network compression and acceleration. However, existing binary or ternary quantization methods suffer from two major issues. First, low bit-width input/activation quantization easily results in severe degradation of prediction accuracy. Second, network training and quantization are always treated as two unrelated tasks, leading to accumulated parameter training error and quantization error. In this work, we introduce a novel scheme, named Residual Quantization, to train a neural network with both weights and inputs constrained to low bit-width, e.g., binary or ternary values. On the one hand, by recursively performing residual quantization, the resulting binary/ternary network is guaranteed to approximate the full-precision network with much smaller errors. On the other hand, we mathematically reformulate the network training scheme in an EM-like manner, which iteratively performs network quantization and parameter optimization. During expectation, the low bit-width network is encouraged to approximate the full-precision network. During maximization, the low bit-width network is further tuned to gain better representation capability. Extensive experiments demonstrate that the proposed quantization scheme outperforms previous low bit-width methods and achieves much closer performance to the full-precision counterpart.
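
The idea of recursively quantizing the residual can be sketched as follows. This is the classic greedy binary residual approximation (each stage stores a scale and a sign pattern for whatever the previous stages failed to capture); it is offered as an assumption about what "recursively performing residual quantization" looks like in its simplest form, not as the paper's exact formulation. The function names and toy weights are hypothetical, and the EM-like training loop is not shown.

```python
from typing import List, Tuple


def sign(x: float) -> int:
    """Binary code in {-1, +1}; zero is mapped to +1 by convention."""
    return 1 if x >= 0 else -1


def residual_binarize(weights: List[float], stages: int) -> List[Tuple[float, List[int]]]:
    """Greedy residual binarization: each stage binarizes the remaining residual.

    Stage i stores a scale alpha_i (mean absolute residual) and codes b_i in {-1, +1},
    so that weights are approximated by sum_i alpha_i * b_i.
    """
    residual = list(weights)
    terms: List[Tuple[float, List[int]]] = []
    for _ in range(stages):
        alpha = sum(abs(r) for r in residual) / len(residual)
        codes = [sign(r) for r in residual]
        terms.append((alpha, codes))
        # Subtract this stage's contribution; the next stage quantizes what is left.
        residual = [r - alpha * c for r, c in zip(residual, codes)]
    return terms


def reconstruct(terms: List[Tuple[float, List[int]]], n: int) -> List[float]:
    """Sum the scaled binary terms back into a real-valued approximation."""
    out = [0.0] * n
    for alpha, codes in terms:
        out = [o + alpha * c for o, c in zip(out, codes)]
    return out


def squared_error(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


weights = [0.9, -0.4, 0.1, -0.7]  # toy full-precision weights (hypothetical)
errors = [
    squared_error(weights, reconstruct(residual_binarize(weights, k), len(weights)))
    for k in (1, 2, 3)
]
print(errors)  # the approximation error shrinks as stages are added
```

With the scale chosen as the mean absolute residual, each stage reduces the squared error by n·alpha², so the error cannot increase when another stage is added.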

Abstract:
Stitching images with parallax so that the result looks natural remains a challenging problem. This paper proposes an image stitching method which preserves the flatness of planes in the scene for a natural look. Our method formulates the alignment of images in terms of the camera parameters and the normal vectors of planes. Given a set of feature point matches, a process of grouping points into different layers and rejecting outliers is introduced. According to the epipolar constraint on corresponding points in two images, the focal length and the pose change of the camera are recovered simultaneously. Then, the normal vectors are estimated from the point pairs. To achieve good alignment and guide the warping of images, the model is combined with mesh deformation as a global similarity constraint. In addition, bundle adjustment is adopted to maintain consistency when stitching multiple images. Experiments show that the proposed approach outperforms some state-of-the-art warps on real-world scenes.
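
For reference, the epipolar constraint mentioned above, and the way a plane normal enters two-view alignment, are commonly written as follows. The notation is an assumption introduced here and does not come from the paper: $\mathbf{x}$ and $\mathbf{x}'$ are homogeneous pixel coordinates of a matched point pair, $\mathbf{K}$ and $\mathbf{K}'$ are the intrinsic matrices (which contain the focal lengths), $(\mathbf{R}, \mathbf{t})$ is the relative pose with $\mathbf{X}' = \mathbf{R}\mathbf{X} + \mathbf{t}$, and the plane satisfies $\mathbf{n}^{\top}\mathbf{X} = d$ in the first camera frame. Whether the paper uses exactly this parameterization is not stated in the abstract.

```latex
% Epipolar constraint for a matched pair (x, x'):
\mathbf{x}'^{\top} \mathbf{F}\, \mathbf{x} = 0,
\qquad
\mathbf{F} = \mathbf{K}'^{-\top} [\mathbf{t}]_{\times} \mathbf{R}\, \mathbf{K}^{-1}

% Homography induced by a plane n^T X = d (first camera frame):
\mathbf{x}' \sim \mathbf{H}\, \mathbf{x},
\qquad
\mathbf{H} = \mathbf{K}' \left( \mathbf{R} + \frac{\mathbf{t}\, \mathbf{n}^{\top}}{d} \right) \mathbf{K}^{-1}
```

The second relation follows by substituting $\mathbf{n}^{\top}\mathbf{X}/d = 1$ into $\mathbf{X}' = \mathbf{R}\mathbf{X} + \mathbf{t}$; other sign conventions for the plane equation flip the sign of the $\mathbf{t}\,\mathbf{n}^{\top}$ term.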

Abstract:
Co-saliency detection focuses on detecting common and salient objects among a group of images. With the application of deep learning in co-saliency detection, more accurate and more effective models are proposed in an end-to-end manner. However, two major drawbacks in these models hinder the further performance improvement of co-saliency detection: 1) the static manner-based inference, and 2) the constant quantity of input images. To address these limitations, we present a novel Adaptive Group-wise Consistency Network (AGCNet) with the ability of content-adaptive adjustment for a given image group with random quantity of images. In AGCNet, we first introduce intra-saliency priors generated from any off-the-shelf salient object detection model. Then, an Adaptive Group-wise Consistency (AGC) module is proposed to capture group consistency for each individual image, and is applied on three-scale features to capture the group consistency from different perspectives. This module is composed of two key components, where the content-adaptive group consistency block breaks the above limitations to adaptively capture the global group consistency with the assistance of intra-saliency priors and the ranking-based fusion block combines the consistency with individual attributes of each image feature to generate discriminative group consistency feature for each image. Following AGC modules, a specially designed Aggregated Decoder aggregates the three-scale group consistency features to adapt to co-salient objects with diverse scales for preliminary detection. Finally, we incorporate two normal decoders to progressively refine the preliminary detection and generate the final co-saliency maps. Extensive experiments on four benchmark datasets demonstrate that our AGCNet achieves competitive performance as compared with 19 state-of-the-art models, and the proposed modules experimentally show substantial practical merits.

Abstract:
Deep network-based image Compressed Sensing (CS) has attracted much attention in recent years. However, existing deep network-based CS schemes either reconstruct the target image in a block-by-block manner, which leads to serious block artifacts, or train the deep network as a black box, which offers limited insight into image prior knowledge. In this paper, a novel image CS framework using a non-local neural network (NL-CSNet) is proposed, which combines non-local self-similarity priors with a deep network to improve the reconstruction quality. In the proposed NL-CSNet, two non-local subnetworks are constructed to utilize the non-local self-similarity priors in the measurement domain and the multi-scale feature domain, respectively. Specifically, in the measurement-domain subnetwork, long-distance dependencies between the measurements of different image blocks are established for better initial reconstruction. Analogously, in the multi-scale feature-domain subnetwork, the affinities between the dense feature representations are explored in the multi-scale space for deep reconstruction. Furthermore, a novel loss function is developed to enhance the coupling between the non-local representations, which also enables end-to-end training of NL-CSNet. Extensive experiments demonstrate that NL-CSNet outperforms existing state-of-the-art CS methods while maintaining fast computational speed.

Abstract:
Outfit compatibility modeling, which aims to automatically evaluate the matching degree of an outfit, has drawn great research attention. For a comprehensive evaluation, several previous studies have attempted to solve the task of outfit compatibility modeling by integrating the multi-modal information of fashion items. However, these methods primarily focus on fusing the visual and textual modalities, and seldom consider the category modality as an essential modality. In addition, they mainly explore the intra-modal compatibility relation among fashion items in an outfit but ignore the importance of the inter-modal compatibility relation, i.e., the compatibility across different modalities between fashion items. Since each modality of an item can convey the same characteristics of the item as other modalities, as well as certain exclusive features of the item, overlooking the inter-modal compatibility could yield sub-optimal performance. To address these issues, a multi-modal outfit compatibility modeling scheme with modality-oriented graph learning, dubbed MOCM-MGL, is proposed, which takes the visual, textual, and category modalities as input and jointly propagates the intra-modal and inter-modal compatibilities among fashion items. Experimental results on the real-world Polyvore Outfits-ND and Polyvore Outfits-D datasets demonstrate the superiority of our proposed model over existing methods.

Abstract:
Identifying persuasive speakers in an adversarial environment is a critical task. In a national election, politicians would like to have persuasive speakers campaign on their behalf. When a company faces adverse publicity, they would like to engage persuasive advocates for their position in the presence of adversaries who are critical of them. Debates represent a common platform for these forms of adversarial persuasion. This paper solves two problems: the Debate Outcome Prediction (DOP) problem predicts who wins a debate while the Intensity of Persuasion Prediction (IPP) problem predicts the change in the number of votes before and after a speaker speaks. Though DOP has been previously studied, we are the first to study IPP. Past studies on DOP fail to leverage two important aspects of multimodal data: 1) multiple modalities are often semantically aligned, and 2) different modalities may provide diverse information for prediction. Our $\mathsf{M2P2}$ (Multimodal Persuasion Prediction) framework is the first to use multimodal (acoustic, visual, language) data to solve the IPP problem. To leverage the alignment of different modalities while maintaining the diversity of the cues they provide, $\mathsf{M2P2}$ devises a novel adaptive fusion learning framework which fuses embeddings obtained from two modules: an alignment module that extracts shared information between modalities and a heterogeneity module that learns the weights of different modalities with guidance from three separately trained unimodal reference models. We test $\mathsf{M2P2}$ on the popular IQ2US dataset designed for DOP. We also introduce a new dataset called QPS (from Qipashuo, a popular Chinese debate TV show) for IPP. $\mathsf{M2P2}$ significantly outperforms 4 recent baselines on both datasets.

Abstract:
Graph-based multi-view clustering, which aims to obtain a partition of data across multiple views, has received considerable attention in recent years. Although great efforts have been made for graph-based multi-view clustering, it is still challenging to fuse characteristics from various views to learn a common representation for clustering. In this paper, we propose a novel Consistent Multiple Graph Embedding Clustering framework (CMGEC). Specifically, a multiple graph auto-encoder (M-GAE) is designed to flexibly encode the complementary information of multi-view data using a multi-graph attention fusion encoder. To guide the learned common representation to maintain the similarity of the neighboring characteristics in each view, a Multi-view Mutual Information Maximization module (MMIM) is introduced. Furthermore, a graph fusion network (GFN) is devised to explore the relationship among graphs from different views and provide the common consensus graph needed in M-GAE. By jointly training these models, the common representation can be obtained, which encodes more complementary information from multiple views and depicts data more comprehensively. Experiments on three types of multi-view datasets demonstrate that CMGEC outperforms state-of-the-art clustering methods.

Abstract:
Supervised deep learning depends on massive accurately annotated examples, which is usually impractical in many real-world scenarios. A typical alternative is learning from multiple noisy annotators. Numerous earlier works assume that all labels are noisy, while it is usually the case that a few trusted samples with clean labels are available. This raises the following important question: how can we effectively use a small amount of trusted data to facilitate robust classifier learning from multiple annotators? This paper proposes a data-efficient approach, called Trustable Co-label Learning (TCL), to learn deep classifiers from multiple noisy annotators when a small set of trusted data is available. This approach follows the coupled-view learning manner, which jointly learns the data classifier and the label aggregator. It effectively uses trusted data as a guide to generate trustable soft labels (termed co-labels). A co-label learning can then be performed by alternately reannotating the pseudo labels and refining the classifiers. In addition, we further improve TCL for a special complete data case, where each instance is labeled by all annotators and the label aggregator is represented by multilayer neural networks to enhance model capacity. Extensive experiments on synthetic and real datasets clearly demonstrate the effectiveness and robustness of the proposed approach. Source code is available at https://github.com/ShikunLi/TCL.

Abstract:
This paper studies the blind image restoration where the ground truth is unavailable and the downsampling process is unknown. This complicated setting makes supervised learning and accurate kernel estimation impossible. Inspired by the recent success of image-to-image translation, this paper resorts to the unsupervised Cycle-consistent based framework to tackle this challenging problem. Different from the image-to-image task, the fidelity of reconstructed image is important for image restoration. Therefore, to improve the reconstruction ability of the Cycle-consistent network, we make explorations from the following aspects. First, we constrain low-frequency content in data to preserve the content of output from LR input. Second, we impose constraint on the content of training data to provide better supervision for discriminator, helping to suppress high-frequency artifacts or fake textures. Third, we average model parameters to further improve the generated image quality and help with model selection for GAN-based methods. Since GAN-based methods tend to produce various artifacts with different models, model average could realize a smoother control of balancing artifacts and fidelity. We have conducted extensive experiments on real noise and super resolution datasets to validate the effectiveness of the above techniques. The proposed ECycleGAN also demonstrates superior performance to SOTA methods in two applications – blind SR and blind denoising.
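
The third technique above, averaging model parameters, can be illustrated with a minimal sketch. The abstract does not say whether the average is uniform or exponentially weighted, so the uniform average over saved checkpoints below is an assumption, and the function name and the dict-of-lists checkpoint layout are hypothetical.

```python
def average_checkpoints(checkpoints):
    """Element-wise uniform average of parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats (a hypothetical layout chosen for illustration only).
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter from every checkpoint.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / len(checkpoints) for col in columns]
    return averaged


ckpt_a = {"conv.weight": [1.0, 2.0], "conv.bias": [0.0]}
ckpt_b = {"conv.weight": [3.0, 4.0], "conv.bias": [1.0]}
avg = average_checkpoints([ckpt_a, ckpt_b])
```

Averaging over more checkpoints moves the result further from any single model, which is one plausible reading of the "smoother control" between artifacts and fidelity mentioned above.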

Abstract:
Generative Adversarial Networks (GANs) are a popular machine learning method with powerful image generation ability, which is useful for different multimedia applications (e.g., photographic filters, image editing). However, typical GAN models have a large memory footprint that limits their practical applications on resource-constrained devices (e.g., smartphones). To deploy GAN models on devices with various hardware constraints, we propose our method, AdjustableGAN, which can compress a pretrained GAN model to different compression ratios. Our method compresses a GAN by performing filter-wise pruning that follows these objectives: (1) deactivate convolutional filters for minimal performance decrease, (2) reactivate convolutional filters for maximal performance increase. We implement multiple Genetic Algorithms (GA) to perform each of these objectives: Downsize GA searches for the best filter deactivations, while Upsize GA searches for the best filter reactivations. By selectively using Upsize/Downsize GA, we can explicitly control the compression ratio of the model. Finally, we fine-tune the compressed output model using the training dataset of the original input model. Our experimental results show that our method can reliably compress generative networks with minimal accuracy drop compared to other state-of-the-art compression algorithms.

Abstract:
3D convolutional neural networks have achieved promising results for video tasks in computer vision, including video saliency prediction that is explored in this paper. However, 3D convolution encodes visual representation merely on fixed local spacetime according to its kernel size, while human attention is always attracted by relational visual features at different time. To overcome this limitation, we propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction, in which multiple Spatio-Temporal Self-Attention (STSA) modules are employed at different levels of 3D convolutional backbone to directly capture long-range relations between spatio-temporal features of different time steps. Besides, we propose an Attentional Multi-Scale Fusion (AMSF) module to integrate multi-level features with the perception of context in semantic and spatio-temporal subspaces. Extensive experiments demonstrate the contributions of key components of our method, and the results on DHF1K, Hollywood-2, UCF, and DIEM benchmark datasets clearly prove the superiority of the proposed model compared with all state-of-the-art models.

Abstract:
Crowdfunding creates opportunities for entrepreneurs. It allows startup companies to reach a large audience for fundraising and bring their creative ideas to life. In this work, we are concerned with the crowdfunding project success prediction problem, i.e., predicting whether a project will successfully reach its funding goal by using its project profiles. This is important for startup companies to refine their project profiles and achieve their goals. Crowdfunding project success prediction is a typical classification problem but with a few critical challenges. On the one hand, with only coarse-grained project status as weak supervision, it is hard for a deep learning network to learn the relationship between project profiles and explain why it makes this prediction. On the other hand, on the project homepage, there are various modalities of description, including metadata, textual description, images, and videos. Among those, videos play an important role in the success of a crowdfunding project, but they were ignored in previous works due to the difficulty in extracting useful semantic and authentic information from videos, especially for crowdfunding projects, where information in different modalities is unaligned. To this end, we propose a novel framework called Deep Cross-Attention Network to learn and fuse information from introduction videos and textual descriptions of project profiles. More specifically, we develop a cross-attention block to align and represent mismatched textual descriptions and untrimmed introduction videos and fuse the information from these two modalities, which effectively remedies the lack of supervised information caused by project status as weak supervision. More importantly, with our cross-attention mechanism, the model is able to interpret how it makes such predictions and show which keywords and keyframes it depends on. We conduct extensive experiments on two crowdfunding datasets (collected from Kickstarter and Indiegogo) and show that our method achieves superior performance over existing state-of-the-art baselines.

Abstract:
Current approaches for human pose estimation in videos can be categorized into per-frame and warping-based methods. Both approaches have their pros and cons. For example, per-frame methods are generally more accurate, but they are often slow. Warping-based approaches are more efficient, but the performance is usually not good. To bridge the gap, in this paper, we propose a novel fast framework for human pose estimation to meet the real-time inference with controllable accuracy degradation in compressed video domain. Our approach takes advantage of the motion representation (called “motion vector”) that is readily available in a compressed video. Pose joints in a frame are obtained by directly warping the pose joints from the previous frame using the motion vectors. We also propose modules to correct possible errors introduced by the pose warping when needed. Extensive experimental results demonstrate the effectiveness of our proposed framework for accelerating the speed of top-down human pose estimation in videos.
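
The warping step described above can be sketched as follows. This is only an illustration under stated assumptions: the motion vectors are assumed to be given as one (dx, dy) displacement per fixed-size block, already expressed as the forward displacement from the previous frame to the current one (real codecs often store the opposite direction), and the function name and block size are hypothetical.

```python
def warp_joints(joints, motion_vectors, block_size=16):
    """Shift each joint by the motion vector of the block that contains it.

    joints: list of (x, y) pixel coordinates from the previous frame.
    motion_vectors: grid indexed as [row][col], each entry a (dx, dy)
        displacement for one block_size x block_size block (assumed layout).
    """
    rows, cols = len(motion_vectors), len(motion_vectors[0])
    warped = []
    for x, y in joints:
        # Clamp the block index so joints near the border stay in range.
        col = min(max(int(x) // block_size, 0), cols - 1)
        row = min(max(int(y) // block_size, 0), rows - 1)
        dx, dy = motion_vectors[row][col]
        warped.append((x + dx, y + dy))
    return warped


mv_grid = [
    [(1.0, -2.0), (0.0, 0.0)],
    [(0.0, 0.0), (-3.0, 4.0)],
]
new_joints = warp_joints([(5.0, 5.0), (20.0, 30.0)], mv_grid)
```

Because the warp is a table lookup and an addition per joint, it avoids running a pose network on every frame; the correction modules mentioned above are not modeled here.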

Abstract:
Zero-shot learning (ZSL) aims to transfer knowledge from seen classes to semantically related unseen classes, which are absent during training. The promising strategies for ZSL are to synthesize visual features of unseen classes conditioned on semantic side information and to incorporate meta-learning to eliminate the model’s inherent bias towards seen classes. While existing meta generative approaches pursue a common model shared across task distributions, we aim to construct a generative network adaptive to task characteristics. To this end, we propose an Attribute-Modulated generAtive meta-model for Zero-shot learning (AMAZ). Our model consists of an attribute-aware modulation network, an attribute-augmented generative network, and an attribute-weighted classifier. Given unseen classes, the modulation network adaptively modulates the generator by applying task-specific transformations so that the generative network can adapt to highly diverse tasks. The weighted classifier utilizes the data quality to enhance the training procedure, further improving the model performance. Our empirical evaluations on four widely-used benchmarks show that AMAZ outperforms state-of-the-art methods by 3.8% and 3.1% in ZSL and generalized ZSL settings, respectively, demonstrating the superiority of our method. Our experiments on a zero-shot image retrieval task show AMAZ’s ability to synthesize instances that portray real visual characteristics.

Abstract:
Zero-shot learning (ZSL) aims to recognize unknown categories that are unavailable during training. Recently, generative models have shown the potential to address this challenging problem by synthesizing unseen features conditioned on semantic embeddings such as attributes. However, unidirectional generative models cannot guarantee the effective coupling between visual and semantic spaces. To this end, we propose a visual-semantic aligned bidirectional network with cycle consistency to alleviate the gap between these two spaces, generating unseen features of high quality. More importantly, we incorporate two carefully designed strategies into our bidirectional framework to improve the overall ZSL performance. Specifically, we enhance the intra-domain class divergence in both visual and semantic spaces, and in the meantime, mitigate the inter-domain shift to preserve seen-unseen domain discrimination. Experimental results on four standard benchmarks show the superiority of our framework over existing state-of-the-art methods under both conventional and generalized ZSL settings.

Abstract:
Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a fully-labeled source domain to a different unlabeled target domain. Most existing UDA methods learn domain-invariant feature representations by minimizing feature distances across domains. In this work, we build upon contrastive self-supervised learning to align features so as to reduce the domain discrepancy between training and testing sets. Exploring the same set of categories shared by both domains, we introduce a simple yet effective framework CDCL, for domain alignment. In particular, given an anchor image from one domain, we minimize its distances to cross-domain samples from the same class relative to those from different categories. Since target labels are unavailable, we use a clustering-based approach with carefully initialized centers to produce pseudo labels. In addition, we demonstrate that CDCL is a general framework and can be adapted to the data-free setting, where the source data are unavailable during training, with minimal modification. We conduct experiments on two widely used domain adaptation benchmarks, i.e., Office-31 and VisDA-2017, for image classification tasks, and demonstrate that CDCL achieves state-of-the-art performance on both datasets.

Abstract:
This paper investigates an emerging and challenging task: emotional video captioning. Formally, given a video, the task aims to not only describe the factual content of the video, but also discover the emotional clues in the video. We propose a novel Contextual Attention Network (CANet), which recognizes and describes the fact and emotion in the video by semantic-rich context learning. To be specific, at each time step, we first extract visual and textual features from both input video and previously generated words. Then, we apply the attention mechanism to these features to capture informative contexts for captioning. We train the CANet model with the joint optimization of cross-entropy loss $\mathcal{L}_{CE}$ and contrastive loss $\mathcal{L}_{CL}$, where $\mathcal{L}_{CE}$ constrains the semantics of the generated sentence to be close to human annotation and $\mathcal{L}_{CL}$ encourages discriminative representation learning from positive and negative pairs of video and caption. Experiments on two emotional video captioning datasets (i.e., EmVidCap and EmVidCap-S) demonstrate the superiority of CANet compared to the state-of-the-art approaches.
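
A joint objective of this kind can be sketched as a weighted sum of the two terms. The abstract gives neither the weighting nor the exact form of the contrastive loss, so the InfoNCE-style formulation, the temperature, and the `weight` parameter below are assumptions made for illustration.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-likelihood of the annotated word under the predicted distribution."""
    return -math.log(probs[target_index])


def contrastive_loss(pos_sim, neg_sims, temperature=0.1):
    """InfoNCE-style loss: low when the positive video-caption pair scores highest."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


def joint_loss(probs, target_index, pos_sim, neg_sims, weight=1.0):
    """Cross-entropy term plus a weighted contrastive term (weight is hypothetical)."""
    return cross_entropy(probs, target_index) + weight * contrastive_loss(pos_sim, neg_sims)


loss = joint_loss([0.5, 0.5], 0, pos_sim=0.3, neg_sims=[0.3])
```

In this toy call both terms equal log 2, since the word distribution is uniform over two words and the positive pair is no more similar than the single negative.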

Abstract:
Product detection, which aims to localize products of interest in the advertising images, helps advance many potential E-commerce applications like product retrieval and recommendation. However, labeling a massive number of fine-grained product categories and accurate product boxes is costly and especially not practical since products are ever-changing on E-commerce websites. In this work, we step forward to train a fine-grained product detector solely supervised by the advertising captions, which are naturally available but often severely flawed and noisy. To reformulate the weakly supervised detection research into a real-world setting, we introduce a large-scale benchmark, named CapProduct, where more than 80,000 product image-caption pairs are collected from E-commerce websites. The fine-grained nature of products and noisy captions in CapProduct make it intractable to excavate valid category labels to train a weakly supervised object detector. To tackle this challenge, we propose a Collaborative Pseudo-Label Harmonization (CoPLH) framework that harmonizes self-mined pseudo labels via modeling the global co-occurrence relationships of products. We construct a collaborative co-occurrence graph based on all training samples to improve the reliability of caption-predicted pseudo-labels as well as benefit the self-training procedure in a weakly supervised setting. Extensive experiments on the CapProduct dataset demonstrate the effectiveness and the superiority of the proposed CoPLH over the state-of-the-art baselines.

Abstract:
A number of image compressive sensing (CS) algorithms have been proposed in the past two decades, aiming at yielding recovered images with the best possible visual effect. However, it is quite difficult to further improve the image quality for human eyes. For example, in low-rate sampling scenarios, CS algorithms always suffer degraded performance and can only recover less visually appealing images. We notice that what human beings are concerned with is the visual quality of an image, while machine users care much more about its latent metrics, such as recognition accuracy, rather than the subjective visual effect. Inspired by this point, we develop a machine recognition-oriented image CS with an adversarial learning strategy. Some adversarial models are investigated to make recognition accuracy an additional optimization goal of the CS reconstruction network. Through end-to-end training, the CS reconstruction network automatically learns an image recognition pattern and produces recovered images that carry an extra recognition metric, which makes them better suited for machine users. Experimental results indicate that the images recovered with the proposed adversarial learning strategy can be recognized with significantly higher accuracy than those recovered with the existing CS algorithms.

Abstract:
In this paper, we propose a deep learning based sensor-driven method for online video stabilization. This method utilizes the Euler angles and acceleration values estimated from the gyroscope and accelerometer to assist stable video reconstruction. We introduce two simple sub-networks for trajectory optimization. The first network exploits real unstable trajectories and camera acceleration values to detect shooting scenarios. This network also generates an attention mask to adaptively choose scenario-specific features. Then the second network predicts smooth camera paths based on real unstable trajectories using long short-term memory (LSTM) under the supervision of the above mask. The output of the trajectory optimization network is filtered with a two-step modification process to guarantee smoothness. The real and smoothed camera paths are then utilized as guidance to generate stable frames in a projective manner. We also capture videos with sensor data covering seven typical shooting scenarios and design a ground truth generation method to construct pseudo-labels. Moreover, the trajectory smoothing network allows the use of 3- or 10-frame buffers as future information to construct a lookahead filter. Experimental results show that our online method could outperform other state-of-the-art offline methods in several shaky video clips with fewer buffer frames for both general and low-quality videos. Furthermore, our method could effectively reduce running times without performing image content analysis, and the stabilization efficiency reaches 25 fps on 1080p videos.

Abstract:
Object detection methods based on Convolutional Neural Networks (CNN) usually utilize feature pyramid networks to detect objects with various scales. The state-of-the-art feature pyramid networks improve detection accuracy by enhancing multi-level feature representations. Fusing multi-level features is the most effective manner to enhance the feature representations. However, the existing feature pyramid networks usually fuse multi-level features by element-wise operations, which leads to a lack of long-range dependencies in the feature fusion. To address the problem, we propose a simple yet efficient feature pyramid network named latent feature pyramid network (LFPN). LFPN can enhance the feature representations by modeling inner-scale and cross-scale long-range dependencies through conducting inner-scale and cross-scale feature fusion in the latent space. Comprehensive experiments are performed on two challenging object detection datasets: MS COCO and Pascal VOC. The experimental results show consistent improvements on various feature pyramid networks, backbones, and object detectors, which demonstrates the effectiveness and generality of our LFPN.

Abstract:
In generative adversarial network (GAN) based zero-shot learning (ZSL) approaches, the synthesized unseen visual features are inevitably prone to seen classes since the feature generator is merely trained on seen references, which causes the inconsistency between visual features and their corresponding semantic attributes. This visual-semantic inconsistency is primarily induced by the non-preserved semantic-relevant components and the non-rectified semantic-irrelevant low-level visual details. Existing generative models generally tackle the issue by aligning the distribution of the two modalities with an additional visual-to-semantic embedding, which tends to cause the hubness problem and ruin the diversity of visual modality. In this paper, we propose a novel generative model named learning modality-consistent latent representations GAN (LCR-GAN) to address the problem via embedding the visual features and their semantic attributes into a shared latent space. Specifically, to preserve the semantic-relevant components, the distributions of the two modalities are aligned by maximizing the mutual information between them. And to rectify the semantic-irrelevant visual details, the mutual information between original visual features and their latent representations is confined within an appropriate range. Meanwhile, the latent representations are decoded back to both modalities to further preserve the semantic-relevant components. Extensive evaluations on four public ZSL benchmarks validate the superiority of our method over other state-of-the-art methods.

Abstract:
LiDAR-based 3D single object tracking is a challenging issue in robotics and autonomous driving. Currently, existing approaches usually suffer from the problem that objects at long distance often have very sparse or partially-occluded point clouds, which makes the features extracted by the model ambiguous. Ambiguous features make it hard to locate the target object and finally lead to bad tracking results. To solve this problem, we utilize the powerful Transformer architecture and propose a Point-Track-Transformer (PTT) module for the point cloud-based 3D single object tracking task. Specifically, the PTT module generates fine-tuned attention features by computing attention weights, which guide the tracker to focus on the important features of the target and improve the tracking ability in complex scenarios. To evaluate our PTT module, we embed PTT into the dominant method and construct a novel 3D SOT tracker named PTT-Net. In PTT-Net, we embed PTT into the voting stage and proposal generation stage, respectively. The PTT module in the voting stage could model the interactions among point patches, which learns context-dependent features. Meanwhile, the PTT module in the proposal generation stage could capture the contextual information between object and background. We evaluate our PTT-Net on KITTI and NuScenes datasets. Experimental results demonstrate the effectiveness of the PTT module and the superiority of PTT-Net, which surpasses the baseline by a noticeable margin, about 10% in the Car category. Meanwhile, our method also has a significant performance improvement in sparse scenarios. In general, the combination of transformer and tracking pipeline enables our PTT-Net to achieve state-of-the-art performance on both datasets. Additionally, PTT-Net could run in real-time at 40 FPS on an NVIDIA 1080Ti GPU. Our code is open-sourced for the research community at https://github.com/shanjiayao/PTT.

Abstract:
Body weight, as one of the biometric traits, has been studied in both the forensic and medical domains. However, estimating weight directly from 2-D images is particularly challenging since visual inspection is rather sensitive to the distance between the subject and camera, even for frontal view images. In this case, the widely used body mass index (BMI), which is associated with body height and weight, can be employed as a measure of weight to indicate health conditions. Previous works on the estimation of BMI have predominantly focused on using multiple 2-D images, 3-D images, or facial images; however, these cues are not always available. To address this issue, we explore the feasibility of obtaining BMI from a single 2-D body image with the dual-branch regression framework proposed in this work. More specifically, the framework comprises an anthropometric feature computation branch and a deep learning-based feature extraction branch. One aggregation layer maps all the features to an estimated BMI value. In addition, a new public 2-D image-to-BMI dataset, which contains 4189 images (1477 males and 2712 females) from approximately 3000 subjects with attributes including gender, age, height, and weight, was collected and released to facilitate the study. Extensive experiments confirm that the proposed framework combining anthropometric features and deep features outperforms the single-type feature approaches to BMI estimation in most cases.
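
For reference, the BMI target that such a regression framework predicts is defined as weight in kilograms divided by the square of height in meters. The short sketch below only computes that label from the height and weight attributes; the function name is hypothetical and nothing here reflects the dual-branch network itself.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


example_bmi = bmi(70.0, 1.75)
```

A subject weighing 70 kg at 1.75 m has a BMI of roughly 22.9, which is independent of how far the subject stands from the camera.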

Abstract:
Recent years have witnessed the popularity of using a two-stream architecture and attention mechanism for action recognition with videos. However, it is time-consuming to train two separate convolutional neural networks (ConvNets), especially with the complex attention mechanism. In this article, we present a novel architecture, termed as Appearance-Motion Fusion Network (AMFNet), to learn efficient and robust action representation from RGB and optical flow data in an end-to-end manner. AMFNet is constructed by connecting a convolutional neural network with an appearance-motion fusion block (AMFB), whose goal is to incorporate appearance and motion streams into a unified framework driven by a cross-modality attention (CMA) mechanism. More specifically, the CMA only relies on optical flow data, which consists of a Key-Frame Adaptive Selection Module (KFASM) and an Optical-Flow-Driven Spatial Attention Module (OFDSAM). The former aims to adaptively identify the discriminative key frames from a sequence, while the latter is able to guide our networks to focus on the action-relevant regions of each frame. We explore two schemes for appearance and motion streams fusion in AMFB from hierarchical and comprehensive levels. The proposed AMFNet is extensively evaluated on five action recognition data sets, including HMDB-51, UCF-101, JHMDB, Penn and Kinetics-400. Compared to the state-of-the-art methods operated at RGB and optical flow, the experimental results validate that our AMFNet achieves a comparable performance with a pure 2D-Single-ConvNet design.

Abstract:
Video unscreen, a technique to extract foreground from given videos, has been playing an important role in today’s video production pipeline. Existing systems developed for this purpose which mainly rely on video segmentation or video matting, either suffer from quality deficiencies or require tedious manual annotations. In this work, we aim to develop a fully automatic video unscreen framework that is able to obtain high-quality foreground extraction without the need of human intervention in a controlled environment. Our framework adopts a coarse-to-fine strategy, where the obtained background estimate given an initial mask prediction in turn helps the refinement of the mask by the alpha composition equation. We conducted experiments on two datasets, 1) the Adobe’s Synthetic-Composite dataset, and 2) DramaStudio, our newly collected large-scale green screen video matting dataset, exhibiting the controlled environments. The results show that the proposed framework outperforms existing algorithms and commercial software, both quantitatively and qualitatively. We also demonstrate its utility in person replacement in videos, which can further support a variety of video editing applications.
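
The alpha composition equation referred to above is commonly written as I = alpha * F + (1 - alpha) * B for an observed pixel I, foreground color F, and background color B. The sketch below shows, for a single pixel, how a background estimate constrains alpha; it is a least-squares illustration with hypothetical function names, not the refinement procedure of the framework, which the abstract does not detail.

```python
def composite(alpha, fg, bg):
    """Alpha composition per color channel: I = alpha * F + (1 - alpha) * B."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


def solve_alpha(pixel, fg, bg):
    """Least-squares alpha for one pixel given foreground and background colors."""
    numerator = sum((p - b) * (f - b) for p, f, b in zip(pixel, fg, bg))
    denominator = sum((f - b) ** 2 for f, b in zip(fg, bg))
    if denominator == 0.0:
        # Foreground equals background, so alpha is undetermined; return 0 by convention.
        return 0.0
    return min(1.0, max(0.0, numerator / denominator))


foreground = (0.8, 0.6, 0.4)
green_screen = (0.0, 1.0, 0.0)
observed = composite(0.25, foreground, green_screen)
recovered_alpha = solve_alpha(observed, foreground, green_screen)
```

The closer the background estimate is to the true screen color, the better conditioned this per-pixel solve becomes, which is consistent with the coarse-to-fine idea of estimating the background first.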

Abstract:
Influencer marketing is emerging as a new marketing method, changing the marketing strategies of brands profoundly. In order to help brands find suitable micro-influencers as marketing partners, the micro-influencer recommendation is regarded as an indispensable part of influencer marketing. However, previous works only focus on modeling the individual image of brands/micro-influencers, which is insufficient to represent the characteristics of brands/micro-influencers over the marketing scenarios. In this case, we propose a micro-influencer ranking joint learning framework which models brands/micro-influencers from the perspective of individual image, target audiences, and cooperation preferences. Specifically, to model accounts’ individual image, we extract topics information and images semantic information from historical content information, and fuse them to learn the account content representation. We introduce target audiences as a new kind of marketing role in the micro-influencer recommendation, in which audiences information of brand/micro-influencer is leveraged to learn the multi-modal account audiences representation. Afterward, we build the attribute co-occurrence graph network to mine cooperation preferences from social media interaction information. Based on account attributes, the cooperation preferences between brands and micro-influencers are refined to attributes’ co-occurrence information. The attribute node embeddings learned in the attribute co-occurrence graph network are further utilized to construct the account attribute representation. Finally, the global ranking function is designed to generate ranking scores for all brand-micro-influencer pairs from the three perspectives jointly. The extensive experiments on a publicly available dataset demonstrate the effectiveness of our proposed model over the state-of-the-art methods.

Abstract:
Semantic portrait synthesis has drawn consistent attention and has made significant progress, yet achieving style diversity and semantic controllability simultaneously is still a challenge. Existing methods either 1) directly take a semantic label map as input, ignoring various possibilities of semantic styles, or 2) sample global noise as input, ignoring controllability of local semantics. To fill this gap, we propose semantic-aware noise, a simple but effective input that tackles both issues and shows improved results over baselines. Semantic-aware noise introduces semantic information into noise, and each semantic is sampled from the noise separately, combining the semantic controllability and the noise sampling diversity. To further expand and manipulate real images, we propose a novel ternary network structure, allowing simultaneous diverse semantic image synthesis and real image manipulation in a unified framework. Extensive experiments demonstrate that the proposed method achieves quantitatively superior and perceptually pleasing results compared to state-of-the-art methods. We also analyze the performance of our method with respect to different noise structures and real-life applications in diverse synthesis, interactive manipulation, and extreme pose scenarios.

Abstract:
Imitation filming has been applied to autonomous filming by mimicking human operators. To imitate the operation of cameramen when filming multiple human actions, existing methods plan the camera motion through time series prediction or train multiple models to handle a particular style in a specific situation. As a result, these methods require various settings to adapt to different scenarios. In this work, we overcome such limitations and propose an end-to-end imitation learning framework for drone cinematography systems. The framework consists of two main components: (1) an efficient motion feature extraction module for generating a compact motion feature space, (2) a path-analysis-based reinforcement learning (PABRL) algorithm for imitating multiple filming styles from demonstrations and incorporating aesthetical features for improved perspective shots. Our PABRL method is based on the actor–critic network, which regards multiple human motion variables, camera translations, and image composition as inputs and then outputs an aesthetical filming strategy related to the subject motion. In addition, we propose an attention mechanism and a long–short-term rewarding function to enhance the motion feature space and the integrity of the generated trajectory, respectively. Extensive experimental results in simulated and real outdoor environments demonstrate that compared with state-of-the-art methods, our method can achieve 69.8% higher performance in terms of trajectory planning accuracy while successfully incorporating aesthetical features into the captured videos.

Abstract:
Existing convolutional neural network (CNN) based image super-resolution (SR) methods have achieved impressive performance on the bicubic kernel, but they cannot handle unknown degradations in real-world applications. Recent blind SR methods propose to reconstruct SR images relying on blur kernel estimation. However, their results still contain visible artifacts and detail distortion due to the estimation errors. To alleviate these problems, in this paper, we propose an effective and kernel-free network, namely DSSR, which enables recurrent detail-structure alternating optimization without incorporating a blur kernel prior for blind SR. Specifically, in our DSSR, a detail-structure modulation module (DSMM) is built to exploit the interaction and collaboration of image details and structures. The DSMM consists of two components: a detail restoration unit (DRU) and a structure modulation unit (SMU). The former aims at regressing the intermediate HR detail reconstruction from LR structural contexts, and the latter performs structural context modulation conditioned on the learned detail maps in both HR and LR spaces. Besides, we use the output of DSMM as the hidden state and design our DSSR architecture from a recurrent convolutional neural network (RCNN) view. In this way, the network can alternately optimize the image details and structural contexts, achieving co-optimization across time. Moreover, equipped with the recurrent connection, our DSSR makes low- and high-level feature representations complementary by observing previous HR details and contexts at every unrolling step. Extensive experiments on synthetic datasets and real-world images demonstrate that our method achieves state-of-the-art performance compared with existing methods.

Abstract:
We study cross-modal recommendation of music tracks to be used as soundtracks for videos. This problem is known as the music supervision task. We build on a self-supervised system that learns a content association between music and video. In addition to the adequacy of content, adequacy of structure is crucial in music supervision to obtain relevant recommendations. We propose a novel approach to significantly improve the system’s performance using structure-aware recommendation. The core idea is to consider not only the full audio-video clips but also shorter segments for training and inference. We find that using semantic segments and ranking the tracks according to sequence alignment costs significantly improves the results. We investigate the impact of different ranking metrics and segmentation methods.
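
As an illustration of ranking tracks by sequence alignment cost, the following is a minimal sketch that uses dynamic time warping (DTW) over per-segment embeddings. The abstract does not name the alignment algorithm or the distance, so DTW, the cosine distance, and the function names `alignment_cost` and `rank_tracks` are all assumptions made for this example.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def alignment_cost(video_segs, music_segs):
    """DTW cost between two sequences of segment embeddings (assumed metric)."""
    n, m = len(video_segs), len(music_segs)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = cosine_distance(video_segs[i - 1], music_segs[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def rank_tracks(video_segs, tracks):
    """Return track names sorted from lowest to highest alignment cost."""
    return sorted(tracks, key=lambda name: alignment_cost(video_segs, tracks[name]))

# Toy example: track "a" follows the same segment order as the video.
video = [[1.0, 0.0], [0.0, 1.0]]
tracks = {"a": [[1.0, 0.0], [0.0, 1.0]], "b": [[0.0, 1.0], [1.0, 0.0]]}
print(rank_tracks(video, tracks))
```

In this toy case the track whose segment order matches the video receives the lower cost, which is the structural effect the abstract describes.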

Abstract:
Conversational recommendation systems (CRS) attract increasing attention in various application domains such as retail and travel. They offer an effective way to capture users’ dynamic preferences with multi-turn conversations. However, most current studies center on the recommendation aspect while over-simplifying the conversation process. Neglecting the complexity of the data structure and conversation flow hinders their practicality and utility. In reality, there exist various relationships among slots and values, while users’ requirements may dynamically adjust or change. Moreover, the conversation often involves the visual modality to facilitate the exchange. These call for a more advanced internal state representation of the dialogue and a proper reasoning scheme to guide the decision-making process. In this paper, we explore multiple facets of multimodal conversational recommendation and try to address the above-mentioned challenges. In particular, we represent the structured back-end database as a multimodal knowledge graph which captures the various relations and evidence in different modalities. The user preferences expressed via conversation utterances will then be gradually updated to the state graph with clear polarity. Based on these, we train an end-to-end State Graph-based Reasoning model (SGR) to perform reasoning over the whole state graph. The prediction of our proposed model benefits from the structure of the graph. It not only allows for zero-shot reasoning for items unseen in training conversations, but also provides a natural way to explain the policies. Extensive experiments show that our model achieves better performance compared with existing methods.

Abstract:
The scene graph is a structured semantic representation of an image, which represents objects and relationships with vertices and edges, respectively. Since it is impossible to manually label all potential relationships in the real world, some previous methods try to apply the zero-shot method for scene graph generation. However, existing methods take the triplet (i.e., $\langle \text{subject-predicate-object} \rangle$) as the basic unit of a relationship. Each element (i.e., subject, predicate, or object) of the unseen relationship is actually seen in the training data. Therefore, they ignore the unseen predicate. To predict the unseen predicate, we introduce a novel task named zero-shot predicate prediction, which is crucial to extending existing scene graph generation methods to recognize more relationship classes. The new task is challenging and cannot be simply resolved through conventional zero-shot learning methods because there is a large intra-class variation of each predicate. Firstly, the large intra-class variation leads to the difficulty of computing the discriminative instance-level feature of the predicate class. Secondly, the large intra-class variation also brings more difficulties when knowledge is transferred from seen classes to unseen classes. For the first challenge, we propose distilling lexical knowledge of different objects and constructing multi-modal representations of pairwise objects to reduce the intra-class variation of the predicate. To respond to the second challenge, we build a compact semantic space where the representations of unseen classes are reconstructed based on the seen classes for zero-shot predicate classification. We evaluate the proposed method on the public dataset Visual Genome. Extensive experimental results under the zero-shot/few-shot/supervised settings demonstrate the effectiveness of the proposed method.

Abstract:
Occlusion poses a major challenge for person re-identification (ReID). Existing approaches typically rely on outside tools to infer visible body parts, which may be suboptimal in terms of both computational efficiency and ReID accuracy. In particular, they may fail when facing complex occlusions, such as those between pedestrians. Accordingly, in this paper, we propose a novel method named Quality-aware Part Models (QPM) for occlusion-robust ReID. First, we propose to jointly learn part features and predict part quality scores. As no quality annotation is available, we introduce a strategy that automatically assigns low scores to occluded body parts, thereby weakening the impact of occluded body parts on ReID results. Second, based on the predicted part quality scores, we propose a novel identity-aware spatial attention (ISA) module. In this module, a coarse identity-aware feature is utilized to highlight pixels of the target pedestrian, so as to handle the occlusion between pedestrians. Third, we design an adaptive and efficient approach for generating global features from common non-occluded regions with respect to each image pair. This design is crucial, but is often ignored by existing methods. QPM has three key advantages: 1) it does not rely on any outside tools in either the training or inference stages; 2) it handles occlusions caused by both objects and other pedestrians; 3) it is highly computationally efficient. Experimental results on four popular databases for occluded ReID demonstrate that QPM consistently outperforms state-of-the-art methods by significant margins. The code of QPM is available at https://github.com/Wang-pengfei/QPM.

Abstract:
Occlusion is a challenging yet commonly seen problem for facial perception. Existing works resort to deep learning models and perform model training on synthesized data due to the lack of paired real-world data. As a result, they usually perform unsatisfactorily on real-world occluded faces because of domain gaps. In this paper, we decompose the face de-occlusion task into three stages, i.e., occlusion detection, face parsing, and face reconstruction, to alleviate this issue. We first perform occlusion detection and use its results as guidance for the second stage to conduct occlusion-free face parsing. As such, face de-occlusion is first performed in the face parsing space with less difficulty. We can train these two stages on both synthesized and real-world images, and hence obtain accurate results for the latter. In the last stage, we use the domain-agnostic occlusion detection map and the face parsing map as the guidance to conduct face reconstruction, which reduces the impact of appearance information and improves the model performance on real-world data. Aiming at improving the model capacity of inferring occluded facial appearance, we also propose two types of reference modules to use relevant facial parts to enhance the reconstruction of occluded regions. Consequently, our proposed model achieves promising face de-occlusion results on real-world images.

Abstract:
More and more users are getting used to posting images and text on social networks to share their emotions or opinions. Accordingly, multimodal sentiment analysis has become a research topic of increasing interest in recent years. Typically, there exist affective regions that evoke human sentiment in an image, which are usually manifested by corresponding words in people's comments. Similarly, people also tend to portray the affective regions of an image when composing image descriptions. As a result, the relationship between image affective regions and the associated text is of great significance for multimodal sentiment analysis. However, most of the existing multimodal sentiment analysis approaches simply concatenate features from image and text, which could not fully explore the interaction between them, leading to suboptimal results. Motivated by this observation, we propose a new image-text interaction network (ITIN) to investigate the relationship between affective image regions and text for multimodal sentiment analysis. Specifically, we introduce a cross-modal alignment module to capture region-word correspondence, based on which multimodal features are fused through an adaptive cross-modal gating module. Moreover, considering the complementary role of context information on sentiment analysis, we integrate the individual-modal contextual feature representations for achieving more reliable prediction. Extensive experimental results and comparisons on public datasets demonstrate that the proposed model is superior to the state-of-the-art methods.

Abstract:
Domain generalization in person re-identification is an important and practical task in which a model trained with data from several source domains is expected to generalize well to unseen target domains. Domain adversarial learning is a promising domain generalization method that aims to remove domain information in the latent representation through adversarial training. However, in person re-identification, the domain and class are correlated, and we theoretically show that domain adversarial learning will lose certain information about class due to this domain-class correlation. Inspired by causal inference, we propose to perform interventions on the domain factor $d$, aiming to decompose the domain-class correlation. To achieve this goal, we propose to estimate the resulting representation $\hat{z}$ caused by the intervention through first- and second-order statistical characteristic matching. Specifically, we build a memory bank to store the statistical characteristics of each domain. Then, we use the newly generated samples $\{\hat{z}, y, \hat{d}\}$ to compute the loss function. These samples are domain-class correlation decomposed; thus, we can learn a domain-invariant representation that can capture more class-related features. Extensive experiments show that our model outperforms the state-of-the-art methods on the large-scale domain generalization Re-ID benchmark.
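
To make the idea of first- and second-order statistic matching concrete, here is a minimal sketch that moves a feature vector from one domain's statistics to another's. It assumes the statistics are a per-dimension mean and standard deviation and that the intervention is a simple re-normalization; the abstract does not specify either, and the names `domain_stats` and `intervene` are hypothetical.

```python
import math

def domain_stats(features):
    """Per-dimension mean and standard deviation of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mean = [sum(f[k] for f in features) / n for k in range(dim)]
    std = [math.sqrt(sum((f[k] - mean[k]) ** 2 for f in features) / n)
           for k in range(dim)]
    return mean, std

def intervene(z, source_stats, target_stats, eps=1e-6):
    """Re-express z with the target domain's mean/std (assumed form of the intervention)."""
    mu_s, sd_s = source_stats
    mu_t, sd_t = target_stats
    return [(zk - ms) / (ss + eps) * st + mt
            for zk, ms, ss, mt, st in zip(z, mu_s, sd_s, mu_t, sd_t)]

# A tiny "memory bank": one statistics entry per domain.
bank = {
    "domain_a": domain_stats([[0.0], [2.0]]),    # mean 1, std 1
    "domain_b": domain_stats([[10.0], [14.0]]),  # mean 12, std 2
}
z_hat = intervene([2.0], bank["domain_a"], bank["domain_b"])
print(z_hat)
```

The class label of the sample is left unchanged; only the domain-dependent statistics of its feature are swapped.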

Abstract:
Due to the difficulty of annotating large amounts of training data, directly learning the association of sound and its makers in natural videos is a challenging task for machines. In this paper, we present a novel audiovisual model that introduces a soft-clustering module as the audio and visual content detector, and regards the pervasive property of audiovisual concurrency as the latent supervision for inferring the correlation among detected contents. Furthermore, we discover for the first time that the complexity of data has an impact on the training efficiency and subsequent performance of the audiovisual model, i.e., more complex data brings more obstacles to model training and degrades the performance of downstream audiovisual tasks. To address this issue, we propose a novel heterogeneous audiovisual scene analysis module that trains the model from simple to complex scenes. We show that such an ordered learning procedure gives the model the benefits of easy training and fast convergence. Meanwhile, our audiovisual model can also provide effective unimodal representation and cross-modal alignment performance. We further deploy the well-trained model in practical audiovisual sound localization and separation tasks. We show that our localization model significantly outperforms existing methods, and based on it we show comparable performance in the sound separation task compared with several related SOTA audiovisual learning methods, without referring to external visual supervision.

Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging task in computer vision, aiming at matching people across images from visible and infrared modalities. The widely used VI-ReID framework consists of a convolutional neural backbone network that extracts the visual features, and a feature embedding network to project heterogeneous features to the same feature space. However, many studies based on the existing pre-trained models neglect potential correlations between different locations and channels within a single sample during feature extraction. Inspired by the success of the Transformer in computer vision, we extend it to enhance feature representation for VI-ReID. In this paper, we propose a discriminative feature learning network based on a visual Transformer (DFLN-ViT) for VI-ReID. Firstly, to capture long-term dependencies between different locations, we propose a spatial feature awareness module (SAM), which utilizes a single-layer Transformer with a novel patch-embedding strategy to encode location information. Secondly, to refine the representation at each channel, we design a channel feature enhancement module (CEM). The CEM treats the features of each channel as a sequence of Transformer inputs, taking advantage of the Transformer's ability to model long-term dependencies. Finally, we propose a Triplet-aided Hetero-Center (THC) loss to learn more discriminative feature representations by balancing the cross-modality distance and intra-modality distance of the center. The experimental results on two datasets show that our method can significantly improve the VI-ReID performance, outperforming most state-of-the-art methods.

Abstract:
Temporal action detection is a challenging task in video understanding, which is usually divided into two stages: proposal generation and classification. Learning proposal features is a crucial step for both stages. However, most methods ignore temporal information of proposals and consider background and action frames in proposals equally, leading to poor proposal features. In this paper, we propose a novel Temporal Attention-Pyramid Pooling (TAPP) method to learn proposal features of arbitrary length action proposals. The TAPP method exploits the attention mechanism to focus on the discriminative part of proposals, suppressing background influence on proposal features. It constructs a temporal pyramid structure to convert arbitrary length proposal feature sequences to multiple fixed-length sequences while retaining the temporal information. In the TAPP method, we design a multi-scale temporal function and apply it to the temporal pyramid to generate final proposal features. Based on the TAPP method, we construct a temporal action proposal generation model and an action proposal classification model, and then we perform extensive experiments on two mainstream temporal action detection datasets for the temporal action proposal and temporal action detection tasks to verify our models. On the THUMOS’14 dataset, our models based on the TAPP significantly outperform the previous state-of-the-art methods for both tasks.
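
The conversion of an arbitrary-length proposal into a fixed number of features can be sketched as follows. This is a minimal illustration, assuming softmax attention weights within each pyramid bin and pyramid levels of 1, 2, and 4 bins; the actual attention design and multi-scale temporal function of TAPP are not given in the abstract, and the names `attention_pool` and `pyramid_pool` are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(frames, scores):
    """Attention-weighted average of frame features within one bin."""
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * f[k] for w, f in zip(weights, frames)) for k in range(dim)]

def pyramid_pool(frames, scores, levels=(1, 2, 4)):
    """Map T frame features to sum(levels) pooled features, keeping temporal order."""
    t = len(frames)
    pooled = []
    for bins in levels:
        for b in range(bins):
            start = (b * t) // bins
            # Guarantee at least one frame per bin, even when t < bins.
            end = max(start + 1, ((b + 1) * t) // bins)
            pooled.append(attention_pool(frames[start:end], scores[start:end]))
    return pooled

short = pyramid_pool([[float(i)] for i in range(5)], [0.0] * 5)
long_ = pyramid_pool([[float(i)] for i in range(9)], [0.0] * 9)
print(len(short), len(long_))
```

Frames with low attention scores (for instance background frames) contribute little to each pooled feature, while the bin order preserves coarse temporal structure.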

Abstract:
Sight-singing exercises are a fundamental part of music education. In this paper, we present an objective and complete automatic evaluation system for sight-singing, which has two critical stages: note transcription and note alignment. In the first stage, we use an onset detector based on the convolutional recurrent neural network (CRNN) for note segmentation and the pitch extractor described in (Kim et al. 2018) for note labeling. In the second stage, an alignment algorithm based on relative pitch modeling is proposed. Due to the lack of datasets for sight-singing note alignment and the overall system evaluation, we construct the sight-singing vocal dataset (SSVD). Each module of the system and the entire system are tested on this dataset. The onset detector achieves an F-measure of 90.61%, and the stages of note transcription and note alignment achieve an F-measure of 88.42% and 94.79%, respectively. In addition, we propose an objective criterion for the sight-singing evaluation system. Based on this criterion, our automatic sight-singing system achieves an F-measure of 77.95% on the SSVD dataset.
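
The F-measure figures above combine precision and recall as F = 2PR / (P + R). The sketch below shows how such a score might be computed for onset detection. The greedy one-to-one matching and the 50 ms tolerance window are assumptions for illustration, since the abstract does not state the matching protocol; `onset_f_measure` is a hypothetical name.

```python
def onset_f_measure(reference, detected, tolerance=0.05):
    """F-measure of detected onsets (seconds) against reference onsets.

    A detection counts as a hit if it lies within `tolerance` seconds of a
    reference onset that has not been matched yet (assumed protocol).
    """
    unmatched = sorted(reference)
    hits = 0
    for t in sorted(detected):
        for i, r in enumerate(unmatched):
            if abs(t - r) <= tolerance:
                hits += 1
                del unmatched[i]  # each reference onset is matched at most once
                break
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Two of three detections match two of four reference onsets.
print(onset_f_measure([0.5, 1.0, 1.5, 2.0], [0.51, 1.02, 1.70]))
```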

Abstract:
Simultaneous localization and mapping via LiDAR-inertial fusion is a crucial technology in many automation-related applications. Recently, a number of approaches based on geometric features have emerged, yielding impressive results via tightly-coupled estimation. Such feature-based techniques, however, are inextricably linked to the scanning mechanism of the LiDAR and rely on stable feature detection, and are thus difficult to adapt to multi-LiDAR systems. A few “direct” solutions, on the other hand, register the raw point cloud with the built probability map, which is more computationally efficient and easier to extend. However, the existing direct approaches are all loosely-coupled and lack correction of the IMU biases, and thus only work well in 2D cases. To this end, we present D-LIOM, a tightly-coupled Direct LiDAR-Inertial Odometry and Mapping framework. In D-LIOM, a scan is directly registered to a probability submap, and the LiDAR odometry, the IMU pre-integration, and the gravity constraint are integrated to build a local factor graph in the submap's time window, allowing the system to perform real-time high-precision pose estimation. Furthermore, to eliminate errors accumulated over time, we detect loops and adjust the sparse pose graph based on mutual matching of projected 2D submaps, allowing D-LIOM to run stably in large-scale scenes. In addition, to improve its flexibility to varied sensor combinations, D-LIOM supports multi-LiDAR inputs and facilitates the initialization with a common 6-axis IMU. Extensive experiments demonstrate that D-LIOM largely outperforms the existing state-of-the-art counterparts in mapping quality and localization accuracy while maintaining high time efficiency. Lastly, to ensure that our results are entirely reproducible, all necessary data and code are made available as open source. An introductory video is also available online.

Abstract:
Self-attention (SA) based networks have achieved great success in image captioning, constantly dominating the leaderboards of online benchmarks. However, existing SA networks still suffer from distance insensitivity and low-rank bottleneck. In this paper, we aim to optimize SA in terms of two aspects, thereby addressing the above issues. First, we introduce a Distance-sensitive Self-Attention (DSA), which considers the raw geometric distances between query-key pairs in the 2D images during SA modeling. Second, we present a simple yet effective approach, named Multi-branch Self-Attention (MSA) to compensate for the low-rank bottleneck. MSA treats a multi-head self-attention layer as a branch and duplicates it multiple times to increase the expressive power of SA. To validate the effectiveness of the two designs, we apply them to the standard self-attention network, and conduct extensive experiments on the highly competitive MS-COCO dataset. We achieve new state-of-the-art performance on both the local and online test sets, i.e., 135.1% CIDEr on the Karpathy split and 135.4% CIDEr on the official online split.
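
One way to picture distance sensitivity in self-attention is to subtract a penalty proportional to the geometric distance between the query and key positions from the usual scaled dot-product logits. The sketch below does exactly that for a single query. The linear penalty with weight `lam` is an assumption chosen for illustration and is not claimed to be the DSA formulation; `distance_sensitive_attention` is a hypothetical name.

```python
import math

def distance_sensitive_attention(query, keys, q_pos, k_pos, lam=1.0):
    """Attention weights of one query over keys, biased by 2D Euclidean distance."""
    d = len(query)
    logits = []
    for key, pos in zip(keys, k_pos):
        content = sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
        dist = math.hypot(q_pos[0] - pos[0], q_pos[1] - pos[1])
        # Assumed form: farther keys receive a larger negative bias.
        logits.append(content - lam * dist)
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two keys with identical content; only their positions differ.
weights = distance_sensitive_attention(
    [1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], (0.0, 0.0), [(0.0, 1.0), (0.0, 3.0)]
)
print(weights)
```

With `lam=0` the function reduces to ordinary scaled dot-product attention, which cannot tell the two keys apart.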

Abstract:
For a long time, the local descriptors learning benefited from the use of L2 normalization, which projects the descriptor space onto the hypersphere. However, there is no free lunch in the world. Although hypersphere description space stabilizes the optimization and improves the repeatability of the descriptors, it causes the descriptors to have a denser distribution, which reduces the discrimination between descriptors and leads to some incorrect matches. To alleviate this problem, we propose the learnable cross normalization technology as an alternative to L2 normalization, which can achieve a consistent improvement in several of the current popular local descriptors. In addition, we propose an ER-Backbone that can efficiently reuse features in descriptors extraction and an IDC Loss that can provide an image-level description space distribution consistency constraint to further stimulate the performance of the local descriptors. Based on the above innovations, we provide a novel local descriptors extraction method named CNDesc. We perform experiments on image matching, homography estimation, 3D reconstruction, and visual localization tasks, and the results demonstrate that our CNDesc surpasses the current state-of-the-art local descriptors. Our code is available at https://github.com/vignywang/CNDesc.
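
For reference, the L2 normalization discussed above divides each descriptor by its Euclidean norm, so every descriptor ends up on the unit hypersphere. The short sketch below (with the hypothetical helper name `l2_normalize`) also shows the side effect: descriptors that differ only in magnitude collapse to the same point.

```python
import math

def l2_normalize(desc, eps=1e-12):
    """Scale a descriptor to unit Euclidean norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in desc))
    return [x / max(norm, eps) for x in desc]

a = l2_normalize([3.0, 4.0])
b = l2_normalize([6.0, 8.0])
# Both descriptors map to the same point on the unit circle.
print(a, b)
```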

Abstract:
Local visual and long-range contextual features provide two complementary cues for humans reading text in natural scenes. Existing scene text recognition methods mainly extract local features at a low level and then model long-range dependencies at a high level; this sequential pipeline may be sub-optimal for constructing a complete and effective representation. Beyond high-level features, long-range contextual relations are also important in low-level features, since they can help separate different characters based on the intervals between characters and thus enhance the character features. To address this issue, we develop a dual relation module to extract complementary features in a parallel manner for scene text recognition, which consists of a local visual branch and a long-range contextual branch. The local visual branch employs a topological-aware operation to model intra-character characteristics and extract discriminative features of different characters. Meanwhile, the long-range contextual branch utilizes a simple but effective strategy to incorporate inter-character relations into feature maps. Our dual relation module is a plug-and-play block which can be easily incorporated into modern deep architectures. Experimental results demonstrate that our method achieves top performance on several standard benchmarks. Code and models will become publicly available in the future.

Abstract:
Omnidirectional images (ODIs) have recently attracted extensive attention from both academia and industry. However, due to storage and transmission limitations, ODIs are usually at extremely low resolution. Thus, it is necessary to restore a high-resolution ODI from a low-resolution ODI, i.e., omnidirectional image super-resolution (ODI-SR). Towards ODI-SR, we propose in this paper a novel latitude-aware upscaling network, namely LAU-Net+, which fully considers the above characteristics of ODIs. In our network, different latitude bands can learn to adopt distinct upscaling factors, which significantly saves computational resources and improves the SR efficiency. Specifically, a Laplacian multilevel pyramid network is introduced in which the upscaling factor is gradually increased with the number of levels. Each level is composed of a feature enhancement module (FEM), a drop-band decision module (DDM) and a high-latitude enhancement module (HEM). The FEM serves to enhance the high-level features extracted from the input ODI, while the role of the DDM is to dynamically drop the unnecessary high-latitude bands and send the remaining bands to the next level. The HEM is adopted to further enhance high-level features of dropped latitude bands with a lightweight architecture. In the DDM, we develop a reinforcement learning scheme with a latitude adaptive reward to determine which band should be dropped. To the best of our knowledge, our method is the first work that considers the latitude characteristics for the ODI-SR task. Extensive experimental results demonstrate that our LAU-Net+ achieves state-of-the-art results on ODI-SR both quantitatively and qualitatively on various ODI datasets.

Abstract:
Unsupervised person re-identification (Re-ID) aims to learn discriminative features without human-annotated labels. Recently, contrastive learning has provided a new prospect for unsupervised person Re-ID, and existing methods primarily constrain the feature similarity among easy sample pairs. However, the feature similarity among hard sample pairs is neglected, which yields suboptimal performance in unsupervised person Re-ID. In this paper, we propose a novel Hybrid Contrastive Model (HCM) to perform identity-level contrastive learning and image-level contrastive learning for unsupervised person Re-ID, which adequately explores feature similarities among hard sample pairs. Specifically, for the identity-level contrastive learning, an identity-based memory is constructed to store pedestrian features. Accordingly, we define the dynamic contrast loss to identify identity information with a dynamic factor for distinguishing hard/easy samples. As for the image-level contrastive learning, an image-based memory is established to store each image feature. We design the sample constraint loss to explore the similarity relationship between hard positive and negative sample pairs. Furthermore, we optimize the two contrastive learning processes in one unified framework to make use of their own advantages so as to constrain the feature distribution for extracting potential information. Extensive experiments demonstrate that the proposed HCM distinctly outperforms existing methods.
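
To illustrate what a contrastive loss against an identity-based memory looks like, the sketch below implements the generic InfoNCE form: the query feature should score high against its own identity's memory entry and low against all others. This is the plain baseline loss, not the dynamic contrast loss or sample constraint loss of HCM, whose exact forms the abstract does not give; the function name and the temperature value are assumptions.

```python
import math

def memory_contrastive_loss(feature, memory, positive_id, temperature=0.05):
    """Generic InfoNCE loss of one feature against a memory of identity centroids."""
    logits = [sum(a * b for a, b in zip(feature, entry)) / temperature
              for entry in memory]
    m = max(logits)
    # log-sum-exp computed stably, then subtract the positive logit.
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[positive_id]

memory = [[1.0, 0.0], [0.0, 1.0]]  # one unit-norm entry per (pseudo) identity
feature = [1.0, 0.0]
print(memory_contrastive_loss(feature, memory, positive_id=0))
print(memory_contrastive_loss(feature, memory, positive_id=1))
```

The loss is small when the feature agrees with its assigned identity and large otherwise, which is the signal used to pull features toward the right memory entry.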

Abstract:
Although the importance of sleep is increasingly recognized, the lack of general and transferable algorithms hinders scalable sleep assessment in healthy persons and those with sleep disorders. A deep understanding of the sleep posture, state, or stage is the premise of diagnosing and treating sleep diseases. At present, most existing methods draw support from supervised learning to monitor the whole sleep process. However, in the absence of sufficient labeled sleep data, it is difficult to guarantee the reliability of sleep recognition networks. To solve this problem, we propose a transferable self-supervised instance learning model for three sleep recognition tasks, i.e., sleep posture, state, and stage recognition. Firstly, a SleepGAN is designed to generate sleep data. Then, we combine enough self-supervised rotating sleep data and original data for non-parametric classification at the instance level. Finally, different sleep postures, states, or stages can be distinguished precisely. The proposed model can be applied to multimodal sleep data such as signals and images, can make up for the inaccuracy caused by insufficient data, and can be transferred to sleep datasets of different sizes. The experimental results show that our algorithm for the physiological changes in the sleep process is superior to several state-of-the-art studies, which may be helpful to promote the intelligence of sleep assessment and monitoring.

Abstract:
We present a model for predicting visual attention during the free viewing of graphic design documents. While existing works on this topic have aimed at predicting static saliency of graphic designs, our work is the first attempt to predict both spatial attention and dynamic temporal order in which the document regions are fixated by gaze using a deep learning based model. We propose a two-stage model for predicting dynamic attention on such documents, with webpages being our primary choice of document design for demonstration. In the first stage, we predict the saliency maps for each of the document components (e.g. logos, banners, texts, etc. for webpages) conditioned on the type of document layout. These component saliency maps are then jointly used to predict the overall document saliency. In the second stage, we use these layout-specific component saliency maps as the state representation for an inverse reinforcement learning model of fixation scanpath prediction during document viewing. To test our model, we collected a new dataset consisting of eye movements from 41 people freely viewing 450 webpages (the largest dataset of its kind). Experimental results show that our model outperforms existing models in both saliency and scanpath prediction for webpages, and also generalizes very well to other graphic design documents such as comics, posters, mobile UIs, etc. and natural images.

Abstract:
Unsupervised deep learning has recently demonstrated the promise of producing high-quality samples. While it has tremendous potential to promote the image colorization task, the performance is limited owing to the high-dimension of data manifold and model capability. This study presents a novel scheme that exploits the score-based generative model in wavelet domain to address the issues. By taking advantage of the multi-scale and multi-channel representation via wavelet transform, the proposed model learns the richer priors from stacked coarse and detailed wavelet coefficient components jointly and effectively. This strategy also reduces the dimension of the original manifold and alleviates the curse of dimensionality, which is beneficial for estimation and sampling. Moreover, dual consistency terms in the wavelet domain, namely data-consistency and structure-consistency are devised to leverage colorization task better. Specifically, in the training phase, a set of multi-channel tensors consisting of wavelet coefficients is used as the input to train the network with denoising score matching. In the inference phase, samples are iteratively generated via annealed Langevin dynamics with data and structure consistencies. Experiments demonstrated remarkable improvements of the proposed method on both generation and colorization quality, particularly in colorization robustness and diversity.
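
The inference procedure mentioned above, annealed Langevin dynamics, can be sketched on a toy problem. The example below samples from a one-dimensional distribution whose noise-perturbed score is known in closed form, standing in for a trained score network. It omits the wavelet transform and the data- and structure-consistency terms entirely; the step-size rule follows the commonly used schedule for this sampler, and all constants and names are assumptions for illustration.

```python
import math
import random

def annealed_langevin(score, x0, sigmas, steps=100, eps=2e-5, rng=None):
    """Run Langevin updates at decreasing noise levels and return the final sample."""
    rng = rng or random.Random(0)
    x = x0
    for sigma in sigmas:
        # Step size shrinks with the noise level.
        alpha = eps * (sigma / sigmas[-1]) ** 2
        for _ in range(steps):
            noise = rng.gauss(0.0, 1.0)
            x = x + 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * noise
    return x

def toy_score(x, sigma, target=3.0):
    """Score of N(target, sigma^2): a point mass at `target` blurred by noise sigma."""
    return -(x - target) / sigma ** 2

# Geometric noise schedule from 10 down to 0.01.
sigmas = [10.0 * (0.001 ** (i / 9)) for i in range(10)]
sample = annealed_langevin(toy_score, x0=-5.0, sigmas=sigmas)
print(sample)
```

Large noise levels let the sample move freely at first, and the final small levels concentrate it near the data, here the single value 3.0.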

Abstract:
Detection in large scenes is a challenging issue due to small objects and extreme scale variation. It is difficult for a deep-learning-based detector to extract features of small objects with only a few pixels. Most existing methods employ image pyramids and feature pyramids for multi-scale inference to alleviate this issue. However, they lack scale awareness to adapt to objects with different scales. In this paper, we propose a novel Adaptive Zoom (AdaZoom) network for scale-aware large scene object detection. There are three main contributions. First, an Adaptive Zoom network is proposed to actively focus and adaptively zoom the focused regions for high-performance object detection in large scenes. Second, to tackle the problem of missing annotations for focused regions, we train AdaZoom with a reward that measures the quality of generated regions, based on the paradigm of deep reinforcement learning. Finally, we propose collaborative training to iteratively promote the joint performance of AdaZoom and the detector. To validate the effectiveness, we conduct extensive experiments on the VisDrone2019, UAVDT and DOTA datasets. The experiments show that AdaZoom brings consistent and significant improvement over different detection networks, achieving state-of-the-art performance on these datasets and, in particular, outperforming the existing methods by 4.64% AP on VisDrone2019.

Abstract:
Photo collage aims to automatically arrange multiple photos on a given canvas with high aesthetic quality. Existing methods are based mainly on handcrafted feature optimization, which cannot adequately capture high-level human aesthetic senses. Deep learning provides a promising way, but owing to the complexity of collage and lack of training data, a solution has yet to be found. In this paper, we propose a novel pipeline for automatic generation of aspect ratio specified collage and the reinforcement learning technique is introduced in non-content-preserving collage. Inspired by manual collages, we model the collage generation as a sequential decision process to adjust spatial positions, orientation angles, placement order and the global layout. To instruct the agent to improve both the overall layout and local details, the reward function is specially designed for collage, considering subjective and objective factors. To overcome the lack of training data, we pretrain our deep aesthetic network on a large scale image aesthetic dataset (CPC) for general aesthetic feature extraction and propose an attention fusion module for structural collage feature representation. We test our model against competing methods on movie and image datasets and our results outperform others in several quality evaluations. Further user studies are also conducted to demonstrate the effectiveness.

Abstract:
Given an image, crowd counting aims to estimate the number of target objects in the image. With unpredictable installation situations of surveillance systems (or other equipment), crowd counting images from different data sets may exhibit severe discrepancies in viewing angle, scale, lighting condition, etc. As it is usually expensive and time-consuming to annotate each data set for model training, it has been an essential issue in crowd counting to transfer a well-trained model on a labeled data set (source domain) to a new data set (target domain). To tackle this problem, we propose a cross-domain learning network to learn the domain gaps in an unsupervised learning manner. The proposed network comprises a Multi-granularity Feature-aware Discriminator (MFD) module, a Domain-invariant Feature Adaptation (DFA) module, and a Cross-domain Vanishing Bridge (CVB) module to remove domain-specific information from the extracted features and promote the mapping performances of the network. Unlike most existing methods that use only a Global Feature Discriminator (GFD) to align features at image level, an additional Local Feature Discriminator (LFD) is inserted and, together with the GFD, forms the MFD module. As a complement to MFD, LFD refines features at pixel level and has the ability to align local features. The DFA module explicitly measures the distances between the source domain features and the target domain features and aligns the marginal distribution of their features with Maximum Mean Discrepancy (MMD). Finally, the CVB module provides an incremental capability of removing the impact of interfering part of the extracted features. Several well-known networks are adopted as the backbone of our algorithm to prove the effectiveness of the proposed adaptation structure. Comprehensive experiments demonstrate that our model achieves performance competitive with the state-of-the-art methods.
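
The MMD term mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' DFA module: the RBF kernel, the bandwidth `sigma`, the biased estimator, and the function names are all assumptions chosen for illustration. It only shows how a squared MMD between a set of source features and a set of target features might be computed.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """RBF kernel between two feature vectors (kernel choice is an assumption)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""

    def mean_kernel(xs, ys):
        # Average kernel value over all pairs drawn from xs and ys.
        total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


if __name__ == "__main__":
    src = [[0.0, 0.0], [1.0, 0.0]]
    tgt = [[5.0, 5.0], [6.0, 5.0]]
    print(mmd_squared(src, src))  # identical sets give zero
    print(mmd_squared(src, tgt))  # well-separated sets give a larger value
```

Minimizing such a quantity over the feature extractor's parameters pulls the two feature distributions together; in a real training loop it would be computed on mini-batches with an automatic-differentiation framework.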

Abstract:
Segmentation-based text detectors are flexible in capturing arbitrary-shaped text regions. Due to large geometry variance, it is necessary to construct effective and robust representations to identify text regions with various shapes and scales. In this paper, we focus on designing effective multi-scale contextual features for locating text instances. Specifically, we develop a Region Context Module (RCM) to summarize the semantic response and adaptively extract text-region-aware information in a limited local area. To construct complementary multi-scale contextual representations, multiple RCM branches with different scales are employed and integrated via a Progressive Fusion Module (PFM). Our proposed RCM and PFM serve as plug-and-play modules which can be incorporated into existing scene text detection platforms to further boost detection performance. Extensive experiments show that our methods achieve state-of-the-art performances on Total-Text, SCUT-CTW1500 and MSRA-TD500 datasets. The code with models will become publicly available at https://github.com/wqtwjt1996/RP-Text.

Abstract:
With the continuous development of computer hardware equipment and deep learning technology, it is easier for people to swap faces in videos using emerging multimedia tampering tools, such as the popular deepfake. This brings a series of new security threats. Although many forensic studies have focused on this new type of manipulation and achieved high detection accuracy, most of them are based on supervised learning and require a large number of labeled samples for training. In this paper, we develop the first unsupervised approach for identifying deepfake videos. The main principle behind our proposed method is that the face region in the real video is taken by the camera while its counterpart in the deepfake video is usually generated by the computer; the provenance of the two videos is totally different. Specifically, our method includes two clustering stages based on Photo-Response Non-Uniformity (PRNU) and noiseprint features. Firstly, the PRNU fingerprint of each video frame is extracted, which is used to cluster the full-size videos from an identical source (regardless of whether they are real or fake). Secondly, we extract the noiseprint from the face region of the video, which is used to identify (re-cluster for the task of binary classification) the deepfake sample in each cluster. Numerical experiments verify that our proposed unsupervised method performs very well on our own dataset and the benchmark FF++ dataset. More importantly, its performance rivals that of the supervised state-of-the-art detectors.

Abstract:
The graph convolutional network (GCN), as a powerful tool in graph data processing, is widely exploited in many machine learning and computer vision tasks. However, existing GCNs usually assume that the network has fixed outputs, which is usually contrary to the real-world class number being unknown and incremental, leading to an open set classification problem in which the finite training dataset cannot contain all labels in the infinite testing data. To overcome these issues, a novel Bayesian model is proposed, in which we couple GCN and a deep generative clustering model in a unified framework. In our model, the GCN model is used to detect the known classes, the deep generative clustering model is designed to generate the novel classes, and a two-level label generative process is constructed to extend the finite GCN outputs to infinity and fuse the label generated by the GCN model and the deep generative model. Although posterior inference is difficult, our model leads to an efficient variational inference-based optimization method. Experiments on various datasets validate our theoretical analysis and demonstrate that our model can achieve state-of-the-art performance. Our source code has been released on the website.

Abstract:
Multimodal sentiment analysis (MSA) plays an important role in many applications, such as intelligent question-answering, computer-assisted psychotherapy and video understanding, and has attracted considerable attention in recent years. It leverages multimodal signals including verbal language, facial gestures, and acoustic behaviors to identify sentiments in videos. Language modality typically outperforms nonverbal modalities in MSA. Therefore, strengthening the significance of language in MSA will be a vital way to promote recognition accuracy. Considering that the meaning of a sentence often varies in different nonverbal contexts, combining nonverbal information with text representations is conducive to understanding the exact emotion conveyed by an utterance. In this paper, we propose a Cross-modal Enhancement Network (CENet) model to enhance text representations by integrating visual and acoustic information into a language model. Specifically, it embeds a Cross-modal Enhancement (CE) module, which enhances each word representation according to long-range emotional cues implied in unaligned nonverbal data, into a transformer-based pre-trained language model. Moreover, a feature transformation strategy is introduced for acoustic and visual modalities to reduce the distribution differences between the initial representations of verbal and nonverbal modalities, thereby facilitating the fusion of distinct modalities. Extensive experiments on benchmark datasets demonstrate the significant gains of CENet over state-of-the-art methods.

Abstract:
Camouflaged object detection is a challenging visual task since the appearance and morphology of foreground objects and background regions are highly similar in nature. Recent CNN-based studies gradually integrated the high-level semantic information and the low-level local features of images through hierarchical and progressive structures to achieve camouflaged object detection. However, these methods ignore the spatial statistical properties of the local context, which is a critical cue for distinguishing and describing camouflaged objects. To address this problem, we propose a novel Deep Texton-Coherence Network (DTC-Net) that leverages the spatial organization of textons in the foreground and background regions as discriminative cues for camouflaged object detection. Specifically, a Local Bilinear module (LB) is devised to obtain the robust representation of texton to trivial details and illumination changes, by replacing the classic first-order linearization operations with bilinear second-order statistical operations in the convolution process. Next, these texton representations are associated with a Spatial Coherence Organization module (SCO) to capture irregular spatial coherence via a deformable convolutional strategy, and then the descriptions of the textons extracted by the LB module are used as weights to suppress features that are spatially adjacent but have different representations. Finally, the texton-coherence representation is integrated with the original features at different levels to achieve camouflaged object detection. Evaluation on the three most challenging camouflaged object detection datasets demonstrates the superiority of the proposed model when compared to the state-of-the-art methods. Furthermore, our ablation studies and performance analyses demonstrate the effectiveness of the texton-coherence module.

Abstract:
As a challenging visual task, visual object tracking has recently been composed of the classification and regression subtasks. The anchor-free regression network gets rid of the dependence on the anchors, but the redundant range makes it usually regress some samples involving non-target information. Evenly dividing a target by the regular receptive field often causes ambiguous target localization. To address these issues, we propose a regression-selective feature-adaptive tracker (RSFA), where the regression-selective subnetwork can not only free the regression task from anchors, but can also select more effective regression samples using the refined criterion. The proposed feature-adaptive strategy concentrates the classification subnetwork on target feature extraction via adaptively modifying the receptive field, and the attached centrality branch offers a correction for target localization by exploiting the spatial information. Additionally, the designed online update mechanism realizes the tracker's online optimization, improving robustness against target deformation. Extensive experiments are conducted on challenging benchmarks, including GOT-10k, OTB2015, UAV123, NFS, VOT2018, VOT2019 and VOT2020-ST. Our tracker achieves satisfactory tracking results, and the evaluations of its tracking performance rank first or second in comparison with the state-of-the-art tracking algorithms.

Abstract:
Graph-based multi-view clustering has attracted considerable attention in the multimedia data analysis community due to its good clustering performance and efficiency in characterizing the relationship between data. But the existing graph-based clustering methods still have many shortcomings. Firstly, they have high computational complexity due to the eigenvalue decomposition. Secondly, the complementary information and spatial structure embedded in different views can affect the clustering performance. However, some existing graph-based clustering methods do not consider these two points. In this article, we use the anchor graphs of different views as input, which effectively reduces the computational complexity. We then explicitly consider the complementary information and spatial structure between anchor graphs of different views by minimizing the tensor Schatten p-norm, aiming to achieve a better tensor with low-rank approximation. Finally, we learn the view-consensus anchor graph with connectivity constraints, which can directly indicate clusters by self-weighted strategy. An efficient alternating algorithm is then derived to optimize the proposed multi-view special clustering model. Furthermore, the constructed sequence is proved to converge to the stationary KKT point. Experiments show that our proposed method not only reduces the time cost, but also outperforms the most advanced methods.

Abstract:
Fashion compatibility prediction has attracted a lot of attention recently. Mining the compatibility between fashion items in an outfit is different from learning the visual similarity, since this relationship is more delicate. Decomposing the outfit compatibility into pairwise item matching is a popular way to treat the problem. However, in most existing methods, the items are matched without considering the context, i.e., the remaining items in the outfit. Recent efforts have been made to learn the underlying high order relationships among items by treating the outfit as a whole. These models could be sensitive to the properties of different datasets, and the item representations in these models are not as compact as those in the pairwise models. In this paper, we propose a context conditioning embedding approach to learn compact representations that preserve the shared information among items under the existence of contextual items. We use two different spaces, the general and the contextual spaces, to embed items, where the representation in the contextual space contains information from the context. We employ mutual information maximization for model learning, which is shown to be more appropriate for the problem. With extensive experiments, we show that our model outperforms other state-of-the-art methods.

Abstract:
Capturing screen content by smart-phone cameras has become a daily routine to record or share instant information from display screens for convenience. However, these recaptured screen images are often degraded by moiré patterns and usually present color cast against the original screen source. We observe that performing demoiréing in raw domain before feeding into the image signal processor (ISP) is more effective than demoiréing in the sRGB domain as done in recent demoiréing works. In this paper, we investigate the demoiréing of raw images through a class-specific learning approach. To this end, we build the first well-aligned raw moiré image dataset by pixel-wise alignment between the recaptured images and source ones. Noting that document images occupy a large portion of screen contents and have different properties from generic images, we propose a class-specific learning strategy for textual images and natural color images. In addition, to deal with moiré patterns with various scales, a multi-scale encoder with multi-level feature fusion is proposed. The shared encoder enables us to extract rich representations for the two kinds of contents and the class-specific decoders benefit the specific content reconstruction by focusing on targeted representations. Experimental results demonstrate that our method achieves state-of-the-art demoiréing performance. We have released the code and dataset at https://github.com/tju-chengyijia/RDNet

Abstract:
Instance segmentation is heavily reliant on large-scale annotated datasets to yield an ideal accuracy. However, annotated data are difficult to collect. To expand the annotated data, a straightforward idea is to introduce semi-supervised learning, which uses a trained model to obtain initial proposals on unlabeled images and then uses the initial proposals to generate pseudo labels. However, existing methods inevitably introduce a bias into model learning, i.e., the foreground in initial low-confident proposals (low-confident foreground) is arbitrarily assigned as background. This bias makes the foreground and background closer in the feature space, which degrades the model accuracy. To address this issue, this paper discards incorrect supervision and designs a bias-correction feature learner. Specifically, on the one hand, low-confident foreground does not participate in supervised learning. On the other hand, we extract possible foreground regions from all initial proposals to construct high-quality positive pairs which depict objects of the same category in contrastive learning. Then, positive pairs are pulled closer in the feature space. This helps models extract closely clustered foreground features. Experimental results demonstrate the effectiveness of our method on the public datasets (i.e., COCO, Cityscapes and Pascal VOC).

Abstract:
Text-based image manipulation is a popular subject and has many applications. However, it is a challenging task because there is no ground-truth edited dataset and textual descriptions have abstractive and ambiguous properties. To alleviate these difficulties, we propose a manipulation framework consisting of the proposal attentional GANs, language-related semantic mask, and language-guided ranker. Specifically, we construct an editing proposal generator to generate suitable edited proposals with and without semantic conditions, which supports the reorganization of sub-generators to output proposals covering as many aspects as possible. To distinguish the text-relevant and the text-irrelevant regions, we introduce a language-related semantic mask based on the source image and target caption. Then, we exploit a language-guided ranker to retrieve the best edited result from the edited proposals by using the multi-modal similarity and the language-related semantic mask. Extensive experiments on widely-used datasets demonstrate that our model could manipulate images interactively and improve the editing quality effectively.

Abstract:
Preserving privacy is a growing concern in our society where cameras are ubiquitous. In this work, we propose a trainable image acquisition method that removes the sensitive information in the optical domain before it reaches the image sensor. The method benefits from a trainable optical convolution kernel, which transmits the desired information whilst filtering out the sensitive information, making it irretrievable against different privacy attacks in the digital domain. This is in contrast with the current digital privacy-preserving methods that are all vulnerable to direct access attacks. Also, in contrast with most of the previous optical privacy-preserving methods that cannot be trained, our method is data-driven and optimized for the specific application at hand. Moreover, there is no additional computation or power burden on the acquisition system since it works passively in the optical domain and can be even used in conjunction with other privacy-preserving techniques in the digital domain. We demonstrate our new, generic method in several scenarios such as smile or open-mouth detection as the desired attribute while the gender or wearing make-up is filtered out as the sensitive content. Through several experiments, we show that this method is able to reduce around 65% of sensitive content while causing a negligible reduction in the desired information. Moreover, we tested our method by deep reconstruction attack and confirmed the ineffectiveness of this attack to reconstruct the original sensitive content. This new method has different use cases such as feedback systems for smart TV content or outdoor advertising.

Abstract:
Online relevance feedback (RF) is widely utilized in instance search (INS) tasks to further refine imperfect ranking results, but it often has low interaction efficiency. The active learning (AL) technique addresses this problem by selecting valuable feedback candidates. However, mainstream AL methods require an initial labeled set for a cold start and are often computationally complex to solve. Therefore, they cannot fully satisfy the requirements for online RF in interactive INS tasks. To address this issue, we propose a confidence-aware active feedback method (CAAF) that is specifically designed for online RF in interactive INS tasks. Inspired by the explicit difficulty modeling scheme in self-paced learning, CAAF utilizes a pairwise manifold ranking loss to evaluate the ranking confidence of each unlabeled sample. The ranking confidence improves not only the interaction efficiency by indicating valuable feedback candidates but also the ranking quality by modulating the diffusion weights in manifold ranking. In addition, we design two acceleration strategies, an approximate optimization scheme and a top-K search scheme, to reduce the computational complexity of CAAF. Extensive experiments on both image INS tasks and video INS tasks searching for buildings, landscapes, persons, and human behaviors demonstrate the effectiveness of the proposed method. Notably, in the real-world, large-scale video INS task of NIST TRECVID 2021, CAAF uses 25% fewer feedback samples to achieve a performance that is nearly equivalent to the champion solution. Moreover, with the same number of feedback samples, CAAF's mAP is 51.9%, significantly surpassing the champion solution by 5.9%. Code is available at https://github.com/nercms-mmap/caaf.
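
For readers unfamiliar with the manifold ranking that this abstract builds on, the following is a minimal sketch of the classic iterative formulation, f <- alpha * S * f + (1 - alpha) * y with S the symmetrically normalized affinity matrix. It is not CAAF: the confidence-based modulation of diffusion weights, the pairwise loss, and the acceleration strategies are omitted, and the function names and the default `alpha` are assumptions.

```python
import math


def normalize_affinity(w):
    """Symmetric normalization S = D^(-1/2) W D^(-1/2) of an affinity matrix."""
    degrees = [sum(row) for row in w]
    n = len(w)
    return [
        [
            w[i][j] / math.sqrt(degrees[i] * degrees[j]) if w[i][j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def manifold_rank(w, y, alpha=0.9, iters=200):
    """Diffuse the query/feedback vector y over the graph by fixed-point iteration."""
    s = normalize_affinity(w)
    n = len(y)
    f = list(y)
    for _ in range(iters):
        f = [
            alpha * sum(s[i][j] * f[j] for j in range(n)) + (1.0 - alpha) * y[i]
            for i in range(n)
        ]
    return f


if __name__ == "__main__":
    # A 4-node chain graph; node 0 is the query.
    chain = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    print(manifold_rank(chain, [1.0, 0.0, 0.0, 0.0], alpha=0.5))
```

In a relevance feedback loop, `y` would be updated with the user's positive labels after each round and the ranking recomputed.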

Abstract:
Applying deep learning to video compression has attracted increasing attention in recent years. In this work, we address end-to-end learned video compression with a special focus on better learning and utilizing temporal contexts. We propose to propagate not only the last reconstructed frame but also the feature before obtaining the reconstructed frame for temporal context mining. From the propagated feature, we learn multi-scale temporal contexts and re-fill the learned temporal contexts into the modules of our compression scheme, including the contextual encoder-decoder, the frame generator, and the temporal context encoder. We discard the parallelization-unfriendly auto-regressive entropy model to pursue a more practical encoding and decoding time. Experimental results show that our proposed scheme achieves a higher compression ratio than the existing learned video codecs. Our scheme also outperforms x264 and x265 (representing industrial software for H.264 and H.265, respectively) as well as the official reference software for H.264, H.265, and H.266 (JM, HM, and VTM, respectively). Specifically, when intra period is 32 and oriented to PSNR, our scheme outperforms H.265–HM by 14.4% bit rate saving; when oriented to MS-SSIM, our scheme outperforms H.266–VTM by 21.1% bit rate saving.

Abstract:
With the development of facial recognition technology, face anti-spoofing, as the most important security module of a face recognition system, becomes more and more important. Face anti-spoofing is still a challenging task, especially when facing various attacks simultaneously. Moreover, most current detectors mainly focus on binary classification and usually fail at fine-grained multi-class classification, which refers to distinguishing replay, print, partial mask, and full mask attacks. To fill the gap, we design a fine-grained detection network for classifying various face spoofing attack modes. First, we establish a Transformer-style network structure for feature extraction, where the convolution mapping operation is adopted instead of traditional linear mapping. Specifically, we adopt the self-attention module for extracting long-distance features, and convolution mapping is used to maintain the model's ability to extract local features. Finally, a simple yet effective linear classifier is introduced for fine-grained classification. Moreover, with the help of a VGG-based style-transfer network, a data augmentation module is proposed for solving the problem of insufficient training samples. In large-scale experiments, compared with the baseline detectors, our proposed fine-grained classifier shows its superiority for multi-class classification at low computation cost.

Abstract:
Instance segmentation is a challenging task aiming at classifying and segmenting all object instances of specific classes. While two-stage box-based methods achieve top performances in the image domain, they cannot easily extend their superiority into the video domain. This is because they usually deal with features or images cropped from the detected bounding boxes without alignment, failing to capture pixel-level temporal consistency. We embrace the observation that bottom-up methods dealing with box-free features could offer accurate spatial correlations across frames, which can be fully utilized for object and pixel level tracking. We first propose our bottom-up framework equipped with a temporal context fusion module to better encode inter-frame correlations. Intra-frame cues for semantic segmentation and object localization are simultaneously extracted and reconstructed by corresponding decoders after a shared backbone. For efficient and robust tracking among instances, we introduce an instance-level correspondence across adjacent frames, which is represented by a center-to-center flow, termed instance flow, to assemble messy dense temporal correspondences. Experiments demonstrate that the proposed method outperforms the state-of-the-art online methods (taking image-level input) on the challenging Youtube-VIS dataset (Yang et al., 2019).
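
The idea of linking instances across adjacent frames through a center-to-center flow can be sketched as follows. This is an illustrative assumption, not the paper's tracking procedure: the flow is assumed to point from an instance center in the current frame to its center in the next frame, matching is a plain nearest-center lookup without one-to-one enforcement, and the function name is hypothetical.

```python
import math


def match_by_instance_flow(centers_prev, flows, centers_next):
    """Link each instance in the previous frame to the nearest next-frame center.

    centers_prev: list of (x, y) instance centers in frame t.
    flows:        list of (dx, dy) predicted center displacements, one per instance.
    centers_next: list of (x, y) instance centers in frame t + 1.
    Returns a dict mapping previous-frame index to next-frame index.
    """
    matches = {}
    for idx, ((cx, cy), (dx, dy)) in enumerate(zip(centers_prev, flows)):
        # Warp the center with its flow, then pick the closest detected center.
        px, py = cx + dx, cy + dy
        dists = [math.hypot(px - nx, py - ny) for nx, ny in centers_next]
        matches[idx] = dists.index(min(dists))
    return matches


if __name__ == "__main__":
    prev_centers = [(10.0, 10.0), (50.0, 50.0)]
    flows = [(5.0, 0.0), (-5.0, 0.0)]
    next_centers = [(45.0, 50.0), (15.0, 10.0)]
    print(match_by_instance_flow(prev_centers, flows, next_centers))
```

A fuller tracker would typically resolve conflicts when two instances claim the same target and handle instances that appear or disappear between frames.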

Abstract:
Emotional Voice Conversion (EVC) technology aims to transfer emotional state in speech while keeping the linguistic information and speaker identity unchanged. Prior studies on EVC have been limited to performing the conversion for a specific speaker or a predefined set of multiple speakers seen in the training stage. When encountering arbitrary speakers that may be unseen during training (outside the set of speakers used in training), existing EVC methods have limited conversion capabilities. However, converting the emotion of arbitrary speakers, even those unseen during the training procedure, in one model is much more challenging and much more attractive in real-world scenarios. To address this problem, in this study, we propose SIEVC, a novel speaker-independent emotional voice conversion framework for arbitrary speakers via disentangled representation learning. The proposed method employs the autoencoder framework to disentangle the emotion information and emotion-independent information of each input speech into separated representation spaces. To achieve better disentanglement, we incorporate mutual information minimization into the training process. In addition, adversarial training is applied to enhance the quality of the generated audio signals. Finally, speaker-independent EVC for arbitrary speakers could be achieved by only replacing the emotion representations of source speech with the target ones. The experimental results demonstrate that the proposed EVC model outperforms the baseline models in terms of objective and subjective evaluation for both seen and unseen speakers.

Abstract:
The label assignment problem is a core task in object detection, which mainly focuses on how to define the positive/negative samples during the training phase. Recent works have proved that label assignment is significant for performance improvement of the detector. In this article, we propose an exquisite strategy that can dynamically assign labels according to samples' joint scores (classification and location). Moreover, our strategy can apply to both 2D and 3D monocular detectors. In our strategy, we formulate label assignment as an optimization problem. Concretely, we first calculate the classification and location costs of each sample, which are treated as points in a 2-D coordinate system. Then an optimal divider line that minimizes the sum of point-to-line distances is designed to separate the positive/negative samples. An iterative Genetic Algorithm is employed in acquiring the optimal solution. Furthermore, a GIoU auxiliary branch is devised to keep sample selection consistent during the training and testing phases. Benefitting from the non-maximum suppression (NMS) that utilizes the joint scores of classification and location, excellent detection performance is achieved. Extensive experiments conducted on MS COCO, PASCAL VOC (2D object detection), and KITTI (3D object detection) verify the effectiveness and universality of our proposed Optimal Partition Assignment (OPA).

Abstract:
Temporal modeling remains a challenge for action recognition. Most existing temporal models focus on learning local variation between neighboring frames. There exist obvious deviations between local and global variations, such as subtle and notable motion variations. In this paper, we propose a global temporal difference module for action recognition, which consists of two sub-modules, i.e., a global aggregation module and a global difference module. These two sub-modules cooperate following the idea of using prior knowledge from the global view (i.e., global motion variation) to guide local learning at each moment. In the global aggregation module, the global prior knowledge is learned by aggregating the visual feature sequence of the video into a global vector. In the global difference module, we prepare the difference vector sequence of the video by subtracting each local vector from the global vector. Our method acts as contextual guidance with a global view. The sequential dependency between these difference vectors is exploited with a channel-wise self-attention operation. Finally, the difference vectors at each timestamp are further used to enhance the semantics of the original local features. The enhanced features allow the action recognition model to understand the variation in the video globally with less deviation. We instantiate the global temporal difference module into the ResNet block to form a global temporal difference network (GTDNet). Exhaustive experiments are conducted and our method achieves competitive performance at small FLOPs on Something-Something V1 & V2 and Kinetics-400.
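
The aggregation and difference steps described in this abstract can be sketched in a few lines. This is a simplified illustration, not GTDNet: mean pooling over time stands in for the learned global aggregation (an assumption), the channel-wise self-attention and the feature enhancement step are omitted, and the function names are hypothetical.

```python
def global_vector(features):
    """Aggregate a sequence of per-frame feature vectors into one global vector.

    Mean pooling over time is an assumption standing in for the learned aggregation.
    """
    num_frames = len(features)
    return [sum(channel) / num_frames for channel in zip(*features)]


def global_differences(features):
    """Subtract each local (per-frame) vector from the global vector."""
    g = global_vector(features)
    return [[gc - fc for gc, fc in zip(g, frame)] for frame in features]


if __name__ == "__main__":
    # Three frames, two channels each.
    frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(global_vector(frames))
    print(global_differences(frames))
```

Each difference vector tells how far a frame's feature lies from the video-level summary, which is the signal the abstract describes as guidance for local learning.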

Abstract:
Image quality assessment (IQA) is very important for both end-users and service-providers since a high-quality image can significantly improve the user's quality of experience (QoE). Most existing blind image quality assessment (BIQA) models were developed for synthetically distorted images; however, they perform poorly on in-the-wild images, which are common in various practical applications. In this paper, a BIQA model is proposed that consists of a self-supervised feature learning approach to mitigate the data shortage problem and learn comprehensive feature representations, and a self-attention-based feature fusion module. We develop the image quality assessment model under the framework of contrastive learning with multiple views. Since the human visual system perceives signals through multiple channels, the most important visual information should exist among all views of the channels. So we design the cross-view consistent information mining (CVC-IM) module to extract compact mutual information between different views. Color information and pseudo-reference image (PRI) of different distortion types are employed to formulate rich feature embeddings and preserve the quality-aware fidelity of learned representations. We employ the Transformer as the self-attention-based architecture to integrate feature embeddings. Extensive experiments show that our model achieves remarkable image quality assessment results on in-the-wild IQA datasets.

Abstract:
Existing image-to-image translation approaches can deal with simple scenes or styles effectively, such as summer-to-winter, horse-to-zebra, and photo-to-map. Although great progress has been made by GAN-based methods recently, the performance of night-to-day (N2D) translation remains unsatisfactory due to imbalanced/poor visibility, which leads to translation ambiguity. To improve the quality of N2D translation, we propose an unpaired translation scheme based on a semantic prior generator, namely SPN2D-GAN, in a weakly-supervised manner with consideration of both image and semantic information. Specifically, we design a novel N2D generator, which can adopt the semantic information of images as prior knowledge to generate more reasonable and realistic results. Also, we suggest adjusting the brightness of nighttime images to boost the visibility, so that the generator can better extract content information. Moreover, the proposed SPN2D-GAN translates images by enforcing the distribution of daytime images in both image and semantic domains on final outputs. Besides, the cycle consistency is employed to preserve the fidelity between translations from two directions. Extensive experimental results are provided to reveal the effectiveness of our design, and demonstrate its superior performance over other state-of-the-art N2D translation approaches both quantitatively and qualitatively.

Affiliations: School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China; College of Computer Science and Technology/College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics, Nanjing, China; School of Computer Science, University of Adelaide, Adelaide, SA, Australia; School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing, China; College of Automation, Nanjing University of Posts and Telecommunications, Nanjing, China; School of Computer, Wuhan University, Wuhan, China

Abstract:
Visible-infrared person re-identification (VI Re-ID) is designed to match person images of the same identity from visible and infrared cameras. Transformer structures have been successfully applied in the field of VI Re-ID. However, previous Transformer-based methods were mainly designed to capture global content information in a single modality, and could not simultaneously perceive semantic information between two modalities from a global perspective. To solve this problem, we propose a novel framework named the cross-modality interaction Transformer (CMIT). It has strong abilities in modeling spatial and sequential features that can capture dependencies between long-range features, and explicitly improves the discriminativeness of features by exchanging information across modalities, thus contributing to obtaining modality-invariant representations. Specifically, CMIT utilizes a cross-modality attention mechanism to enrich the feature representations of each patch token by interacting with the patch tokens of the other modality, and aggregates local features of the CNN structure and global information of the Transformer structure to mine feature saliency representation. Furthermore, the modality-discriminative (MD) loss function is proposed to learn potential consistency between modalities to encourage intra-modality compactness within class and inter-modality separation between classes. Extensive experiments on two benchmarks demonstrate that our approach outperforms state-of-the-art methods.
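
As a rough illustration of the cross-modality attention idea described above, the following sketch lets each visible-light patch token attend over the infrared patch tokens using plain scaled dot-product attention. The function names, the absence of learned projection matrices, and the residual addition are assumptions made for brevity; this is not claimed to be CMIT's actual implementation.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: List[Vector], keys_values: List[Vector]) -> List[Vector]:
    """Enrich each query token with information from the other modality.

    queries:     patch tokens of one modality (e.g. visible), shape [N, D]
    keys_values: patch tokens of the other modality (e.g. infrared), shape [M, D]
    Hypothetical simplification: keys and values are the raw tokens.
    """
    dim = len(queries[0])
    enriched = []
    for q in queries:
        # Similarity of this token to every token of the other modality.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        # Weighted sum of the other modality's tokens.
        attended = [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        # Residual connection keeps the original token content.
        enriched.append([qi + ai for qi, ai in zip(q, attended)])
    return enriched


visible_tokens = [[1.0, 0.0], [0.0, 1.0]]
infrared_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
enriched_visible = cross_attention(visible_tokens, infrared_tokens)
```

Calling the same function with the arguments swapped would enrich the infrared tokens from the visible ones, which is one simple way to make the interaction bidirectional.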

Abstract:
Text-based person search (TBPS) is of significant importance in intelligent surveillance, which aims to retrieve pedestrian images with high semantic relevance to a given text description. This retrieval task is characterized by both modal heterogeneity and fine-grained matching. To implement this task, one needs to extract multi-scale features from both image and text domains, and then perform the cross-modal alignment. However, most existing approaches only consider the alignment confined at their individual scales, e.g., an image-sentence or a region-phrase scale. Such a strategy adopts the presumable alignment in feature extraction, while overlooking the cross-scale alignment, e.g., image-phrase. In this paper, we present a transformer-based model to extract multi-scale representations, and perform Asymmetric Cross-Scale Alignment (ACSA) to precisely align the two modalities. Specifically, ACSA consists of a global-level alignment module and an asymmetric cross-attention module, where the former aligns an image and texts on a global scale, and the latter applies the cross-attention mechanism to dynamically align the cross-modal entities in region/image-phrase scales. Extensive experiments on two benchmark datasets CUHK-PEDES and RSTPReid demonstrate the effectiveness of our approach.

Abstract:
Given a text description, Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video. TLG is inherently a challenging task, as it requires comprehensive understanding of both sentence semantics and video contents. Previous works either tackle this task in a fully-supervised setting that requires a large amount of temporal annotations or in a weakly-supervised setting that usually cannot achieve satisfactory performance. Since manual annotations are expensive, to cope with limited annotations, we tackle TLG in a semi-supervised way by incorporating self-supervised learning, and propose Self-Supervised Semi-Supervised Temporal Language Grounding (S^4TLG). S^4TLG consists of two parts: (1) A pseudo label generation module that adaptively produces instant pseudo labels for unlabeled samples based on predictions from a teacher model; (2) A self-supervised feature learning module with inter-modal and intra-modal contrastive losses to learn video feature representations under the constraints of video content consistency and video-text alignment. We conduct extensive experiments on the ActivityNet-CD-OOD and Charades-CD-OOD datasets. The results demonstrate that our proposed S^4TLG can achieve competitive performance compared to fully-supervised state-of-the-art methods while only requiring a small portion of temporal annotations.

Abstract:
Cameras usually produce low-quality images under low-light conditions. Though many methods have been proposed to enhance the visibility of low-light images, they are mainly designed for illumination correction and less capable of suppressing the artifacts. In this paper, we propose to enhance the visibility and suppress artifacts by purifying low-light images under the guidance of the NIR enlightened image captured by using the near-infrared light as compensation. Specifically, we introduce a disentanglement framework to disentangle the structure and color components from the NIR enlightened and RGB images, respectively. Correspondingly, we introduce a new dataset with the RGB and NIR enlightened images for training and evaluation purposes. The experimental results show that our proposed method achieves promising results.

Abstract:
Video frame interpolation has made great progress in estimating advanced optical flow and synthesizing in-between frames sequentially. However, frame interpolation involving various resolutions and motions remains challenging due to limited or fixed pre-trained networks. Inspired by the success of the coarse-to-fine scheme for video frame interpolation, i.e., gradually interpolating frames of different resolutions, we propose a progressive boosting network (ProBoost-Net) based on a multi-scale framework to achieve flexible recurrent scales and then gradually optimize optical flow estimation and frame interpolation. Specifically, we designed a dense motion boosting (DMB) module to transfer features close to real motion to the decoded features from the later scales, which provides complementary information to refine the motion further. Furthermore, to ensure the accuracy of the estimated motion features at each scale, we propose a motion adaptive fusion (MAF) module that adaptively deals with motions with different receptive fields according to the motion conditions. Thanks to the framework's flexible recurrent scales, we can customize the number of scales and make trade-offs between computation and quality depending on the application scenario. Extensive experiments with various datasets demonstrated the superiority of our proposed method over state-of-the-art approaches in various scenarios.

Abstract:
Model compression is an essential step for large-scale pre-training models toward practical application and deployment on the edge device. However, when conventional compression methods following ‘pre-training then compressing’ two-phase pipeline are applied to Vision-and-Language Pre-training (VLP) models, it will lead to a high calculation and memory overhead. In this work, we break the two-phase pipeline and propose an efficient and effective one-phase VLP model compression mechanism, named REDUCER, which stands for ‘simultaneously training and compREssing’ VLP model via progressive moDUle replaCing and nEtwork Rewiring. Specifically, REDUCER consists of three insightful designs. Firstly, we design a one-phase compression framework to train and compress the VLP model simultaneously to avoid the extra calculation and memory cost caused by an isolated model compression phase in the conventional two-phase pipeline. Secondly, we propose an adaptive progressive module replacing mechanism to compress the model depth free from explicit knowledge distillation losses, relieving the multi-task optimization problems. Thirdly, we integrate pruning techniques into VLP model compression to simultaneously compress the model in width and depth. Overall, we obtain a lightweight VLP model with only one pre-training phase, and it is the first one-phase compression method for VLP models. Extensive experiments have been conducted on representative VLP models, i.e., ClipBERT and VICTOR, and the experimental results show a superior trade-off between performance and efficiency.

Abstract:
The ever-increasing demand for immersive applications has made point cloud an important data type for 3D processing. Tree-based data structures are commonly used for representing point clouds where memory pointers are used to realize the connection among points. The significant cost of data storage and irregular access patterns for processing points make such data structures largely inefficient. In this paper, we examine a point cloud representation using compressed geometric arrays (CGA) that reduces the size of point cloud and limits the amount of memory indirection. Our experimental results on a set of critical point cloud operations indicate 998× speed-up, 410× better bandwidth utilization, and 58% storage reduction for CGA over the state-of-the-art point cloud library (PCL).

Abstract:
Recently, few-shot object detection (FSOD) has become an increasing research focus, which can largely alleviate the heavy dependency on expensive annotations in the traditional object detection task. However, existing FSOD approaches fail to generate sufficient high-quality positive region proposals, which are the key to detection performance, due to the lack of informative knowledge from base classes and non-specific alteration for novel classes. To address the problem, this paper presents a simple yet effective few-shot object detection framework referred to as Temporal Speciation Network (TeSNet) with evolving training, which improves the diversity and rationality of positive proposal generation. Our TeSNet, imitating natural evolution, which relies on inheritance and mutation, correspondingly consists of two key components: a Selective Recombination Module (SRM) for effectively inheriting from base classes and a Mutational Region Proposal Network (MRPN) for flexibly mutating according to the unique traits of novel samples. Specifically, SRM selects and reorganizes relevant base categories, and further instantiates diverse individuals to ensure the diversity of positive proposals. MRPN adapts the parameters trained on base classes, aiming to accurately locate positive proposals. Extensive experiments are conducted on several commonly-used datasets, in which our TeSNet achieves state-of-the-art results and outperforms baselines by a large margin.

Abstract:
With the rapid development of stereoscopic vision applications, stereo image processing techniques have attracted increasing attention in both academic and industrial communities. In this paper, we study the fundamental stereo image super-resolution (SR) problem, which aims to recover high-resolution stereo images from low-resolution (LR) stereo images. Since disparities between stereo images vary significantly, convolutional network-based stereo image SR methods show a limitation in capturing long-range dependencies. To address this problem, this paper proposes to leverage the capability of self-attention in Transformers to efficiently capture reliable stereo correspondence and incorporate cross-view information for stereo image SR. Our model, named Steformer, consists of three parts: cross attentive feature extraction, cross-to-intra information integration and high-quality image reconstruction. In particular, the cross attentive feature extraction module employs residual cross Steformer blocks (RCSB) for long-range cross-view information extraction. Then, the cross-to-intra information integration module exploits cross-view and intra-view information using cross-to-intra attention mechanism (C2IAM). Finally, residual Steformer blocks (RSB) are designed for feature pre-processing in high-quality image reconstruction. Extensive experiments show that Steformer achieves significant improvements over state-of-the-art approaches on both quantitative and qualitative evaluations, while the total number of parameters can be reduced by up to 40.71%.

Abstract:
Pretraining large models on generous multi-modal corpora has accelerated the development of visual-linguistic (VL) representation and achieved great success on various vision-and-language downstream tasks. Learning these models is usually executed by predicting the randomly masked words of captions or patches in images. Such approaches, nevertheless, seldom explore the supervision of causalities behind the caption descriptions or the procedure of generating events beyond still images. In this work, we endow the pretrained models with high-level cognition by delving into dynamic contexts to model the visual and linguistic causalities uniformly. Specifically, we format the dynamic contexts of an image as the sentences describing the events before, on, and after image. Unlike traditional caption-wise similarity, we propose a novel dynamic contexts-based similarity (DCS) metric, in which the correlation of potential causes and effects besides immediate visual content are considered to measure the relevance among images. DCS can be further simplified by parameterizing event continuity to relax the requirements on dense contextual event annotations. A new pre-task is designed to minimize the feature distances of dynamically contextual relevant images and incorporate the event causality and commonsense knowledge into the VL representation learning. Models based on our dynamic contexts significantly outperform typical VL models on multiple cross-modal downstream tasks, including the conventional visual commonsense reasoning (VCR), visual question answering (VQA), zero-shot image-text retrieval, and extended image / event ordering tasks.

Abstract:
For Deepfake detection, residual-based features can preserve tampering traces and suppress irrelevant image content. However, inappropriate residual prediction brings side effects on detection accuracy. Meanwhile, residual-domain features are easily affected by some image operations such as lossy compression. Most existing works exploit either spatial-domain or residual-domain features, which are fed into the backbone network for feature learning. Actually, both types of features are mutually correlated. In this work, we propose an adaptive fusion based guided residuals network (AdapGRnet), which fuses spatial-domain and residual-domain features in a mutually reinforcing way, for Deepfake detection. Specifically, we present a fine-grained manipulation trace extractor (MTE), which is a key module of AdapGRnet. Compared with the prediction-based residuals, MTE can avoid the potential bias caused by inappropriate prediction. Moreover, an attention fusion mechanism (AFM) is designed to selectively emphasize feature channel maps and adaptively allocate the weights for two streams. Experimental results show that AdapGRnet achieves better detection accuracy than state-of-the-art works on four public fake face datasets including HFF, FaceForensics++, DFDC and CelebDF. In particular, AdapGRnet achieves an accuracy of up to 96.52% on the HFF-JP60 dataset, an improvement of about 5.50%. That is, AdapGRnet achieves better robustness than the existing works.

Abstract:
Multi-modal hashing technology can support large-scale multimedia retrieval well, because of its fast query speed and low storage consumption. Although many multi-modal hashing methods have been developed in the literature, they still suffer from three important problems: 1) Most multi-modal hashing methods assume that multi-modal data are complete. This ideal assumption limits their application to practical retrieval scenarios, where the modality-missing is common. 2) Existing partial multi-modal hashing methods directly model incomplete multi-modal data for hash learning, which may result in partial multi-modal semantics in the learned hash codes. 3) Most of the methods are based on the shallow learning framework, which inevitably suffers from limited representation capability. To solve the above problems, we propose a flexible deep partial multi-modal hash learning framework, named Neighbor-aware Completion Hashing (NCH). Our framework jointly performs the cross-modal completion learning for incomplete multi-modal data and the multi-modal hash learning. It can not only support model training with incomplete multi-modal data but also handle incomplete multi-modal queries. Besides, we design a neighbor-aware completion learning module to capture neighbor semantics and generate distribution-consistent completed features. Finally, we conduct extensive experiments to evaluate our method on both fully-paired and partial multi-modal retrieval scenarios. The experimental results verify the superiority of our proposed method over the state-of-the-art baselines.

Abstract:
We propose a novel Text-to-Image Generation Network, Adaptive Layout Refinement Generative Adversarial Network (ALR-GAN), to adaptively refine the layout of synthesized images without any auxiliary information. The ALR-GAN includes an Adaptive Layout Refinement (ALR) module and a Layout Visual Refinement (LVR) loss. The ALR module aligns the layout structure (which refers to locations of objects and background) of a synthesized image with that of its corresponding real image. In the ALR module, we propose an Adaptive Layout Refinement (ALR) loss to balance the matching of hard and easy features, for more efficient layout structure matching. Based on the refined layout structure, the LVR loss further refines the visual representation within the layout area. Experimental results on two widely-used datasets show that ALR-GAN performs competitively at the Text-to-Image generation task.

Abstract:
Video-text retrieval is a fundamental task in managing the emerging massive amounts of video data. The main challenge focuses on learning a common representation space for videos and queries where the similarity measurement can reflect the semantic closeness. However, existing video-text retrieval models may suffer from the following noise in the common space learning procedure: First, the video-text correspondences in positive pairs may not be exact matches. The crowdsourcing annotation for existing datasets leads to inevitable tagging noise for non-expert annotators. Second, the learning of video-text representation is based on the negative samples randomly sampled. Instances that are semantically similar to the query may be incorrectly categorized as negative samples. To alleviate the adverse impact of these noisy pairs, we propose a novel robust video-text retrieval method that protects the model from noisy positive and negative pairs by identifying and calibrating noisy pairs with their uncertainty score. In particular, we propose a noisy pair identifier, which divides the training dataset into noisy and clean subsets based on the estimated uncertainty of each pair. Then, with the help of uncertainties, we calibrate the two types of noisy pairs with an adaptive margin triplet loss and a weighted triplet loss function, respectively. To verify the effectiveness of our methods, we conduct extensive experiments on three widely used datasets. Experimental results show that the proposed robust video-text retrieval methods successfully identify and calibrate the noisy pairs and improve retrieval performance.
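
To make the two calibration losses named above more concrete, here is a minimal sketch of how a per-pair uncertainty score could modulate a triplet loss. The abstract does not give the formulas, so the specific forms below (a margin that shrinks linearly with uncertainty, and a loss scaled by one minus the uncertainty), the function names, and the default margin of 0.2 are all assumptions for illustration only.

```python
def triplet_loss(pos_sim: float, neg_sim: float, margin: float) -> float:
    """Standard hinge-based triplet loss on similarity scores."""
    return max(0.0, margin + neg_sim - pos_sim)


def adaptive_margin_triplet_loss(pos_sim: float, neg_sim: float,
                                 uncertainty: float, base_margin: float = 0.2) -> float:
    """Hypothetical form: the required margin shrinks as uncertainty grows.

    uncertainty is assumed to lie in [0, 1]; at 1 the pair is no longer
    forced to be separated by any margin.
    """
    return triplet_loss(pos_sim, neg_sim, base_margin * (1.0 - uncertainty))


def weighted_triplet_loss(pos_sim: float, neg_sim: float,
                          uncertainty: float, margin: float = 0.2) -> float:
    """Hypothetical form: an uncertain pair contributes less to the loss."""
    return (1.0 - uncertainty) * triplet_loss(pos_sim, neg_sim, margin)


# A confident pair keeps the full penalty, an uncertain one is softened.
confident = weighted_triplet_loss(pos_sim=0.4, neg_sim=0.5, uncertainty=0.0)
uncertain = weighted_triplet_loss(pos_sim=0.4, neg_sim=0.5, uncertainty=0.9)
```

In both variants the uncertainty acts as a soft switch: a pair flagged as likely noisy still participates in training, but it pulls less strongly on the shared embedding space.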

Abstract:
As an important visual understanding task, scene graph generation has been drawing widespread attention and could boost a broad range of downstream vision applications. Traditional scene graph generation methods based on different context refinements are trained with the probabilistic chain rule, which treats objects and relationships as independent entities. Despite their surprisingly great progress, such a plain formulation ignores the latent geometric structure of entities and relationships. To address this issue, we move beyond the traditional real-valued representations and use Quaternion Relation Embedding (QuatRE) to generate scene graphs with more expressive hypercomplex representations. More specifically, we introduce the concept of quaternion representations, hyper-complex valued with three imaginary components, for object entities, then formulate the relation triplets with the Hamilton product. Benefiting from explicitly modeling the latent inter-dependencies among all imaginary components and strong expressive capacity, our proposed QuatRE method could better capture the interactions between entities. More importantly, our novel QuatRE method can be treated as a plug-in and generalizes well to other methods for performance improvement as it involves no additional layers. Finally, extensive comparisons of our proposed method against the state-of-the-art methods on two large-scale and widely-used datasets, i.e., Visual Genome and Open Images, demonstrate its superiority and generalization capability on various metrics for biased or unbiased inference.
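
For readers unfamiliar with the Hamilton product mentioned above, the sketch below implements it for a single quaternion stored as a `(real, i, j, k)` tuple. It shows only the standard algebraic operation; how QuatRE arranges subject, relation, and object embeddings around it is not specified in the abstract and is not modeled here. The type alias and function name are illustrative.

```python
from typing import Tuple

# A quaternion a + b*i + c*j + d*k stored as (a, b, c, d).
Quaternion = Tuple[float, float, float, float]


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Multiply two quaternions using i^2 = j^2 = k^2 = ijk = -1."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


i_unit = (0, 1, 0, 0)
j_unit = (0, 0, 1, 0)

# The product is not commutative: i*j = k but j*i = -k.
ij = hamilton_product(i_unit, j_unit)   # (0, 0, 0, 1)
ji = hamilton_product(j_unit, i_unit)   # (0, 0, 0, -1)
```

Every output component mixes all four components of both inputs, which is the kind of interaction among imaginary components that the abstract refers to.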

Abstract:
With the popularization of ultra high definition (UHD) high dynamic range (HDR) displays, recent works focus on upgrading high definition (HD) standard dynamic range (SDR) videos to UHD-HDR versions, aiming to provide richer details and higher contrasts on advanced modern displays. However, jointly considering the upgrading and downgrading translations between the two types of videos, which is practical in real applications, is generally neglected. On the one hand, downgrading translation is the key to showing UHD-HDR videos on HD-SDR displays. On the other hand, considering both translations enables joint optimization and results in high-quality translation. To this end, we propose the bidirectional translation network (BiT-Net), which jointly considers the two translations in one network for the first time. In brief, BiT-Net is designed in an invertible fashion so that it can be efficiently inferred along forward and backward directions for downgrading and upgrading tasks, respectively. Based on this framework, we divide each direction into three sub-tasks, i.e., decomposition, structure-guided translation, and synthesis, to effectively translate the dynamic range and the high-frequency details. Benefiting from the dedicated architecture, our BiT-Net can work on 1) downgrading UHD-HDR videos, 2) upgrading existing HD-SDR videos, and 3) synthesizing UHD-HDR versions from the downgraded HD-SDR videos. Experiments show that the proposed method achieves state-of-the-art performance on all three tasks.

Abstract:
Nowadays, two-dimensional (2D) barcodes are widely used in various domains, and a series of aesthetic 2D barcode schemes have been proposed to improve the visual quality and readability of 2D barcodes for better integration with marketing materials. Yet we believe that the existing aesthetic 2D barcode schemes are only partially aesthetic because they beautify the data area but retain the position detection patterns with the black-white appearance of traditional 2D barcode schemes. Thus, in this paper, we propose the first overall aesthetic 2D barcode scheme, called OAcode, in which the position detection pattern is removed. Its detection process is based on the pre-designed symmetrical data area of OAcode, whose symmetry can be used as the calibration signal to restore the perspective transformation in the barcode scanning process. Moreover, an enhanced demodulation method is proposed to resist the lens distortion common in the camera-shooting process. The experimental results illustrate that when a 5 × 5 cm OAcode is captured at a resolution of 720 × 1280 pixels, at a screen-camera distance of 10 cm and an angle of at most 25°, OAcode has a 100% detection rate and 99.5% demodulation accuracy. A 10 × 10 cm OAcode can be extracted by consumer-grade mobile phones at a distance of 90 cm with around 90% accuracy.

Abstract:
Dense captioning generates more detailed spoken descriptions for complex visual scenes. Despite several promising leads, existing methods still have two broad limitations: 1) The vast majority of prior arts only consider visual contextual clues during captioning but ignore potentially important textual context; 2) current imbalanced learning mechanisms limit the diversity of vocabulary learned from the dictionary, thus giving rise to low language-learning efficiency. To alleviate these gaps, in this paper, we propose an end-to-end enhanced dense captioning architecture, namely Enhanced Transformer Dense Captioner (ETDC), which obtains textual context from surrounding regions and dynamically diversifies the vocabulary bank during captioning. Concretely, we first propose the Textual Context Module (TCM), which is integrated into each self-attention layer of the Transformer decoder, to capture the surrounding textual context. Moreover, we take full advantage of the class information of object context and propose a Dynamic Vocabulary Frequency Histogram (DVFH) re-sampling strategy during training to balance words with different frequencies. The proposed method is tested on the standard dense captioning datasets and surpasses the state-of-the-art methods in terms of mean Average Precision (mAP).

Abstract:
Semi-supervised learning investigates how to improve the performance of a visual learning model when data annotation is far from sufficient. Recent works in semi-supervised deep learning have successfully applied consistency regularization, which encourages a model to maintain consistent predictions for different perturbed versions of an image. However, most such methods ignore the category correlation of image features, especially when exploiting strong augmentation methods for unlabeled images. To address this problem, we propose PConMatch, a model that leverages a probabilistic contrastive learning framework to separate the features of strongly-augmented versions from different classes. A semi-supervised probabilistic contrastive loss is designed, which takes both labeled and unlabeled samples into account and develops an auxiliary module to generate a probability score to measure the model prediction confidence for each sample. Specifically, PConMatch first generates a pair of weakly-augmented versions for each labeled sample, and produces a weakly-augmented version and a corresponding pair of strongly-augmented versions for each unlabeled sample. Second, a probability score module is proposed to assign pseudo-labeling confidence scores to strongly-augmented unlabeled images. Finally, the probability score of each sample is further passed to the contrastive loss, which is combined with consistency regularization to enable the model to learn better feature representations. Extensive experiments on four publicly available image classification benchmarks demonstrate that the proposed approach achieves state-of-the-art performance in image classification. Several rigorous ablation studies are conducted to validate the effectiveness of the method.
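
The consistency regularization that the abstract builds on can be sketched for a single unlabeled image as follows. This is a generic, FixMatch-style formulation (hard pseudo-label from the weak view, cross-entropy on the strong view, fixed confidence threshold) chosen as an assumption for illustration; it does not include PConMatch's probability score module or contrastive loss, and the function name and threshold value are hypothetical.

```python
import math
from typing import List


def consistency_loss(weak_probs: List[float], strong_probs: List[float],
                     threshold: float = 0.95) -> float:
    """Penalize disagreement between weak and strong views of one image.

    weak_probs / strong_probs: class probabilities predicted for the weakly
    and strongly augmented versions of the same unlabeled image.
    """
    confidence = max(weak_probs)
    pseudo_label = weak_probs.index(confidence)
    # Low-confidence predictions are ignored rather than trusted as labels.
    if confidence < threshold:
        return 0.0
    # Cross-entropy of the strong view against the pseudo-label.
    return -math.log(strong_probs[pseudo_label])


# Confident weak prediction, strong view disagrees: a positive loss.
loss_disagree = consistency_loss([0.99, 0.01], [0.5, 0.5])
# Uncertain weak prediction: the sample is skipped.
loss_skipped = consistency_loss([0.6, 0.4], [0.1, 0.9])
```

A hard threshold discards every sample below the cutoff; replacing it with a per-sample confidence score, as the abstract describes, is one way to let such samples contribute in a graded fashion.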

Abstract:
As a de facto solution, the vanilla Vision Transformers (ViTs) are encouraged to model long-range dependencies between arbitrary image patches while the global attended receptive field leads to quadratic computational cost. Another branch of Vision Transformers exploits local attention inspired by CNNs, which only models the interactions between patches in small neighborhoods. Although such a solution reduces the computational cost, it naturally suffers from small attended receptive fields, which may limit the performance. In this work, we explore effective Vision Transformers to pursue a preferable trade-off between the computational complexity and size of the attended receptive field. By analyzing the patch interaction of global attention in ViTs, we observe two key properties in the shallow layers, namely locality and sparsity, indicating the redundancy of global dependency modeling in shallow layers of ViTs. Accordingly, we propose Multi-Scale Dilated Attention (MSDA) to model local and sparse patch interaction within the sliding window. With a pyramid architecture, we construct a Multi-Scale Dilated Transformer (DilateFormer) by stacking MSDA blocks at low-level stages and global multi-head self-attention blocks at high-level stages. Our experimental results show that our DilateFormer achieves state-of-the-art performance on various vision tasks. On the ImageNet-1K classification task, DilateFormer achieves comparable performance with 70% fewer FLOPs compared with existing state-of-the-art models. Our DilateFormer-Base achieves 85.6% top-1 accuracy on the ImageNet-1K classification task, 53.5% box mAP/46.1% mask mAP on the COCO object detection/instance segmentation task and 51.1% MS mIoU on the ADE20K semantic segmentation task. The code is available at https://isee-ai.cn/~jiaojiayu/DilteFormer.html.
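
The trade-off described above between computational cost and attended receptive field can be illustrated with simple counting. The sketch below compares the number of query-key pairs under global attention and under a k × k sliding window, and shows how dilation widens the span covered by a fixed-size window. The 56 × 56 token grid, the kernel size of 3, and the helper names are assumptions for illustration; the neighbor helper is one-dimensional for readability.

```python
from typing import List


def global_attention_pairs(num_tokens: int) -> int:
    """Every token attends to every token: quadratic in the token count."""
    return num_tokens * num_tokens


def windowed_attention_pairs(num_tokens: int, kernel: int) -> int:
    """Every token attends to a kernel x kernel window: linear in the token count."""
    return num_tokens * kernel * kernel


def dilated_span(kernel: int, dilation: int) -> int:
    """Side length of the region covered by a dilated window."""
    return (kernel - 1) * dilation + 1


def dilated_neighbors(center: int, kernel: int, dilation: int, length: int) -> List[int]:
    """1D indices attended by `center`, dropping positions outside the sequence."""
    half = kernel // 2
    candidates = [center + offset * dilation for offset in range(-half, half + 1)]
    return [idx for idx in candidates if 0 <= idx < length]


tokens = 56 * 56                                      # hypothetical feature-map size
pairs_global = global_attention_pairs(tokens)         # 9834496
pairs_local = windowed_attention_pairs(tokens, 3)     # 28224
span_plain = dilated_span(kernel=3, dilation=1)       # 3
span_dilated = dilated_span(kernel=3, dilation=3)     # 7
```

With the same nine keys per query, a dilation of 3 covers a 7 × 7 region instead of 3 × 3, so the attended receptive field grows without increasing the number of query-key pairs.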

Abstract:
Cross-modal hashing has been widely used in multimedia retrieval tasks due to its fast retrieval speed and low storage cost. Recently, many deep unsupervised cross-modal hashing methods have been proposed to deal with unlabeled datasets. These methods usually construct an instance similarity matrix by fusing the image and text modality-specific similarity matrices as the guiding information to train the hashing networks. However, most of them directly use cosine similarities between the bag-of-words (BoW) vectors of text datapoints to define the text modality-specific similarity matrix, which fails to mine the semantic similarity information contained in the text datapoints and leads to a poor-quality instance similarity matrix. To tackle this problem, in this paper, we propose a novel Unsupervised Cross-modal Hashing via Semantic Text Mining, called UCHSTM. Specifically, UCHSTM first mines the correlations between the words of text datapoints. Then, UCHSTM constructs the text modality-specific similarity matrix for the training instances based on the mined correlations between their words. Next, UCHSTM fuses the image and text modality-specific similarity matrices as the final instance similarity matrix to guide the training of the hashing model. Furthermore, during the training of the hashing networks, a novel self-redefined-similarity loss is proposed to correct wrongly defined similarities in the constructed instance similarity matrix, thereby further enhancing the retrieval performance. Extensive experiments on two widely used datasets show that the proposed UCHSTM outperforms state-of-the-art baselines on cross-modal retrieval tasks.
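
The baseline construction criticized above (cosine similarity between BoW vectors, then fusion with an image similarity matrix) can be sketched as follows. The toy vocabulary, the convex-combination fusion with weight `alpha`, and all function names are assumptions for illustration; the abstract does not state how the two matrices are fused, and UCHSTM's word-correlation mining is not reproduced here.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(features: Sequence[Sequence[float]]) -> Matrix:
    """Pairwise cosine similarities between all instances."""
    return [[cosine_similarity(a, b) for b in features] for a in features]


def fuse_similarities(image_sim: Matrix, text_sim: Matrix, alpha: float = 0.5) -> Matrix:
    """Hypothetical fusion rule: convex combination of the two matrices."""
    return [
        [alpha * i + (1.0 - alpha) * t for i, t in zip(image_row, text_row)]
        for image_row, text_row in zip(image_sim, text_sim)
    ]


# Toy vocabulary: ["dog", "puppy", "car"].
text_a = [1, 0, 0]   # mentions "dog"
text_b = [0, 1, 0]   # mentions "puppy"
# The two texts share no word, so BoW cosine similarity is 0.0
# even though the words are semantically close.
bow_similarity = cosine_similarity(text_a, text_b)
```

The last lines show the limitation the abstract points out: word overlap alone treats "dog" and "puppy" as unrelated, which is the gap that mining correlations between words is meant to close.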

Abstract:
Domain adaptation aims to transfer knowledge from a label-rich source domain to an unlabeled target domain. A common strategy is to assign pseudo-labels to unlabeled target samples for performing representation learning. However, most existing methods only apply the source-guided classifier to generate the source-biased pseudo-labels for self-training, leading to biased target representations. Moreover, the generated pseudo-labels ignore the manifold assumption that neighboring samples are likely to have the same labels. To address the above problem, we formulate a novel structural knowledge to assign target-oriented and manifold-guided pseudo-labels for unlabeled target samples. The structural knowledge consists of cluster-based knowledge and locality-based knowledge. The cluster-based knowledge denotes the label consistency between the target samples and the non-parametric target cluster centers, making the pseudo-labels target-oriented. The locality-based knowledge constrains the target sample and its neighbors to satisfy the manifold assumption. As the neighbors contain the source and target samples, the source and target locality-based knowledge are utilized to boost the descriptions. With the structural knowledge, we propose a novel Dual Structural Knowledge Interaction (DSKI) framework for domain adaptation. For generating aligned and discriminative features, knowledge consistency constraint and instance mutual constraint are proposed in DSKI. Evaluations on three benchmarks demonstrate the effectiveness of the Dual Structural Knowledge Interaction, e.g., 74.9%, 87.7%, and 90.8% for Office-Home, VisDa-2017, and Office-31, respectively.
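
The two kinds of structural knowledge described above can be pictured with a small sketch: a cluster-based pseudo-label taken from the nearest target cluster center, and a locality-based check that looks at the labels of a sample's nearest neighbors. The use of squared Euclidean distance, the majority vote, and the function names are assumptions for illustration; the DSKI framework itself and its constraints are not reproduced.

```python
from collections import Counter
from typing import List, Sequence

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_pseudo_label(sample: Point, centers: Sequence[Point]) -> int:
    """Cluster-based knowledge: label of the closest target cluster center."""
    return min(range(len(centers)), key=lambda c: squared_distance(sample, centers[c]))


def neighbor_vote(index: int, samples: Sequence[Point], labels: Sequence[int], k: int) -> int:
    """Locality-based knowledge: majority label among the k nearest neighbors."""
    others = [i for i in range(len(samples)) if i != index]
    others.sort(key=lambda i: squared_distance(samples[index], samples[i]))
    votes = Counter(labels[i] for i in others[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 2D target features and two cluster centers.
centers = [[0.0, 0.0], [10.0, 10.0]]
samples = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [9.0, 10.0], [10.0, 9.0]]
pseudo_labels = [cluster_pseudo_label(s, centers) for s in samples]
# Under the manifold assumption, a sample's label should agree with its neighbors.
agrees = neighbor_vote(2, samples, pseudo_labels, k=2) == pseudo_labels[2]
```

Because the centers are computed from target features rather than predicted by a source-trained classifier, labels assigned this way are less tied to the source domain, which is the motivation the abstract gives for target-oriented pseudo-labels.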

Abstract:
Deep hashing has proven to be efficient and effective for large-scale face retrieval. However, existing hashing methods are designed for normal face images only. They fail to consider the fact that face images may be occluded because of wearing masks, hats, glasses, etc. Retrieval performance of existing face retrieval methods is much worse when dealing with occluded face images. In this work, we propose the knowledge distillation hashing (KDH) to deal with occluded face images. The KDH is a two-stage learning approach with teacher-student model distillation. We first train a teacher hashing network using normal face images and then the knowledge from teacher model is used to guide the optimization of the student model using occluded face images as input only. With knowledge distillation, we build a connection between imperfect face information and the optimal hash codes. Experimental results show that the KDH yields significant improvements and better retrieval performance in comparison to existing state-of-the-art deep hashing retrieval methods under six different face occlusion situations.

Abstract:
Recent works on language-guided image manipulation have shown the great power of language in providing rich semantics, especially for face images. However, motion, another kind of information naturally carried by language, is less explored. In this article, we leverage the motion information and study a novel task, language-guided face animation, that aims to animate a static face image with the help of language. To better utilize both semantics and motions from language, we propose a simple yet effective framework. Specifically, we propose a recurrent motion generator to extract a series of semantic and motion information from the language and feed it along with visual information to a pre-trained StyleGAN to generate high-quality frames. To optimize the proposed framework, three carefully designed loss functions are proposed, including a regularization loss to keep the face identity, a path length regularization loss to ensure motion smoothness, and a contrastive loss to enable video synthesis with various language guidance in one single model. Extensive experiments with both qualitative and quantitative evaluations on diverse domains (e.g., human face, anime face, and dog face) demonstrate the superiority of our model in generating high-quality and realistic videos from one still image with the guidance of language.

Abstract:
Fixation prediction aims to simulate human visual selection mechanism and estimate the visual saliency degree of regions in a scene. In semantically rich scenes, there are generally multiple salient regions. This condition requires a fixation prediction model to understand the relative importance relationship of multiple salient regions, that is, to identify which region is more important. In practice, existing fixation prediction models implicitly explore the relative importance relationship in the end-to-end training process while they do not work well. In this article, we propose a novel Relative Importance-aware Network (RINet) to explicitly explore the modeling of relative importance in fixation prediction. RINet perceives multi-scale local and global relative importance through the Hierarchical Relative Importance Enhancement (HRIE) module. Within a single scale subspace, on the one hand, HRIE module regards the similarity matrix as the local relative importance map to weight the input feature. On the other hand, HRIE module integrates a set of local relative importance maps into one map, defined as the global relative importance map, to grasp global relative importance. Moreover, we propose a Complexity-Relevant Focal (CRF) loss for network training. As such, we can progressively emphasize learning difficult samples for better handling the complicated scenarios, further improving the performance. The ablation studies confirm the contributions of key components of our RINet, and extensive experiments on five datasets demonstrate our RINet is superior to 28 relevant state-of-the-art models.

Abstract:
This paper proposes a novel method, named Refined Knowledge Transfer (RKT), for language-based person search. Existing state-of-the-art methods do not deal with knowledge imbalance between image and text. In detail, textual identity knowledge is limited, but the image contains more identity knowledge. We propose Cross-Modal Knowledge Transfer (CMKT) to enhance textual identity knowledge by image to address this problem. Besides, multiple texts of one image include more identity knowledge than a single text. Thus, we propose Intra-Modal Knowledge Transfer (IMKT) to enhance textual identity knowledge by other texts. These two types of knowledge transfer will enhance the identity knowledge in text. Additionally, by considering that identity-irrelevant knowledge is transferred to text, we propose Knowledge Refiner (KR) to refine the knowledge in text. KR is capable of preserving identity knowledge and discarding identity-irrelevant knowledge. By combining CMKT, IMKT, and KR, RKT makes textual identity knowledge more salient. Extensive experiments show the state-of-the-art performance of RKT on the CUHK-PEDES and our proposed PRW-PEDES-CN datasets. In addition, the decent generalization ability of RKT is also validated on the Flickr30K, CUB, and Flowers datasets.

Abstract:
Multimedia-based recommendation is a challenging task that requires not only learning collaborative signals from user-item interaction, but also capturing modality-specific user interest clues from complex multimedia content. Though significant progress on this challenge has been made, we argue that current solutions remain limited by multimodal noise contamination. Specifically, a considerable proportion of multimedia content is irrelevant to the user preference, such as the background, overall layout, and brightness of images; the word order and semantic-free words in titles; etc. We regard this irrelevant information as noise that contaminates the discovery of user preferences. Moreover, most recent research has been conducted by graph learning. This means that noise is diffused into the user and item representations with the message propagation, so the contamination influence is further amplified. To tackle this problem, we develop a novel framework named Multimodal Graph Contrastive Learning (MGCL), which captures collaborative signals from interactions and uses visual and textual modalities to respectively extract modality-specific user preference clues. The key idea of MGCL involves two aspects: First, to alleviate noise contamination during graph learning, we construct three parallel graph convolution networks to independently generate three types of user and item representations, containing collaborative signals, visual preference clues, and textual preference clues. Second, to eliminate as much preference-independent noisy information as possible from the generated representations, we incorporate sufficient self-supervised signals into the model optimization with the help of contrastive learning, thus enhancing the expressiveness of the user and item representations. Extensive experiments validate the effectiveness and scalability of MGCL (https://github.com/hfutmars/MGCL).

Abstract:
Three-dimensional reconstruction is a multimedia technology widely used in computer-aided modeling and 3D animation. Nevertheless, it is still hard for reconstruction methods to overcome the 3D geometry missing and the object occlusion in the single-view images. In this article, we propose a novel method (CPG3D) for reconstructing high-quality 3D shapes from a single image under the guidance of prior knowledge. Using the single-view image as the query, prior knowledge is collected from public 3D datasets, which can compensate for missing 3D geometries and assist the 3D reconstruction network to high fidelity results. Our method consists of three parts: 1) Cross-modal 3D shape retrieval module: This part retrieves related 3D shapes based on 2D images. Here, we apply the pre-trained model to guarantee the correlation between the retrieved 3D shape and the input image. 2) Multimodal information fusion module: We propose a multimodal attention mechanism to handle the information fusing of 2D visual and 3D structural information; 3) Three-dimensional reconstruction module: We propose a novel encoder-decoder network for 3D shape reconstruction. Specifically, we employ the skip connection operation to link the target image's visual information with the 3D model's structural information to enhance the prediction of 3D details. During training, we employ two carefully designed loss functions to lead the multimodal learning to obtain proper modal features. On the ShapeNet and Pix3D datasets, the final experimental results reveal that our method notably increases reconstruction quality and outperforms SOTA methods.

Abstract:
Temporal action localization aims at detecting the temporal intervals of human actions in untrimmed videos. Most previous methods rely on locating and matching the start and end times of actions. However, action boundaries are ambiguous and uncertain in nature, which leads to inaccurate action localization and a lot of false positives. In this paper, we introduce a new framework for temporal action localization. It explicitly models temporal action centers to reduce unreliable action detection results caused by ambiguous action boundaries. Since action centers are highly related to semantic actions, they can be detected more reliably than the conventional action boundaries. As a result, our framework can exclude false positives and promote high-quality proposals. Based on action centers, we propose a triplet feature fusion mechanism. It performs neural message passing among the boundaries and the center as well as contextual regions outside of the proposal to enrich its representation. In addition, we introduce a centerness scoring method to suppress proposals deviating from the centers of action instances. Consequently, our network can retrieve high-quality action proposals and locate actions more precisely. Experimental results show our method outperforms state-of-the-art methods on the THUMOS14 and ActivityNet v1.3 datasets.

Abstract:
Shadow removal, which aims to restore the illumination in shadow regions, is challenging due to the diversity of shadows in terms of location, intensity, shape, and size. Different from most multi-task methods, which design elaborate multi-branch or multi-stage structures for better shadow removal, we introduce feature decomposition to learn better feature representations. Specifically, we propose a single-stage and decoupled multi-task network (DMTN) to explicitly learn the decomposed features for shadow removal, shadow matte estimation, and shadow image reconstruction. First, we propose several coarse-to-fine semi-convolution (SMC) modules to capture features sufficient for joint learning of these three tasks. Second, we design a theoretically supported feature decoupling layer to explicitly decouple the learned features into shadow image features and shadow matte features via weight reassignment. Last, these features are converted to a target shadow-free image, affiliated shadow matte, and shadow image, supervised by multi-task joint loss functions. With multi-task collaboration, DMTN effectively recovers the illumination in shadow areas while ensuring the fidelity of non-shadow areas. Experimental results show that DMTN competes favorably with state-of-the-art multi-branch/multi-stage shadow removal methods, while maintaining the simplicity of single-stage methods.

Abstract:
In this paper, we propose a Siamese graph learning (SGL) approach to alleviate aging dataset bias. While numerous semi-supervised algorithms have been successfully applied to classification tasks, most of them assume that both the labeled and unlabeled samples are drawn from identical distributions. However, this assumption may not hold due to the heterogeneity of face aging data, which gives rise to bias and unpromising predictions. Motivated by this, our SGL learns to align the sparse distribution with the dense one for dataset debiasing while preserving the real aging smoothness. To achieve this, we adopt a mixup strategy to plausibly generate hallucinatory samples, which leverages unlabeled data to enhance the diversity of unbalanced classes. Moreover, we develop a graph contrastive regularization to suppress the noise introduced by auxiliary unlabeled samples. Extensive experimental results show compelling performance by only utilizing the limited scalability of training annotations.
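
Illustrative sketch (not taken from the paper): the mixup strategy mentioned above is, in its standard form, a convex combination of two samples with a coefficient drawn from a Beta distribution. The Python below shows that standard form only. The function names and the value of `alpha` are assumptions, and how SGL pairs labeled and unlabeled face samples is not specified in the abstract.

```python
import random


def sample_lambda(alpha=0.4, rng=random):
    """Draw a mixing coefficient in [0, 1] from Beta(alpha, alpha) (alpha is assumed)."""
    return rng.betavariate(alpha, alpha)


def mixup(x_a, x_b, lam):
    """Convex combination of two equal-length feature vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]


# Toy usage with a fixed coefficient so the result is easy to check by hand.
mixed = mixup([0.0, 2.0], [4.0, 6.0], 0.25)
```

The same coefficient would normally be applied to the targets of the two samples when targets are available.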

Abstract:
In this paper, we propose a Hierarchical Multimodal Variational Encoder-Decoder (HMMVED) to predict the popularity of micro-videos by comprehensively leveraging the user information and the micro-video content in a hierarchical fashion. In particular, the multimodal variational encoder-decoder framework encodes the input modalities to a lower dimensional stochastic embedding, from which the popularity of micro-videos can be decoded. Considering the leading role of the user’s social influence in social media for information dissemination, a user encoder-decoder is designed to learn the prior Gaussian embedding of the micro-video from the user information, which is informative about the coarse-grained popularity. In order to incorporate the fluctuation around the coarse-grained popularity caused by the diverse multimodal content, in the micro-video encoder-decoder, the refined posterior distribution of the micro-video embedding is encoded from the content features while encouraged to be close to the learned prior distribution. The fine-grained popularity of each micro-video is decoded from the posterior embedding of the micro-video. Based on the multimodal extension of variational information bottleneck theory, we show that the learned latent embeddings of micro-videos are maximally expressive about the popularity whilst maximally compressing the information from input modalities. Extensive experiments conducted on two real-world datasets demonstrate the effectiveness of the proposed method. Codes and datasets are available at: https://github.com/JennyXieJiayi/HMMVED.
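
Illustrative sketch (not taken from the paper): in variational encoder-decoder models, keeping a Gaussian posterior close to a Gaussian prior is commonly done by penalizing their Kullback-Leibler divergence. The Python below computes the closed-form KL divergence between two diagonal Gaussians. Treating this as the closeness term is an assumption, since the abstract does not name the measure used by HMMVED, and the function name is hypothetical.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dimensions.

    All arguments are equal-length lists; the sigmas are standard deviations (> 0).
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp) - 0.5
    return total


# Toy usage: a content-based posterior that shifts the mean of a user-based prior.
kl_same = kl_diag_gaussians([0.0], [1.0], [0.0], [1.0])
kl_shifted = kl_diag_gaussians([1.0], [1.0], [0.0], [1.0])
```

The divergence is zero when posterior and prior coincide and grows as the posterior mean moves away from the prior mean.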

Affiliations: Department of Computer Science, Shanghai Normal University, Shanghai, China; Department of Computer Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong; Biomedical and Multimedia Information Technology Research Group, School of Information Technologies, The University of Sydney, Sydney, NSW, Australia

Abstract:
Recent transformer-based models, especially patch-based methods, have shown huge potentiality in vision tasks. However, the split fixed-size patches divide the input features into the same size patches, which ignores the fact that vision elements are often various and thus may destroy the semantic information. Also, the vanilla patch-based transformer cannot guarantee the information communication between patches, which will prevent the extraction of attention information with a global view. To circumvent those problems, we propose an Efficient Attention Pyramid Transformer (EAPT). Specifically, we first propose the Deformable Attention, which learns an offset for each position in patches. Thus, even with split fixed-size patches, our method can still obtain non-fixed attention information that can cover various vision elements. Then, we design the Encode-Decode Communication module (En-DeC module), which can obtain communication information among all patches to get more complete global attention information. Finally, we propose a position encoding specifically for vision transformers, which can be used for patches of any dimension and any length. Extensive experiments on the vision tasks of image classification, object detection, and semantic segmentation demonstrate the effectiveness of our proposed model. Furthermore, we also conduct rigorous ablation studies to evaluate the key components of the proposed structure.

Abstract:
Face presentation attack detection (PAD) is an essential measure to protect face recognition systems from being spoofed by malicious users and has attracted great attention from both academia and industry. Although most of the existing methods can achieve desired performance to some extent, the generalization issue of face presentation attack detection under cross-domain settings (e.g., the setting of unseen attacks and varying illumination) remains to be solved. In this paper, we propose a novel framework based on asymmetric modality translation for face presentation attack detection in bi-modality scenarios. Under the framework, we establish connections between two modality images of genuine faces. Specifically, a novel modality fusion scheme is presented in which the image of one modality is translated to the other one through an asymmetric modality translator, then fused with its corresponding paired image. The fusion result is fed as the input to a discriminator for inference. The training of the translator is supervised by an asymmetric modality translation loss. Besides, an illumination normalization module based on Pattern of Local Gravitational Force (PLGF) representation is used to reduce the impact of illumination variation. We conduct extensive experiments on three public datasets, which validate that our method is effective in detecting various types of attacks and achieves state-of-the-art performance under different evaluation protocols.

Abstract:
Currently, with the rapid development of mobile Internet, micro-video has become a prevailing format of user-generated content (UGC) on various social media platforms. Several studies have been conducted toward understanding high-level micro-video semantics, such as venue categorization, memorability, and popularity. However, these approaches support tasks with only a single output, which limits them when they are applied to tasks with multiple outputs, especially multi-label micro-video classification. To tackle this problem, in this paper, we propose a dual multi-modal low-rank decomposition (DMLRD) method for multi-label micro-video classification tasks. To learn more comprehensive micro-video representations, we first learn the low-rank-regularized modality-specific and modality-shared components by considering the consistency and the complementarity among modalities simultaneously. Meanwhile, the limited descriptive power of each modality, which arises from its inherent properties, can be alleviated to a certain extent. To obtain unseen label representations, we next construct a sparsity-regularized multi-matrix normal estimation term to jointly encode the latent relationship structures among labels and dimensions. Experiments on two datasets demonstrate the effectiveness of our proposed method over the state-of-the-art methods.

Abstract:
Weakly supervised object detection (WSOD) aims to train object detectors by using only image-level annotations. Many recent works on WSOD adopt multiple instance detection networks (MIDN), which usually generate a certain number of proposals and regard proposal classification as a latent model learning within image classification. However, these methods tend to detect salient objects, salient object parts, and clustered objects due to the lack of instance-level annotations during training. Thus a core issue is how to guarantee that the network learns as many objects with precise bounding boxes as possible. In this paper, we address this issue by exploiting the potential of proposal scores during training. We propose an adaptive instance refinement (AIR) framework with three novel designs, which can be integrated with MIDN into a single network. Specifically, adaptive instance mining attempts to discover all positive instances according to the score distribution of proposals and their spatial similarity. Adaptive score modulation dynamically adjusts proposal scores to make the network focus more on instances with different difficulties in different training iterations. Adaptive knowledge refinement distills important information from all previous stages by the weighted average of proposal scores. The experimental results on the PASCAL VOC 2007 and 2012 benchmarks and the MS COCO benchmark demonstrate that AIR significantly improves the performance of the original MIDN and achieves state-of-the-art results.

Abstract:
Zero-Shot Learning (ZSL) aims to recognize unseen classes that never appear during training. Recently, generative adversarial networks (GANs) have been introduced to convert ZSL into a supervised learning problem by synthesizing unseen visual features. However, since unseen classes are never experienced for the generator during training, the synthesized unseen visual features often become heavily biased towards seen classes, or sometimes there is even no meaningful class that can be assigned to them. This is known as the bias problem. In this paper, we propose a novel method, namely Adaptive Bias-Aware GAN (ABA-GAN), to alleviate generating biased visual features. For this purpose, we build a semantic adversarial network to regularize the feature generator. Specifically, an adaptive adversarial loss is proposed to constrain the feature distributions, which avoids the generation of meaningless visual features. Meanwhile, a domain divider is presented to explicitly distinguish synthesized visual features between seen and unseen domains, such that the bias towards seen classes can be alleviated. Moreover, we propose a novel metric named bias score (BS) to explicitly quantify the degree of the strong bias. Extensive experiments on four widely used benchmark datasets demonstrate that our proposed method outperforms the state-of-the-art approaches under both ZSL and GZSL protocols.

Abstract:
The severe challenge of visual object tracking is to find the target to be tracked amid various kinds of noise interference and obtain its accurate bounding box coordinates. Recently, object tracking technology based on the Siamese network has made great breakthroughs, and more and more Siamese network trackers have been proposed with superior performance. However, they still have some shortcomings. To this end, a new Multi-Stage visual tracking algorithm with Siamese Anchor-Free Proposal Network (MS-SiamAFPN) is proposed in this paper. The algorithm is a three-stage Siamese network tracker composed of a Feature Extraction and Fusion (FEF) sub-network, a Classification and Regression (CR) sub-network, and a Validation and Regression (VR) sub-network in series. Firstly, the Anchor-Free Proposal Network (AFPN) module is designed in the CR stage, which can make full use of positive and negative samples for training while reducing neural network parameters. Secondly, aiming to achieve better robustness and recognizability in the VR stage, on the one hand, a novel Feature Purification (FP) module is designed, which can automatically select the important channels and extract the features of irregular regions on the input fusion features, so as to strengthen the representation ability of image features. On the other hand, the target recognition and position regression are regarded as different processing tasks, and the recognition score and position fine-tuning of candidate targets are obtained by a newly designed Dual-Branch Network (DBN) structure, thereby avoiding feature ambiguity. Owing to the synergy of these innovations, MS-SiamAFPN obtains a large performance improvement and achieves SOTA performance on multiple public benchmarks.

Abstract:
Single image deraining (SIDR) often suffers from over/under deraining due to the nonuniformity of rain densities and the variety of raindrop scales. In this paper, we propose a continuous density-guided network (CODE-Net) for SIDR. In particular, it is composed of a rain streak extractor and a denoiser, where convolutional sparse coding (CSC) is exploited to filter out noise from the extracted rain streaks. Inspired by the reweighted iterative soft-thresholding algorithm (ISTA) for CSC, we address the problem of continuous rain density estimation by learning the weights with channel attention blocks from sparse codes. We further develop a multiscale strategy to depict rain streaks appearing at different scales. Experiments on synthetic and real-world data demonstrate the superiority of our methods over recent state-of-the-art methods, in terms of both quantitative and qualitative results. Additionally, instead of quantizing rain density with several levels, our CODE-Net can provide continuous-valued estimations of rain densities, which is more desirable in real applications.
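
Illustrative sketch (not taken from the paper): the reweighted ISTA referred to above alternates a gradient step on a least-squares data term with a soft-thresholding step whose threshold is scaled per coefficient. The Python below shows one such iteration for a small dense dictionary. The dense (non-convolutional) dictionary, the fixed `weights`, and all names are assumptions for illustration; in CODE-Net the weights are learned by channel attention blocks.

```python
def soft_threshold(x, t):
    """Shrink x toward zero by t; values with |x| <= t become exactly 0."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def ista_step(x, y, D, step, lam, weights):
    """One reweighted ISTA iteration for 0.5 * ||y - D x||^2 + lam * sum_j weights[j] * |x_j|.

    D is a list of rows (m x n), x has length n, y has length m.
    """
    m, n = len(D), len(D[0])
    residual = [sum(D[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
    grad = [sum(D[i][j] * residual[i] for i in range(m)) for j in range(n)]
    return [
        soft_threshold(x[j] - step * grad[j], step * lam * weights[j])
        for j in range(n)
    ]


# Toy usage with an identity dictionary (hypothetical numbers).
D = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.2]
x_uniform = ista_step([0.0, 0.0], y, D, step=1.0, lam=0.5, weights=[1.0, 1.0])
x_reweighted = ista_step([0.0, 0.0], y, D, step=1.0, lam=0.5, weights=[1.0, 0.2])
```

With uniform weights the small second coefficient is thresholded to zero, while a smaller weight on that coefficient lets it survive, which is how per-coefficient weights modulate sparsity.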

Affiliations: School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing, China; Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China; Institute of Deep Learning, Baidu Research and National Engineering Laboratory for Deep Learning Technology and Application, Beijing, China; Institute of North Electronic Equipment, Academy of Military Sciences, Beijing, China

Abstract:
Unmanned Aerial Vehicles (UAVs) have many applications in both commerce and recreation. However, irresponsibly operated UAVs will pose a threat to public safety. Therefore, developing our understanding of UAVs and their uses is of particular interest. This paper considers tracking UAVs, which provides multifaceted information about location, paths, and trajectories. To facilitate research on this topic, we introduce a new benchmark, herein referred to as Anti-UAV, which provides a novel direction for UAV tracking with more than 300 video pairs containing over 580k manually annotated bounding boxes. Addressing anti-UAV research challenges could help to design anti-UAV systems, which in turn may improve surveillance. Accordingly, a simple yet effective approach called dual-flow semantic consistency (DFSC) is proposed for UAV tracking. Modulated by the semantic flow across video sequences, the tracker learns more robust class-level semantic information and obtains more discriminative instance-level features. Experiments highlight significant performance gain with the proposed approach over state-of-the-art trackers and the challenging aspects of Anti-UAV. The Anti-UAV benchmark and the code for the proposed approach have been made publicly available at https://github.com/ucas-vg/Anti-UAV and https://github.com/ZhaoJ9014/Anti-UAV.

Abstract:
Surprising performance has been achieved in style transfer since deep learning was introduced to it. However, the existing state-of-the-art (SOTA) algorithms suffer from either quality issues or high computational complexity. The quality issues include shape retention and the adequacy of style migration, and the computational complexity is reflected in the network complexity and additional updates when the style changes. To deal with the above problems, we propose a novel low computational complexity arbitrary style transfer algorithm (LCCStyle) that mainly consists of a transformation feature module (TFM) and a learning transformation module (LTM). The TFM is responsible for transforming the content feature map into the stylized feature map without impact on the integrity of content information, which contributes to good shape retention and full style migration. In addition, to avoid additional updates when the style changes, we propose a new training mechanism for arbitrary style transfer to directly generate the parameters of the TFM by a hyper-network. However, the widely used hyper-networks are composed of fully connected layers, which results in a large number of parameters. Hence, we designed a hyper-network (LTM) consisting of one-dimensional convolution to adapt to the characteristics of the Gram matrix of the style feature map, contributing to a small model size and having no impact on quality. Quantitative comparison and a user study show that LCCStyle achieves high performance on both the adequacy of style migration and shape retention. Furthermore, compared with the SOTAs, the size of the proposed model is reduced by a large margin of roughly 51.4% to 99.6%. When the input is 512×512 pixels, the processing speeds in the cases of unchanged style and constantly changing style are increased by at least 135% and 227%, respectively. On an Nvidia TITAN RTX GPU, LCCStyle reaches 60 fps for 720p video and takes only 1 s to process 8K images.
https://github.com/HuangYujie94/LCCStyle.

Abstract:
Image and sentence matching is a critical task to bridge the visual and textual discrepancy due to the heterogeneous modalities. Great progress has been made by exploring the coarse-grained relationships between images and sentences or fine-grained relationships between regions and words. However, how to fully excavate and exploit corresponding relations between these two modalities is still challenging. In this work, we propose a novel Multi-scale Fine-grained Alignments Network (MFA), which can effectively explore multi-scale visual-textual correspondences to facilitate bridging cross-modal discrepancy. Specifically, a word-scale matching module is first utilized to mine the basic but fundamental correspondences between a single word and an independent region. Then, we propose a phrase-scale matching module to explore the relations between objects with the constraint of attribute and corresponding region, which can further preserve more associated information. To cope with the complex interactions among multiple phrases and images, we design the relation-scale matching module to capture high-order semantics between two modalities. Moreover, each matching module includes visual aggregation and textual aggregation, which can ensure the bi-directional coupling of multi-scale semantics. Extensive qualitative and quantitative experiments on two challenging datasets, Flickr30K and MSCOCO, show that the proposed method achieves superior performance compared with the existing methods.

Abstract:
Curved scene text recognition is a challenging task in the multimedia community due to large shape and texture variance. Previous methods address this challenge by extracting and rectifying text lines with equidistant sampling, which ignores character-level information and leads to distorted characters. To address this issue, this paper proposes a Character-Aware Sampling and Rectification (CASR) module, which rectifies irregular text instances according to the location and orientation information of each individual character. Specifically, CASR regards each character as a basic unit and predicts the character-level attributes for sampling and rectification. Our module not only exploits detailed character information to obtain better rectification of text lines, but also employs character-level supervision in the training process. In addition, CASR provides a plug-and-play module which can be easily incorporated into existing text recognition pipelines. Extensive experiments on several benchmarks demonstrate that our method obtains more accurate rectified text instances and achieves promising performance. We will release our code and models in the future.

Abstract:
Recently, recommendation systems have been widely used in online business scenarios, where they can improve the online experience by learning the user or item characteristics to predict the user’s future behavior and to realize precision marketing. However, data sparsity and cold-start problems limit the performance of recommendation systems in some emerging fields. Thus, cross-domain recommendation has been proposed to handle the abovementioned problems. Nonetheless, many cross-domain recommendation methods only consider modeling a single user’s representation and ignore user-group information (a group being users with similar behavior and interests). Additionally, most studies are based on matrix factorization for generating embeddings, which results in a weak generalization ability of user latent features. In this paper, we propose a novel cross-domain recommendation model via User-Clustering and Multidimensional information Fusion (UCMF) that attempts to enhance user representation learning in a data sparsity scenario for accurate recommendation. In addition, we consider a user’s individual information and cross-domain feature information. A novel multidimensional information fusion is proposed to guarantee the robustness of the user features. In particular, we apply a graph neural network to learn the user-group features, which can effectively preserve the correlation among users’ information and guarantee feature performance. Meanwhile, the Wasserstein autoencoder is utilized to learn the cross-domain user features, which can guarantee the consistency of user features from different domains. Experiments conducted on real-world datasets empirically demonstrate that our proposed method outperforms the state-of-the-art methods in cross-domain recommendation.

Abstract:
One of the important factors affecting micro-video recommender systems is modeling the multi-modal user preference on micro-videos. Despite the remarkable performance of prior arts, they are still limited by fusing the user preference derived from different modalities in a unified manner, ignoring that users tend to place different emphasis on different modalities. Furthermore, modality missing is ubiquitous and unavoidable in micro-video recommendation: the information of some modalities of micro-videos is lacking in many cases, which negatively affects the multi-modal fusion operations. To overcome these disadvantages, we propose a novel framework for micro-video recommendation, dubbed Dual Graph Neural Network (DualGNN), built upon the user-microvideo bipartite and user co-occurrence graphs, which leverages the correlation between users to collaboratively mine the particular fusion pattern for each user. Specifically, we first introduce a single-modal representation learning module, which performs graph operations on the user-microvideo graph in each modality to capture single-modal user preferences on different modalities. We then devise a multi-modal representation learning module to explicitly model the user’s attention over different modalities and inductively learn the multi-modal user preference. Finally, we propose a prediction module to rank the potential micro-videos for users. Extensive experiments on two public datasets demonstrate the significant superiority of our DualGNN over state-of-the-art methods.

Abstract:
For visual-semantic embedding, the existing methods normally treat the relevance between queries and candidates in a bipolar way – relevant or irrelevant, and all “irrelevant” candidates are uniformly pushed away from the query by an equal margin in the embedding space, regardless of their various proximity to the query. This practice disregards relatively discriminative information and could lead to suboptimal ranking in the retrieval results and poorer user experience, especially in the long-tail query scenario where a matching candidate may not necessarily exist. In this paper, we introduce a continuous variable to model the relevance degree between queries and multiple candidates, and propose to learn a coherent embedding space, where candidates with higher relevance degrees are mapped closer to the query than those with lower relevance degrees. In particular, the new ladder loss is proposed by extending the triplet loss inequality to a more general inequality chain, which implements variable push-away margins according to respective relevance degrees. To adapt to the varying mini-batch statistics and improve the efficiency of the ladder loss, we also propose a Silhouette score-based method to adaptively decide the ladder level and hence the underlying inequality chain. In addition, a proper Coherent Score metric is proposed to better measure the ranking results including those “irrelevant” candidates. Extensive experiments on multiple datasets validate the efficacy of our proposed method, which achieves significant improvement over existing state-of-the-art methods.
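To make the inequality chain concrete, the following is a minimal sketch rather than the authors' formulation. It assumes each candidate has already been assigned a discrete ladder level (0 = most relevant) and that a hinge penalty is summed over every pair that violates the chain. The function name `ladder_loss`, its arguments, and the plain-sum reduction are all hypothetical.

```python
from typing import Sequence


def ladder_loss(dists: Sequence[float],
                levels: Sequence[int],
                margins: Sequence[float]) -> float:
    """Hinge loss over an inequality chain for a single query (illustrative).

    dists[i]   : distance from the query to candidate i in the embedding space
    levels[i]  : assumed ladder level of candidate i (0 = most relevant)
    margins[l] : push-away margin between levels <= l and levels > l
    """
    loss = 0.0
    for l, margin in enumerate(margins):
        near = [d for d, lv in zip(dists, levels) if lv <= l]
        far = [d for d, lv in zip(dists, levels) if lv > l]
        # Every candidate at or below level l should be closer to the query
        # than every candidate above level l, by at least margins[l].
        for d_near in near:
            for d_far in far:
                loss += max(0.0, margin + d_near - d_far)
    return loss


# With two levels and one margin this reduces to the usual triplet hinge.
triplet_like = ladder_loss([0.3, 0.4], [0, 1], [0.2])
# Three levels: the level-1 candidate sits farther away than the level-2 one,
# so the second link of the chain is violated and contributes a penalty.
chain_violation = ladder_loss([0.1, 0.5, 0.4], [0, 1, 2], [0.2, 0.2])
```

Under these assumptions, a candidate in a middle level is pulled closer than less relevant candidates and pushed farther than more relevant ones, which is the graded behavior the abstract contrasts with a single uniform margin.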

Abstract:
Early action recognition, i.e., recognizing an action before it is fully performed, is a challenging and important task. Existing works mainly focus on deterministic early action recognition outputting only a single class, and ignore the uncertainty and diversity that essentially exist in this task. Intuitively, when only the early portion of the action is observed, there could be multiple possibilities of the full action, as diversified actions can share almost identical early segments in many scenarios. Thus, taking uncertainties and diversities into account and outputting multiple plausible predictions, instead of a single one, can be important for the sake of authenticity and the requirements of many practical applications. To this end, we propose a novel Diversified Early Action Recognition Network (Dear-Net) that is capable of outputting multiple reasonable action classes for each partial sequence by utilizing mode conversion. Specifically, we introduce an effective action diversity learning strategy to drive our network towards predicting diverse and reasonable results, in which each learnable action class is matched with the most suitable mode. Meanwhile, the collapsed modes, which fail to receive any action class, are also considered in this strategy in order to ensure diversity. Moreover, we design a sequence decoder within our network to capture latent global information for better early action recognition. It provides a feasible scheme for the weakly-supervised setting, in which the Dear-Net leverages unlabelled data to improve performance. Experimental results on three challenging datasets clearly show the effectiveness of our approach.

Abstract:
Two-stage person search methods achieve state-of-the-art performance with separate detection and re-ID stages, but neglect the consistency needs between these two stages. The re-ID stage needs more accurate query bounding boxes and fewer boxes of distractors; the detection stage needs the re-ID stage to be robust against unavoidable detection errors. In this paper, we introduce a novel Bi-directional Task-Consistent Learning (BTCL) person search framework, including a Target-Specific Detector (TSD) and a re-ID model with Dynamic Adaptive Learning Structure (DALS). For the former consistency need, we add a verification head for predicting the similarity scores between query and proposals in parallel with the existing heads for bounding box recognition. Thus, TSD generates accurate boxes for the query-like pedestrians, which are suitable for the re-ID stage. For the re-ID robustness need, DALS dynamically generates a large number of possible detection results in line with the real distribution. By training the re-ID model on data with different types of detection errors, DALS improves the model's robustness to detection inputs. Experimental results show that our framework achieves state-of-the-art performance on two widely-used person search datasets.

Abstract:
Cross-modal hashing has become a vital technique in cross-modal retrieval in recent years due to its fast query speed and low storage cost. Generally, most prior supervised cross-modal hashing methods are flat methods which are designed for non-hierarchical labeled data. They treat different categories independently and ignore the inter-category correlations. In practical applications, many instances are labeled with hierarchical categories. The hierarchical label structure provides rich information among different categories. To make rational use of category correlations, hierarchical cross-modal hashing has been proposed. However, existing methods intend to preserve instance-pairwise or class-pairwise similarities, which cannot fully explore the semantic correlations among different categories and makes the learned hash codes less discriminative. In this paper, we propose a deep cross-modal hashing method named hierarchical semantic structure preserving hashing (HSSPH), which directly exploits the label hierarchy information to learn discriminative hash codes. Specifically, HSSPH learns a set of class-wise hash codes for each layer. By augmenting class-wise codes with labels, it generates layer-wise prototype codes which reflect the semantic structure of each layer. In order to enhance the discriminative ability of hash codes, HSSPH supervises the hash code learning with both labels and semantic structures to preserve the hierarchical semantics. Besides, efficient optimization algorithms are developed to directly learn the discrete hash codes for each instance and each class. Extensive experiments on two benchmark datasets show the superiority of HSSPH over several state-of-the-art methods.

Abstract:
This paper aims at robust and discriminative feature learning for target re-identification (Re-ID). In addition to paying attention to the individual appearance information as in most Re-ID methods, we further utilize the abundant contextual information as additional clues to guide the feature learning. Graph as a format of structured data is used to represent the target sample with its context. It describes the first-order appearance information of the samples and the second-order topological relationship information among samples, based on which we compute the feature representation by learning a graph feature embedding. We provide a detailed analysis of graph convolutional network mechanism applied in target Re-ID and propose a novel progressive context-aware graph feature learning method, in which the message passing is dominated by a pre-defined adjacency relationship followed by a learned relationship in a self-adaptive way. The proposed method fully exploits and utilizes contextual information at a low cost for Re-ID. Extensive experiments on five Re-ID benchmarks demonstrate the state-of-the-art performance of the proposed method.

Abstract:
The feature models used by existing Thermal InfraRed (TIR) tracking methods are usually learned from RGB images due to the lack of a large-scale TIR image training dataset. However, these feature models are less effective in representing TIR objects and they are difficult to effectively distinguish distractors because they do not contain fine-grained discriminative information. To this end, we propose a dual-level feature model containing the TIR-specific discriminative feature and fine-grained correlation feature for robust TIR object tracking. Specifically, to distinguish inter-class TIR objects, we first design an auxiliary multi-classification network to learn the TIR-specific discriminative feature. Then, to recognize intra-class TIR objects, we propose a fine-grained aware module to learn the fine-grained correlation feature. These two kinds of features complement each other and represent TIR objects in the levels of inter-class and intra-class respectively. These two feature models are constructed using a multi-task matching framework and are jointly optimized on the TIR object tracking task. In addition, we develop a large-scale TIR image dataset to train the network for learning TIR-specific feature patterns. To the best of our knowledge, this is the largest TIR tracking training dataset with the richest object class and scenario. To verify the effectiveness of the proposed dual-level feature model, we propose an offline TIR tracker (MMNet) and an online TIR tracker (ECO-MM) based on the feature model and evaluate them on three TIR tracking benchmarks. Extensive experimental results on these benchmarks demonstrate that the proposed algorithms perform favorably against the state-of-the-art methods.

Abstract:
Image virtual try-on aims at replacing the cloth on a personal image with a garment image (in-shop clothes), and has attracted increasing attention from the multimedia and computer vision communities. Prior methods successfully preserve the character of clothing images; however, occlusion remains a pernicious effect for realistic virtual try-on. In this work, we first present a comprehensive analysis of the occlusions and categorize them into two aspects: i) Inherent-Occlusion: the ghost of the former cloth still exists in the try-on image; ii) Acquired-Occlusion: the target cloth warps to an unreasonable body part. Based on the in-depth analysis, we find that the occlusions can be simulated by a novel semantically-guided mixup module, which can generate semantic-specific occluded images that work together with the try-on images to facilitate training a de-occlusion try-on (DOC-VTON) framework. Specifically, DOC-VTON first conducts a sharpened semantic parsing on the try-on person. Aided by semantic guidance and pose prior, textures of various complexities are selectively blended with human parts in a copy-and-paste manner. Then, the Generative Module (GM) is utilized to synthesize the final try-on image and learn to de-occlude jointly. In comparison to the state-of-the-art methods, DOC-VTON achieves better perceptual quality by reducing occlusion effects.

Abstract:
Occlusion handling in crowded scenes is an intractable challenge for human pose estimation. To address this problem, we propose two novel feed-forward network structures named Global Feed-Forward Network (GFFN) and Dynamic Feed-Forward Network (DFFN), which are specifically designed for image-based tasks to capture both local and global contextual information within intermediate features and update feature representations with high adaptability for occlusions. By exploiting the context modeling ability of the proposed GFFN and DFFN, we present a novel backbone network, namely High-Resolution Context Network (HRNeXt), which learns high-resolution representations with abundant contextual information to better estimate poses of occluded human bodies. Compared to state-of-the-art pose estimation networks, our HRNeXt absorbs advantages of convolution operation and attention mechanism, and it is more efficient in terms of training data sizes, network parameters and computational costs. Experimental results show that our HRNeXt significantly outperforms state-of-the-art backbone networks on challenging pose estimation datasets with high occurrence of crowds and occlusions.

Abstract:
In this paper, we are tackling the weakly referring expression grounding task to localize the target object in an image according to a given query sentence, where the mapping between the query sentence and image regions is blind during the training period. Previous methods all follow a cyclic forward-backward pipeline to handle this task, where the query sentence is firstly converted to the result region through the forward module, and then the result region is converted back to a sentence through the backward module, with the difference between the reconstructed sentence and original query used as the loss to optimize the entire network. These existing methods, however, suffer from the deviation issue when the result region, generated through the forward module, totally deviates from the target area, but the backward module still reconstructs a similar sentence. The aforementioned loss function cannot penalize this kind of deviation because of the consistent prediction of the sentence. To overcome this limitation, we propose a cycle-free pipeline, where a region describer network is designed to predict the textual description for each candidate region, and a result region is selected according to the similarity between the predicted description and the query sentence. Furthermore, a self-paced learning mechanism is designed to avoid the drift issue during the warm-up period of the optimization process. The proposed method achieves a higher average accuracy on RefCOCO and RefCOCO+ datasets, compared with all previous state-of-the-art methods.
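The selection step of the cycle-free pipeline can be illustrated with a toy sketch. It is not the authors' model: the region describer is replaced by hard-coded strings, and the learned similarity is replaced by simple word-overlap (Jaccard) similarity. The names `jaccard` and `select_region` are hypothetical.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity, a stand-in for a learned text similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_region(query: str,
                  region_descriptions: List[str]) -> Tuple[int, List[float]]:
    """Pick the candidate region whose predicted description best matches the query.

    region_descriptions[i] plays the role of the region describer's output
    for candidate region i (here simply given as text).
    """
    scores = [jaccard(query, d) for d in region_descriptions]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


descriptions = ["a dog on grass", "man in a red shirt", "woman holding umbrella"]
best_index, scores = select_region("man in red shirt", descriptions)
```

Because every candidate is scored directly against the query, a region far from the target receives a low score of its own, rather than being rescued by a backward module that reconstructs a plausible sentence.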

Abstract:
Most of the existing semantic segmentation approaches with image-level class labels as supervision, highly rely on the initial class activation map (CAM) generated from the standard classification network. In this paper, a novel “Progressive Patch Learning” approach is proposed to improve the local details extraction of the classification, producing the CAM better covering the whole object rather than only the most discriminative regions as in CAMs obtained in conventional classification models. “Patch Learning” destructs the feature maps into patches and independently processes each local patch in parallel before the final aggregation. Such a mechanism enforces the network to find weak information from the scattered discriminative local parts, achieving enhanced local details sensitivity. “Progressive Patch Learning” further extends the feature destruction and patch learning to multi-level granularities in a progressive manner. Cooperating with a multi-stage optimization strategy, such a “Progressive Patch Learning” mechanism implicitly provides the model with the feature extraction ability across different locality-granularities. As an alternative to the implicit multi-granularity progressive fusion approach, we additionally propose an explicit method to simultaneously fuse features from different granularities in a single model, further enhancing the CAM quality on the full object coverage. Our proposed method achieves outstanding performance on the PASCAL VOC 2012 dataset (e.g., with 69.6% mIoU on the test set), which surpasses most existing weakly supervised semantic segmentation methods.

Abstract:
The learning process of deep learning methods usually updates the model’s parameters in multiple iterations. Each iteration can be viewed as the first-order approximation of Taylor’s series expansion. The remainder, which consists of higher-order terms, is usually ignored in the learning process for simplicity. This learning scheme empowers various multimedia-based applications, such as image retrieval, recommendation system, and video search. Generally, multimedia data (e.g. images) are semantics-rich and high-dimensional, hence the remainders of approximations are possibly non-zero. In this work, we consider that the remainder is informative and study how it affects the learning process. To this end, we propose a new learning approach, namely gradient adjustment learning (GAL), to leverage the knowledge learned from the past training iterations to adjust vanilla gradients, such that the remainders are minimized and the approximations are improved. The proposed GAL is model- and optimizer-agnostic, and is easy to adapt to the standard learning framework. It is evaluated on three tasks, i.e. image classification, object detection, and regression, with state-of-the-art models and optimizers. The experiments show that the proposed GAL consistently enhances the evaluated models, whereas the ablation studies validate various aspects of the proposed GAL. The code is available at https://github.com/luoyan407/gradient_adjustment.git.
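For reference, the first-order view of one update step can be written out as follows. The notation is an assumption, not taken from the paper: $\mathcal{L}$ is the training loss, $\theta_t$ the parameters at iteration $t$, $\eta$ the learning rate, and $R_1$ the remainder that collects all higher-order terms.

```latex
\mathcal{L}(\theta_t + \Delta\theta)
  = \mathcal{L}(\theta_t)
  + \nabla\mathcal{L}(\theta_t)^{\top}\Delta\theta
  + R_1(\theta_t, \Delta\theta),
\qquad
\Delta\theta = -\eta\,\nabla\mathcal{L}(\theta_t)
```

A vanilla gradient step chooses $\Delta\theta$ using only the linear term and treats $R_1$ as negligible; the abstract's point is that $R_1$ need not be small for high-dimensional multimedia data.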

Abstract:
The detection results of many existing co-saliency detection methods are easily disturbed by unrelated salient objects that have appearance characteristics similar to those of the co-salient objects. Therefore, mining the inter-saliency cues, which contain the common category information of multiple related images, is the core of co-saliency detection. To address the above concern, a novel group weakly supervised learning induced co-saliency detection (GWSCoSal) model is proposed in this paper. First of all, a novel group class activation maps (GCAM) network is constructed and trained through a group weakly supervised learning scheme, which adopts the common category of a group of related images as the ground truth. The GCAM produced by the trained GCAM network are considered as the inter-saliency cues, which highlight only the regions covered by the objects of the common category. Afterwards, the GCAM are integrated into a feature pyramid network (FPN) based backbone trained with pixel-level labels to infer the co-saliency maps. The group weakly supervised learning and the pixel-level learning are jointly implemented for end-to-end training of the GWSCoSal model. Comprehensive comparisons with 13 state-of-the-art methods demonstrate that our GWSCoSal model can detect the co-salient objects more accurately in the presence of similar unrelated salient objects, and that its overall performance reaches the level of state-of-the-art methods. The ablation study of our GWSCoSal model validates the effectiveness of the proposed GCAM network.

Abstract:
In this paper, we aim to devise a new framework that equips the network with the capability of detecting objects using only image-level class labels as supervision. The challenge of such a weakly supervised setting mainly lies in how to make the network accurately understand both the semantics and the objectness of a given proposal without bounding box annotations. To this end, we contribute a concise framework, named Class Prototypical Network (CPNet). Concretely, our CPNet defines a set of learnable class prototypes to help classify object proposals. To make the prototypes not only discriminative for classes but also sensitive to the proposals' objectness, we conduct both class-aware cross-attention and location-aware cross-attention between the feature embeddings of the learnable prototypes and the proposals. The learned attention scores are then used to aggregate the proposal-level category information into image-level information, allowing the entire framework to be trained without any bounding box annotations. Besides, by applying these two kinds of attention mechanisms, the knowledge of both the proposals' locations and their class information can be successfully transferred into the corresponding prototypes. With the help of prototypes, our CPNet detects true positive object proposals. In addition, the CPNet further introduces a multi-head detection head to perform complementary training, preventing the model from falling into local discriminative parts and improving the model's performance on challenging non-rigid categories. We examine our CPNet on popular benchmarks, i.e., PASCAL VOC 2007, 2012 and MS COCO 2014. Extensive experiments show that our CPNet is a simple and effective framework.

Abstract:
Existing solutions for weakly supervised object detection (WSOD) generally follow the multiple instance learning (MIL) paradigm to formulate WSOD as a multi-class classification problem over a set of region proposals. However, without the supervision signal of ground-truth boxes, the training objective of multi-class classification makes the detectors devote their main effort to finding the most common pattern of each class, as the common pattern is always the most discriminative evidence for classification. In addition, although they learn to distinguish multiple foreground classes, the detectors can still fail to differentiate foreground regions from background ones, which causes false alarms in prediction. These two points account for the limited localization capability of MIL-based WSOD methods. To this end, we propose foreground information guided WSOD (FI-WSOD), a novel framework that introduces an extra foreground-background binary classification (F-BBC) sub-task to the original MIL-based WSOD paradigm. At the training stage, the involvement of the F-BBC task not only improves the feature representation of the network, but also provides extra information from the foreground-background perspective. By leveraging the learnt foreground information, a Foreground Guided Self-Training (FGST) module is further proposed to filter out noisy samples and to mine representative seeds from the remaining proposals. Moreover, a Multi-Seed Training strategy is performed to reduce the impact of noisy labels when training the self-training networks in FGST. We have conducted extensive experiments on the prevalent Pascal VOC 2007, Pascal VOC 2012 and MSCOCO datasets, and report a series of state-of-the-art records achieved by our proposed framework.

Abstract:
The temporal action segmentation task segments videos temporally and predicts action labels for all frames. Fully supervising such a segmentation model requires dense frame-wise action annotations, which are expensive and tedious to collect. This work is the first to propose a Constituent Action Discovery (CAD) framework that only requires the video-wise high-level complex activity label as supervision for temporal action segmentation. The proposed approach automatically discovers constituent video actions using an activity classification task. Specifically, we define a finite number of latent action prototypes to construct video-level dual representations with which these prototypes are learned collectively through the activity classification training. This setting endows our approach with the capability to discover potentially shared actions across multiple complex activities. Due to the lack of action-level supervision, we adopt the Hungarian matching algorithm to relate latent action prototypes to ground truth semantic classes for evaluation. We show that with the high-level supervision, the Hungarian matching can be extended from the existing video and activity levels to the global level. The global-level matching allows for action sharing across activities, which has never been considered in the literature before. Extensive experiments demonstrate that our discovered actions can help perform temporal action segmentation and activity recognition tasks.
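The evaluation-time matching between latent prototypes and ground-truth classes is an assignment problem. The sketch below is illustrative only: it solves a tiny instance by brute force over permutations instead of the Hungarian algorithm, which finds the same optimum in polynomial time. The overlap counts and the name `best_matching` are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def best_matching(overlap: List[List[int]]) -> Tuple[List[int], int]:
    """Find the one-to-one prototype-to-class mapping with maximal total overlap.

    overlap[p][c] = number of frames predicted as prototype p whose
    ground-truth label is class c (square matrix assumed).
    Brute force: only suitable for a handful of prototypes.
    """
    n = len(overlap)
    best_perm, best_score = None, -1
    for perm in permutations(range(n)):
        score = sum(overlap[p][perm[p]] for p in range(n))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm), best_score


# Hypothetical frame-overlap counts for 3 prototypes and 3 action classes.
overlap = [
    [5, 90, 5],
    [80, 10, 10],
    [3, 7, 60],
]
mapping, matched_frames = best_matching(overlap)
```

Whether the overlap matrix is accumulated per video, per activity, or over the whole dataset determines the matching level; accumulating it globally corresponds to the global-level matching described in the abstract.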

Abstract:
Few-shot learning is a tough topic to solve since obtaining a large number of training samples in real applications is challenging. It has attracted increasing attention recently. Meta-learning is a prominent way to address this issue, intending to adapt predictors as base-learners to new tasks swiftly. However, a key challenge of meta-learning is its lack of expressive capacity, which stems from the difficulty of extracting general information from a small number of training samples. As a result, the generalizability of meta-learners trained from high-dimensional parameter spaces is frequently limited. To learn a better representation, we propose a graph complemented latent representation (GCLR) network for few-shot image classification. In particular, we embed the representation into a latent space, in which the latent codes are reconstructed using variational information to enrich the representation. In this way, the latent representation can achieve better generalizability. Another benefit is that, because the latent space is formed using variational inference, it cooperates well with various base-learners, boosting robustness. To make full use of the relation between samples in each category, a graph neural network (GNN) is also incorporated to improve relation mining. Consequently, our end-to-end framework delivers competitive performance on three few-shot learning benchmarks for image classification.

Abstract:
This paper proposes a decoder-side Cross Resolution Synthesis (CRS) module to pursue better compression efficiency beyond the latest Versatile Video Coding (VVC), where we encode intra frames at the original high resolution (HR), compress inter frames at a lower resolution (LR), and then super-resolve decoded LR inter frames with the help of preceding HR intra and neighboring LR inter frames. For an LR inter frame, a motion alignment and aggregation network (MAN) is devised to produce a temporally aggregated motion representation to best guarantee temporal smoothness; a texture compensation network (TCN) is utilized to generate a texture representation from the decoded HR intra frame to better augment spatial details; finally, a similarity-driven fusion engine synthesizes the motion and texture representations to upscale LR inter frames and remove compression and resolution re-sampling noise. We enhance VVC using the proposed CRS, showing average Bjøntegaard Delta Rate (BD-Rate) gains of 8.76% and 11.93% against the latest VVC anchor in Random Access (RA) and Low-delay P (LDP) settings, respectively. In addition, experimental comparisons with state-of-the-art super-resolution (SR) based VVC enhancement methods and ablation studies are conducted to further demonstrate the superior efficiency and generalization of the proposed algorithm. All materials will be made public at https://njuvision.github.io/CRS for reproducible research.

Abstract:
To achieve real-time online search, most image retrieval methods aim to learn compact feature representation while keeping their semantic information or intra-class relevance. In this paper, we propose a new compact feature learning method to embed the underlying manifold information from database. It integrates deep convolutional neural network (CNN) and graph convolutional neural networks (GCN) into a unified end-to-end learning framework. In the proposed method, the deep feature extracted by CNN is automatically embedded with the information from its neighbors by GCN, which possesses the ability of exploring the semantic relevance on the database manifold. Since constructing a graph over the whole database costs unaffordable memory, we build a landmark graph as database sketch. The landmark graph contains two kinds of nodes, including codewords and memory bank samples. Given an image, the deep architecture outputs the discriminative feature and its similarity with all the graph nodes. We directly use the indices of the most similar codeword nodes as the compact feature representation. To make the proposed method scalable to large datasets, a multi-graph strategy is adopted to generate compact features with adaptable code length. The experiments on two benchmark datasets demonstrate the effectiveness of the proposed method.
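The idea of using codeword indices as the compact representation can be sketched as follows. This is a simplified illustration, not the paper's network: it assumes inner-product similarity, one selected index per landmark graph, and hand-written codewords. The names `encode` and `code_length_bits` are hypothetical.

```python
import math
from typing import List, Sequence


def encode(feature: Sequence[float],
           codebooks: List[List[Sequence[float]]]) -> List[int]:
    """Return the index of the most similar codeword in each landmark graph."""
    code = []
    for codewords in codebooks:
        # Inner product as the similarity measure (an assumption).
        sims = [sum(f * c for f, c in zip(feature, cw)) for cw in codewords]
        code.append(max(range(len(sims)), key=sims.__getitem__))
    return code


def code_length_bits(codebooks: List[List[Sequence[float]]]) -> int:
    """Bits needed to store one index per graph."""
    return sum(math.ceil(math.log2(len(cb))) for cb in codebooks)


# Two hypothetical landmark graphs with 4 and 2 codeword nodes.
codebooks = [
    [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.9, 0.1], [0.1, 0.9]],
]
code = encode([1.0, 0.0], codebooks)
bits = code_length_bits(codebooks)
```

Under these assumptions, adding or removing graphs changes the number of stored indices, which is one way a multi-graph strategy could yield an adaptable code length.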

Abstract:
Numerous bottom-up salient object detection algorithms formulate the problem as a classification task. For an input image, these methods usually utilize prior cues to select some regions as a training set, and learn a classifier to classify all regions into foreground/background. However, such binary classification based approaches suffer from accuracy problems in some complex scenes. To this end, we propose a novel framework, namely Multi-Subclass Classification with Label Distribution Learning (MSCLDL). Specifically, prior knowledge is first employed to build a training set from the input image, in which each sample is associated with one of two class labels. Previous works usually learn a binary classification model directly from the training set. Different from them, we further decompose the two classes into a certain number of subclasses, so that each sample is described by one of multiple subclass labels. Based on the multi-subclass training set, we learn a label distribution model to predict the subclass label of each image region. Furthermore, the saliency value of each image region can be computed by exploring the relationship between class and subclass labels. MSCLDL can overcome the limitation of existing classification-based algorithms in some challenging scenes. Finally, a novel refinement technique is presented to further refine the saliency map obtained by MSCLDL. We compare the proposed method with other state-of-the-art methods on four benchmark datasets, and the experimental results demonstrate the superiority of our model.

Abstract:
Recognition of emotions conveyed in images has attracted increasing research attention. Recent studies show that leveraging local affective regions helps to improve the recognition performance. However, these studies do not consider features from the broad context of the local affective regions, which could provide useful information for learning improved emotion representations. In this paper, we present a region-based multiscale network that learns features for the local affective region as well as the broad context for affective image recognition. The proposed network consists of an affective region detection module and a multiscale feature learning module. The class activation mapping method is used to generate pseudo affective regions from a pretrained deep neural network to train the detection module. For the affective region outputted by the detection module, three-scale features are extracted and then encoded by a kernel-based graph attention network for final emotion classification. We show that integrating features from the broad context is effective in improving the recognition performance. We experimentally evaluate the proposed network for both multi-class emotion recognition and binary sentiment classification on different benchmark datasets. The experimental results demonstrate that the proposed network achieves improved or comparable performance as compared to previous state-of-the-art models.

Abstract:
With the recent surge in autonomous driving vehicles, the need for accurate vehicle detection and tracking is critical now more than ever. Detecting vehicles from visual sensors fails in non-line-of-sight (NLOS) settings. This can be compensated by the inclusion of other modalities in a multi-domain sensing environment. We propose several deep learning based frameworks for fusing different modalities (image, radar, acoustic, seismic) through the exploitation of complementary latent embeddings, incorporating multiple state-of-the-art fusion strategies. Our proposed fusion frameworks considerably outperform unimodal detection. Moreover, fusion between image and non-image modalities improves vehicle tracking and detection under NLOS conditions. We validate our models on the real-world multimodal ESCAPE dataset, showing 33.16% improvement in vehicle detection by fusion (over visual inference alone) over test scenarios with 30-42% NLOS conditions. To demonstrate how well our framework generalizes, we also validate our models on the multimodal NuScene dataset, showing ~22% improvement over competing methods.

Abstract:
Dense captioning methods generally detect events in videos first and then generate captions for the individual events. Events are localized solely based on the visual cues while ignoring the associated linguistic information and context. Whereas end-to-end learning may implicitly take guidance from language, these methods still fall short of the power of explicit modeling. In this paper, we propose a Visual-Semantic Embedding (ViSE) Framework that models the word(s)-context distributional properties over the entire semantic space and computes weights for all the n-grams such that higher weights are assigned to the more informative n-grams. The weights are accounted for in learning distributed representations of all the captions to construct a semantic space. To perform the contextualization of visual information and the constructed semantic space in a supervised manner, we design Visual-Semantic Joint Modeling Network (VSJM-Net). The learned ViSE embeddings are then temporally encoded with a Hierarchical Descriptor Transformer (HDT). For caption generation, we exploit a transformer architecture to decode the input embeddings into natural language descriptions. Experiments on the large-scale ActivityNet Captions dataset and YouCook-II dataset demonstrate the efficacy of our method.

Abstract:
Since annotating pedestrians across different views is extremely costly, intra-camera supervised person re-identification (ReID) aims to learn a ReID model from the intra-view labeled data. Under this setting, the main challenge lies in learning a view-invariant feature embedding in the absence of cross-view annotations. Previous works focus on assigning a pseudo identity label to each image based on feature similarity and learning view-invariant features with a classification loss. However, because of the cross-view variations in lighting, background, etc., the pseudo labels are often noisy, and therefore not reliable for classification. In this paper, we explore learning a consistent discrepancy for pairwise images. Our main idea is that the discrepancy between pedestrian images should be consistent across different views regardless of view change so that it mainly depicts the identity difference. Due to the lack of cross-view annotations, we project images into different views and obtain likelihood prototypes for cross-view learning. These likelihood prototypes are used to measure the discrepancies between pairwise images under different views. We then propose an intra-view discrepancy preservation module to enforce the discrepancy to be view-consistent so as to encourage the model to distinguish the images based on the identities regardless of view change. Extensive experiments on multiple datasets show that our method outperforms existing related methods by clear margins and is comparable to supervised counterparts. Code will be made publicly available.

Abstract:
Multimedia applications often involve knowledge transfer across domains, e.g., from images to texts, where Unsupervised Domain Adaptation (UDA) can be used to reduce the domain shifts. Most of the UDA methods are based on adversarial learning. However, previous adversarial domain adaptation methods may suffer from three issues. First, although the features learned by previous methods could fool the domain classifier into making false classification predictions, they may not be domain-invariant. Second, the limited number of training samples makes the latent space of features not smooth and continuous enough. Third, the target domain features may lack discriminability. In this paper, we propose a novel adversarial domain adaptation method named Adversarial Mixup Ratio Confusion (AMRC) to alleviate all the above issues. Specifically, we propose a new adversarial training pattern that uses mixup to generate multiple features with different mixup ratios, which represent different intermediate states between the source and target domain. Then, on one hand, we train an estimator to estimate the mixup ratio as accurately as possible. On the other hand, we train a generator to make the estimator uncertain about the mixup ratio. In this way, our method could learn a continuous and domain-invariant latent space. Furthermore, we apply the intra-domain and cross-domain mixup regularizations to ensure the smoothness and continuity of the latent space, while making the classifier behave more linearly on in-between samples. Finally, we exploit the sharpened pseudo-labels of the target samples for self-supervised learning to enhance the discriminability of the target features. The experimental results on 3 benchmarks verify the effectiveness of our method.
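
As a minimal sketch of the mixup step described above, the snippet below forms intermediate features as convex combinations of a source and a target feature vector at several ratios. The function name `mixup_features`, the ratio grid, and the convention that the ratio weights the source feature are illustrative assumptions, not details taken from the AMRC paper; the estimator and generator training is not shown.

```python
from typing import List, Sequence


def mixup_features(source: Sequence[float], target: Sequence[float], ratio: float) -> List[float]:
    """Standard mixup: ratio * source + (1 - ratio) * target, element-wise."""
    if len(source) != len(target):
        raise ValueError("source and target features must have the same length")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return [ratio * s + (1.0 - ratio) * t for s, t in zip(source, target)]


# Hypothetical 3-dimensional features standing in for network outputs.
source_feat = [1.0, 0.0, 2.0]
target_feat = [0.0, 1.0, 4.0]

# Several mixup ratios give several intermediate states between the two domains.
ratios = [0.0, 0.25, 0.5, 1.0]
mixed = {r: mixup_features(source_feat, target_feat, r) for r in ratios}
```

A ratio of 1.0 returns the source feature unchanged and 0.0 returns the target feature, so the ratio itself is the quantity an estimator would be asked to recover.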

Abstract:
Unsupervised Domain Adaptive (UDA) object re-identification (Re-ID) aims at adapting a model trained on a labeled source domain to an unlabeled target domain. State-of-the-art object Re-ID approaches adopt clustering algorithms to generate pseudo-labels for the unlabeled target domain. However, the inevitable label noise caused by the clustering procedure significantly degrades the discriminative power of the Re-ID model. To address this problem, we propose an uncertainty-aware clustering framework (UCF) for UDA tasks. First, a novel hierarchical clustering scheme is proposed to promote clustering quality. Second, an uncertainty-aware collaborative instance selection method is introduced to select images with reliable labels for model training. Combining both techniques effectively reduces the impact of noisy labels. In addition, we introduce a strong baseline that features a compact contrastive loss. Our UCF method consistently achieves state-of-the-art performance in multiple UDA tasks for object Re-ID, and significantly reduces the performance gap between unsupervised and supervised Re-ID. In particular, the performance of our unsupervised UCF method in the MSMT17→Market1501 task is better than that of the fully supervised setting on Market1501. The code of UCF is available at https://github.com/Wang-pengfei/UCF.

Abstract:
Existing online multiple object tracking (MOT) algorithms often consist of two subtasks, detection and re-identification (ReID). In order to enhance the inference speed and reduce the complexity, current methods commonly integrate these two subtasks into a unified framework. Nevertheless, detection and ReID demand different features, which results in an optimization contradiction during training. To alleviate this contradiction, we devise a module named Global Context Disentangling (GCD) that decouples the learned representation into detection-specific and ReID-specific embeddings. As such, this module provides an implicit way to balance the different requirements of these two subtasks. Moreover, we observe that preceding MOT methods typically leverage local information to associate the detected targets and neglect the global semantic relation. To resolve this limitation, we develop a module, referred to as Guided Transformer Encoder (GTE), by combining the powerful reasoning ability of the Transformer encoder and deformable attention. Unlike previous works, GTE avoids analyzing all the pixels and only captures the relation between query nodes and a few self-adaptively selected key samples. Therefore, it is computationally efficient. Extensive experiments have been conducted on the MOT16, MOT17 and MOT20 benchmarks to demonstrate the superiority of the proposed MOT framework, namely RelationTrack. The experimental results indicate that RelationTrack significantly surpasses preceding methods and establishes a new state-of-the-art performance, e.g., IDF1 of 70.5% and MOTA of 67.2% on MOT20.
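
For readers unfamiliar with the two metrics quoted above, the sketch below computes them from their commonly used definitions: MOTA = 1 - (FN + FP + IDSW) / GT and IDF1 = 2·IDTP / (2·IDTP + IDFP + IDFN). The function names and the counts are made up for illustration and are unrelated to the RelationTrack results.

```python
def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(id_true_pos: int, id_false_pos: int, id_false_neg: int) -> float:
    """Identity F1 score: 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denominator = 2 * id_true_pos + id_false_pos + id_false_neg
    return 2 * id_true_pos / denominator if denominator else 0.0


# Hypothetical counts accumulated over a whole sequence.
example_mota = mota(false_negatives=150, false_positives=50, id_switches=10, num_gt=1000)
example_idf1 = idf1(id_true_pos=80, id_false_pos=10, id_false_neg=30)
```

MOTA mostly reflects detection quality, while IDF1 rewards keeping the same identity on a target over time, which is why the two are usually reported together.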

Abstract:
Calibration is a common method for steganalysis, and Intra Prediction Mode (IPM) shift is a typical phenomenon used in calibration to detect video steganography. The current HEVC steganography lacks resistance to steganalysis based on this phenomenon because the new technology of HEVC introduces steganographic distortion in addition to providing more potential steganographic space. In this paper, an HEVC steganographic algorithm that resists IPM shift is proposed. First, we introduce the IPM shift in HEVC, and the previous H.264 steganalytic IPM shift feature is modeled and improved. By analyzing the HEVC encoding process, we found that modifying large-size blocks has a more significant impact on compression efficiency, while small ones are more sensitive to IPM optimality. Therefore, we perform the embedding channel division based on the block size and design the distortion function separately. In addition, we discover a unique IPM transition probability distribution in HEVC. According to our analysis, this unique distribution arises due to HEVC’s MPM rules and the regularity of IPM direction. Modifying IPM in HEVC will change such distribution, thus, a mapping rule is designed based on this distribution to achieve a better embedding effect. Experimental results show that the channel division and proposed distortion function can effectively improve the overall performance. The proposed steganography outperforms the state-of-the-art steganography in resisting steganalysis, bitrate controlling, and visual quality.

Abstract:
In hyperspectral imagery, target detection algorithms are usually based on the spectral signature information. Due to the advance of the spatial resolution of hyperspectral sensors, the ground sample distance may be much smaller than the size of targets. As a result, targets often occupy multiple consecutive pixels, which are referred to as multi-pixel targets. In this paper, we investigate the target detection problem for multi-pixel targets in hyperspectral imagery, when the target spectral signature is known. Jointly exploiting the pixels occupied by a target of interest, we propose a multi-pixel target detector resorting to the generalized likelihood ratio test criterion. Closed-form expressions for the probabilities of the false alarm and detection are derived, which are verified using Monte Carlo simulations. Experimental results on four real hyperspectral datasets show that the proposed detector outperforms its counterparts.

Abstract:
Human perception systems can integrate audio and visual information automatically to obtain a profound understanding of real-world events. Accordingly, fusing audio and visual contents is important to solve the audio-visual event (AVE) localization problem. Although most existing works have fused audio and visual modalities to explore their relationship with attention-based networks, we can delve into their relationship more deeply to improve the fusion capability of the two modalities. In this paper, we propose a dense modality interaction network (DMIN) to elegantly leverage audio and visual information by integrating two novel modules, namely, the audio-guided triplet attention (AGTA) module and the dense inter-modality attention (DIMA) module. The AGTA module enables audio information to guide the network to pay more attention to event-relevant visual regions. This guidance is conducted in the channel, temporal, and spatial dimensions, which emphasize informative features, temporal relationships and spatial regions, to boost the capacity of representations. Furthermore, the DIMA module establishes the dense-relationship between audio and visual modalities. Specifically, the DIMA module leverages the information of all channel pairs of audio and visual features to formulate the cross-modality attention weight, which is superior to the multi-head attention module that uses limited information. Moreover, a novel unimodal discrimination loss (UDL) is introduced to exploit the unimodal and fused features together for more exact AVE localization. The experimental results show that our method is remarkably superior to the state-of-the-art methods in fully- and weakly-supervised AVE settings. To further evaluate the model’s ability to build audio-visual connections, we design a dense cross modality relation network (DCMR) to solve the cross-modality localization task. DCMR is a simple deformation of a DMIN, and the experimental results further illustrate that DIMA can explore denser relationships between the two modalities. Code is available at https://github.com/weizequan/DMIN.git.

Abstract:
We propose a Video Colorization with Hybrid Generative Adversarial Network (VCGAN), an improved approach to video colorization using end-to-end learning and recurrent architecture. The VCGAN addresses two prevalent issues in the video colorization domain: temporal consistency and the unification of the colorization network and refinement network into a single architecture. To enhance colorization quality and spatiotemporal consistency, the mainstream of the generator in VCGAN is assisted by two additional networks, i.e., a global feature extractor and a placeholder feature extractor. The global feature extractor encodes the global semantics of the grayscale input to enhance colorization quality, whereas the placeholder feature extractor serves as a feedback connection to encode the semantics of the previous colorized frame in order to maintain spatiotemporal consistency. If the input to the placeholder feature extractor is changed to the grayscale input, the hybrid VCGAN also has the potential to colorize single images. To improve the color consistency of far frames, we propose a dense long-term loss that minimizes the temporal disparity of every two remote frames. Trained with colorization and temporal losses jointly, VCGAN strikes a good balance between video color vividness and spatiotemporal continuity. Experimental results demonstrate that VCGAN produces higher-quality and temporally more consistent colorful videos than existing approaches.

Abstract:
Multi-view learning has progressed rapidly in recent years. Although many previous studies assume that each instance appears in all views, it is common in real-world applications for instances to be missing from some views, resulting in incomplete multi-view data. To tackle this problem, we propose a novel Latent Heterogeneous Graph Network (LHGN) for incomplete multi-view learning, which aims to use multiple incomplete views as fully as possible in a flexible manner. By learning a unified latent representation, a trade-off between consistency and complementarity among different views is implicitly realized. To explore the complex relationship between samples and latent representations, a neighborhood constraint and a view-existence constraint are proposed, for the first time, to construct a heterogeneous graph. Finally, to avoid any inconsistencies between training and test phase, a transductive learning technique is applied based on graph learning for classification tasks. Extensive experimental results on real-world datasets demonstrate the effectiveness of our model over existing state-of-the-art approaches. Our code is available at: https://github.com/yxjdarren/LHGN_TMM_2022.

Abstract:
In this paper, we consider the lifelong age progression and regression task, which requires synthesizing a person's appearance across a wide range of ages. We propose a simple yet effective learning framework to achieve this by exploiting the prior knowledge of faces captured by well-trained generative adversarial networks (GANs). Specifically, we first utilize a pretrained GAN to synthesize face images with different ages, with which we then learn to model the conditional aging process in the GAN latent space. Moreover, we also introduce a cycle consistency loss in the GAN latent space to preserve a person's identity. As a result, our model can reliably predict a person's appearance for different ages by modifying both shape and texture of the head. Both qualitative and quantitative experimental results demonstrate the superiority of our method over concurrent works. Furthermore, we demonstrate that our approach can also achieve high-quality age transformation for painting portraits and cartoon characters without additional age annotations.

Abstract:
As a subtopic of text-to-image synthesis, text-to-face generation has great potential in face-related applications. In this paper, we propose a generic text-to-face framework, namely, TextFace, to achieve diverse and high-quality face image generation from text descriptions. We introduce text-to-style mapping, a novel method where the text description can be directly encoded into the latent space of a pretrained StyleGAN. Guided by our text-image similarity matching and face captioning-based text alignment, the textual latent code can be fed into the generator of a well-trained StyleGAN to produce diverse face images with high resolution (1024×1024). Furthermore, our model inherently supports semantic face editing using text descriptions. Finally, experimental results quantitatively and qualitatively demonstrate the superior performance of our model.

Abstract:
Attention-based Transformer models have been increasingly employed for automatic music generation. To condition the generation process of such a model with a user-specified sequence, a popular approach is to take that conditioning sequence as a priming sequence and ask a Transformer decoder to generate a continuation. However, this prompt-based conditioning cannot guarantee that the conditioning sequence would develop or even simply repeat itself in the generated continuation. In this paper, we propose an alternative conditioning approach, called theme-based conditioning, that explicitly trains the Transformer to treat the conditioning sequence as a thematic material that has to manifest itself multiple times in its generation result. This is achieved with two main technical contributions. First, we propose a deep learning-based approach that uses contrastive representation learning and clustering to automatically retrieve thematic materials from music pieces in the training data. Second, we propose a novel gated parallel attention module to be used in a sequence-to-sequence (seq2seq) encoder/decoder architecture to more effectively account for a given conditioning thematic material in the generation process of the Transformer decoder. We report on objective and subjective evaluations of variants of the proposed Theme Transformer and the conventional prompt-based baseline, showing that our best model can generate, to some extent, polyphonic pop piano music with repetition and plausible variations of a given condition.

Abstract:
The human visual system response to picture quality degradation due to packet loss is very different from the responses of objective quality measures. While video quality due to packet loss may be impaired for at most one Group of Pictures (GOP), its subjective quality degradation may last for several GOPs. This has a great impact on resource allocation strategies, which normally make decisions based on the instantaneous conditions of the multiplexing buffer. This is because, when the perceptual impact of degraded video quality lasts much longer than its objective degradation period, any resources assigned to the degraded flow are wasted. This paper shows, through both simulations and analysis, that during resource allocation, if the quality of a video stream is significantly degraded, it is better to penalize this degraded flow by withholding its full bandwidth share and instead assign the remaining share to other flows, preventing them from undergoing quality degradation.

Abstract:
Low-light image enhancement aims to improve the quality of images captured under low-light conditions, which is a fundamental problem in computer vision and multimedia areas. Although many efforts have been invested over the years, existing illumination-based models tend to generate unnatural-looking results (e.g., over-exposure). This is because the widely adopted illumination adjustment (e.g., Gamma Correction) breaks down the favorable smoothness property of the original illumination derived from the well-designed illumination estimation model. To settle this issue, a highly efficient and high-quality Self-Reinforced Retinex Projection (SRRP) model is developed in this paper, which contains optimization modules of both illumination and reflectance layers. Specifically, we construct a new fidelity term with the self-reinforced function for the illumination optimization to eliminate the dependence on the illumination adjustment and obtain a desired illumination with an excellent smoothing property. By introducing a flexible feasible constraint, we obtain a reflectance optimization module with projection. Owing to its flexibility, we can extend our model to an enhanced version by integrating a data-driven denoising mechanism as the projection, which is able to effectively handle the noise/artifacts generated in the enhancement procedure. In the experimental part, on one side, we make ample comparative assessments on multiple benchmarks against a considerable number of state-of-the-art methods. These evaluations fully verify the outstanding performance of our method in terms of qualitative and quantitative analyses and execution efficiency. On the other side, we also conduct extensive analytical experiments to indicate the effectiveness and advantages of our proposed model.
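
For context, the Gamma Correction adjustment mentioned above is commonly written as I' = I^γ, applied element-wise to an illumination map normalized to [0, 1]. The sketch below illustrates only that conventional adjustment, not the SRRP model; the function name, the clamping to [0, 1], and the sample values are assumptions.

```python
from typing import List


def gamma_correct(illumination: List[List[float]], gamma: float) -> List[List[float]]:
    """Apply I' = I ** gamma element-wise to an illumination map in [0, 1]."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Clamp first so that out-of-range inputs cannot produce complex or >1 values.
    return [[min(max(value, 0.0), 1.0) ** gamma for value in row] for row in illumination]


# Hypothetical 2x2 illumination map of a dark image.
dark = [[0.0, 0.25], [0.04, 1.0]]

# gamma < 1 brightens dark regions; 0 and 1 are fixed points of the mapping.
brightened = gamma_correct(dark, 0.5)
```

Because the mapping is applied to each pixel independently and nonlinearly, it changes the local gradients of the map, which is one way to see why a smoothness property of the estimated illumination need not survive the adjustment.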

Abstract:
Attention has become an indispensable component of the models of various multimedia tasks like Image Captioning (IC) and Visual Question Answering (VQA). However, most existing attention modules are designed for capturing the spatial dependency, and are still insufficient in semantic understanding, e.g., the categories of objects and their attributes, which is also critical for image captioning. To compensate for this defect, we propose a novel attention module termed Channel-wise Attention Block (CAB) to model channel-wise dependency for both visual modality and linguistic modality, thereby improving semantic learning and multi-modal reasoning simultaneously. Specifically, CAB has two novel designs to tackle the high overhead of channel-wise attention, which are the reduction-reconstruction block structure and the gating-based attention prediction. Based on CAB, we further propose a novel Semantic-enhanced Dual Attention Transformer (termed SDATR), which combines the merits of spatial and channel-wise attentions. To validate SDATR, we conduct extensive experiments on the MS COCO dataset and yield new state-of-the-art performance of 134.5 CIDEr score on COCO Karpathy test split and 136.0 CIDEr score on the official online testing server. To examine the generalization of SDATR, we also apply it to the task of visual question answering, where superior performance gains are also witnessed. The code and models are publicly available at https://github.com/xmu-xiaoma666/SDATR.

Affiliations: Department of Computer Science, Huaqiao University, Xiamen, Fujian, China; Xiamen Key Laboratory of Computer Vision and Pattern Recognition and Fujian Key Laboratory of Big Data Intelligence and Security, Huaqiao University, Xiamen, Fujian, China; Department of Computer Science, Hong Kong Baptist University, Hong Kong, Hong Kong; Center for Future Multimedia and School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, Sichuan, China; Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Nanjing University of Science and Technology, Nanjing, Jiangsu, China

Abstract:
Cross-modal hashing has recently gained increasing attention for its efficiency and fast retrieval speed in indexing multimedia data across different modalities. Nevertheless, multimedia data points often emerge in a streaming manner, and existing online methods often lack the learning capacity to handle both labeled and unlabeled data. To alleviate these concerns, this paper proposes an Online Manifold-Guided Hashing (OMGH) framework, which can incrementally learn the compact hash code of streaming data while adaptively optimizing the hash function in a streaming manner. To be specific, OMGH first exploits a matrix tri-factorization framework to learn the discriminative hash codes for streaming multi-modal data. Then, an online anchor-based manifold structure is designed to sparsely represent the old data and adaptively guide the hash code learning process, which can effectively reduce the complexity in preserving the semantic correlation between the old data and streaming data. Meanwhile, such anchor-based manifold embedding is adaptive to the unsupervised and supervised learning strategies in a flexible way. Besides, an online discrete optimization method is developed to incrementally update the hash functions and optimize the hash codes on streaming data points. As a result, the derived hash codes are more semantically meaningful for various online cross-modal retrieval tasks. Extensive experiments verify the advantages of the proposed OMGH model, which matches or improves on the state-of-the-art cross-modal retrieval performance on three benchmark datasets.

Abstract:
Visual Question Answering aims to answer the free-form natural language question based on the visual clues in a given image. It is a difficult problem as it requires understanding the fine-grained structured information of both language and image for compositional reasoning. To establish the compositional reasoning, recent works attempt to introduce the scene graph in VQA. However, the generated scene graphs are usually quite noisy, which greatly limits the performance of question answering. Therefore, this paper proposes to refine the scene graphs to improve their effectiveness. Specifically, we present a novel Scene Graph Refinement network (SGR), which introduces a transformer-based refinement network to enhance the object and relation features for better classification. Moreover, as the question provides valuable clues for distinguishing whether the ⟨subject, predicate, object⟩ triplets are helpful or not, the SGR network exploits the semantic information presented in the questions to select the most relevant relations for question answering. Extensive experiments conducted on the GQA benchmark demonstrate the effectiveness of our method.

Abstract:
In this paper, we propose an efficient compression scheme for focal stack images (FoSIs) based on a new basis-quadtree representation. In the new basis-quadtree representation, FoSIs are initially reorganized as co-located block groups in the depth dimension. In each group, selective basis blocks and adaptive quadtree partition are optimized to predict the focused or defocused co-located blocks by intra-group approximation. By solving a joint optimization problem, FoSIs can be efficiently represented by the optimal basis blocks, corresponding quadtree partition and approximation parameters, which will be compressed separately. Then, these basis blocks are stitched into several new frames (basis frames) according to their original locations and partition modes. Basis frames are compressed by our designed encoder, where the intra-group approximation is embedded into the high efficiency video coding (HEVC) encoder. Thus, the redundancies of basis blocks can be further eliminated. Finally, the approximation parameters are refined to suppress the amplified errors caused by introduced compression blur after basis frame coding. The refined parameters are compressed losslessly and multiplexed with the bitstream of the basis frames to ensure the reconstruction quality of FoSIs. Experiments on 12 test sequences demonstrate that the proposed scheme can obtain higher coding performance than the state-of-the-art comparison schemes. Specifically, the proposed scheme achieves up to 5.23 dB PSNR gains and 71.59% bitrate savings over the HEVC baseline scheme on sequences I03 and I05, respectively.
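
The PSNR gains quoted above follow the usual definition PSNR = 10·log10(MAX² / MSE). A minimal sketch for 8-bit samples is given below; the function name and the sample pixel values are assumptions made for illustration and have nothing to do with the test sequences in the paper.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], reconstructed: Sequence[int], max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstructed) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstructed)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical flattened 8-bit pixels; every pixel is off by one, so MSE = 1.
ref = [100, 120, 140, 160]
rec = [101, 119, 141, 159]
value_db = psnr(ref, rec)
```

Since the scale is logarithmic, a gain of a few dB corresponds to a large reduction in mean squared error at the same bitrate.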

Abstract:
Speech-image retrieval aims at learning the relevance between image and speech. Prior approaches are mainly based on bi-modal contrastive learning, which cannot adequately alleviate the cross-modal heterogeneity between visual and acoustic modalities. To address this issue, we propose a visual-acoustic-semantic embedding (VASE) method. First, we propose a tri-modal ranking loss by taking advantage of semantic information corresponding to the acoustic data, which introduces an auxiliary alignment to enhance the alignment between image and speech. Second, we introduce a cycle-consistency loss based on feature reconstruction. It can further alleviate the heterogeneity between different data modalities (e.g., visual-acoustic, visual-textual and acoustic-textual). Extensive experiments have demonstrated the effectiveness of our proposed method. In addition, our VASE model achieves state-of-the-art performance on the speech-image retrieval task on the Flickr8K [Harwath and Glass, 2015] and Places [Harwath et al., 2018] datasets.

Abstract:
Fine-grained face editing, as a special case of the image translation task, aims at modifying face attributes according to users’ preference. Although generative adversarial networks (GANs) have achieved great success in general image translation tasks, these models cannot be directly applied to the face editing problem. Ideal face editing is challenging as it has two special requirements: personalization and spatial awareness. To address these issues, we propose a novel Personalized Spatial-aware Affine Modulation (PSAM) method based on a general GAN structure. The key idea is to modulate the intermediate features in a personalized and spatial-aware manner, which corresponds to the face editing procedure. Specifically, for personalization, we adopt both the face image and the desired attribute as input to generate the modulation tensors. For spatial awareness, we set these tensors to be of the same size as the input image, allowing pixel-wise modulation. Extensive experiments in four fine-grained face editing tasks, i.e., makeup, expression, illumination and aging, demonstrate the effectiveness of the proposed PSAM method. The synthesis results of PSAM can be further boosted by a new transferable training strategy.

Abstract:
In this paper, we propose a Mask-Robust Inpainting Network (MRIN) approach to recover the masked areas of an image. Most existing methods learn a single model for image inpainting, under a basic assumption that all masks are of the same type. However, we discover that the masks are usually complex and exhibit various shapes and sizes at different locations of an image, where a single model cannot fully capture the large domain gap across different masks. To address this, we learn to decompose a complex mask area into several basic types and recover the damaged image in a patch-wise manner with a type-specific generator. More specifically, our MRIN consists of a mask-robust agent and an adaptive patch generative network. The mask-robust agent contains a mask selector and a patch locator, which generate mask attention maps to select a patch at each step. Based on the predicted mask attention maps, the adaptive patch generative network inpaints the selected patch with the generator bank, so that it sequentially inpaints each patch with different patch generators according to its mask type. Extensive experiments demonstrate that our approach outperforms most state-of-the-art approaches on the Place2, CelebA, and Paris Street View datasets.

Abstract:
Many RGB-T trackers attempt to attain robust feature representation by utilizing an adaptive weighting scheme (or attention mechanism). Different from these works, we propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data by adaptively adjusting the convolutional kernels for various input images in practical tracking. Given the image pairs as input, we first encode their features with the backbone network. Then, we concatenate these feature maps and generate dynamic modality-aware filters with two independent networks. The visible and thermal filters will be used to conduct a dynamic convolutional operation on their corresponding input feature maps respectively. Inspired by residual connection, both the generated visible and thermal feature maps will be summarized with input feature maps. The augmented feature maps will be fed into the RoI align module to generate instance-level features for subsequent classification. To address issues caused by heavy occlusion, fast motion and out-of-view, we propose to conduct a joint local and global search by exploiting a new direction-aware target driven attention mechanism. The spatial and temporal recurrent neural network is used to capture the direction-aware context for accurate global attention prediction. Extensive experiments on three large-scale RGB-T tracking benchmark datasets validated the effectiveness of our proposed algorithm.

Abstract:
Person re-identification is still a challenging task when moving objects or another person occludes the probe person. Mainstream methods based on even partitioning apply off-the-shelf human semantic parsing to highlight the non-occluded part. In this paper, we apply an attention branch to learn the human semantic partition to avoid the misalignment introduced by even partitioning. In detail, we propose a semantic attention branch to learn 5 human semantic maps. We also note that some accessories or belongings, such as a hat or bag, may provide informative clues to improve person Re-ID. Human semantic parsing, however, usually treats non-human parts as distractions and discards them. To recover the missing clues, we design a branch to capture the salient non-human parts. Finally, we merge the semantic and saliency attention to build an end-to-end network, named S^2-Net. To further improve Re-ID, we develop a trade-off weighting scheme between semantic and saliency attention and set the weight according to the actual scene. Extensive experiments show that S^2-Net achieves competitive performance. S^2-Net achieves 87.4% mAP on Market1501 and obtains 79.3%/56.1% rank-1/mAP on MSMT17 without semantic supervision. The source code is available at https://github.com/upgirlnana/S2Net.

Abstract:
Recently, anomaly detection and localization in multimedia data have received significant attention in the machine learning community. In real-world applications such as medical diagnosis and industrial defect detection, anomalies are present in only a fraction of the images. To extend the reconstruction-based anomaly detection architecture to localized anomalies, we propose a self-supervised learning approach based on random masking followed by restoration, named Self-Supervised Masking (SSM), for unsupervised anomaly detection and localization. SSM not only enhances the training of the inpainting network but also greatly improves the efficiency of mask prediction at inference. Through random masking, each image is augmented into a diverse set of training triplets, thus enabling the autoencoder to learn to reconstruct with masks of various sizes and shapes during training. To improve the efficiency and effectiveness of anomaly detection and localization at inference, we propose a novel progressive mask refinement approach that progressively uncovers the normal regions and finally locates the anomalous regions. The proposed SSM method outperforms several state-of-the-art methods for both anomaly detection and anomaly localization, achieving 98.3% AUC on Retinal-OCT and 93.9% AUC on MVTec AD, respectively.

Abstract:
Low light very likely leads to the degradation of an image’s quality and even causes visual task failures. Existing image enhancement technologies are prone to overenhancement, color distortion or time consumption, and their adaptability is fairly limited. Therefore, we propose a new single low-light image lightness enhancement method. First, an energy model is presented based on the analysis of membrane vibrations induced by photon stimulations. Then, based on the unique mathematical properties of the energy model and combined with the gamma correction model, a new global lightness enhancement model is proposed. Furthermore, a special relationship between image lightness and gamma intensity is found. Finally, a local fusion strategy, including segmentation, filtering and fusion, is proposed to optimize the local details of the global lightness enhancement images. Experimental results show that the proposed algorithm is superior to nine state-of-the-art methods in avoiding color distortion, restoring the textures of dark areas, reproducing natural colors and reducing time cost.

Abstract:
Recognizing the ingredient composition of given food images facilitates the estimation of nutrition facts, which is crucial to various health-related applications. Nevertheless, ingredient recognition is a multi-label long-tailed classification problem, where each image may contain multiple labels and the class distributions are highly imbalanced. Most existing approaches leverage off-the-shelf Convolutional Neural Networks (CNN) for multi-label ingredient recognition, overlooking the long-tailed issue, which results in low accuracy for tail ingredient categories. To address this problem, this paper proposes a dynamic Mixup (D-Mixup) approach that dynamically augments minority ingredients in order to boost the recognition performance for tail ingredient categories. Specifically, our D-Mixup approach dynamically selects two training images based on the predictions of the previous training epoch, and generates a new synthetic image to train the recognition network. In this way, the training samples of tail classes can be dynamically enlarged and more discriminative representations can be learnt for rare classes. Extensive experiments on both the VIREO Food-172 dataset and the UEC Food-100 dataset demonstrate the effectiveness of the proposed D-Mixup method.
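
The image-mixing step described above can be illustrated with a minimal sketch. It shows standard mixup applied to multi-hot ingredient labels, not the authors' D-Mixup: the selection rule `pick_pair` (sampling images in proportion to their loss in the previous epoch), the flat-list image representation, and all function names are assumptions made for illustration only.

```python
import random


def mixup(img_a, img_b, labels_a, labels_b, lam):
    """Blend two images (flat lists of pixel values) and their multi-hot labels.

    lam is the mixing coefficient in [0, 1]; lam = 1 returns the first sample.
    """
    mixed_img = [lam * a + (1.0 - lam) * b for a, b in zip(img_a, img_b)]
    mixed_labels = [lam * a + (1.0 - lam) * b for a, b in zip(labels_a, labels_b)]
    return mixed_img, mixed_labels


def pick_pair(prev_epoch_loss, rng):
    """Hypothetical selection rule (an assumption, not taken from the abstract):
    draw two distinct sample indices with probability proportional to the loss
    each sample incurred in the previous epoch."""
    indices = list(range(len(prev_epoch_loss)))
    first = rng.choices(indices, weights=prev_epoch_loss, k=1)[0]
    rest = [i for i in indices if i != first]
    second = rng.choices(rest, weights=[prev_epoch_loss[i] for i in rest], k=1)[0]
    return first, second
```

With a rule of this kind, images that the network handled poorly in the previous epoch (often those containing rare ingredients) are mixed more frequently, which is one plausible way to realize the dynamic augmentation the abstract describes.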

Abstract:
Optical character recognition and machine translation are usually studied and applied separately. In this paper, we consider a new problem named cross-lingual text image recognition (CLTIR) that integrates these two tasks together. The core of this problem is to recognize source language texts shown in images and transcribe them to the target language in an end-to-end manner. Traditional cascaded systems perform text image recognition and text translation sequentially. This can lead to error accumulation and parameter redundancy problems. To overcome these problems, we propose a multihierarchy cross-modal mimic (MHCMM) framework for end-to-end CLTIR, which can be trained with a massive bilingual text corpus and a small number of bilingual annotated text images. In this framework, a plug-in machine translation model is used as a teacher to guide the CLTIR model for learning representations compatible with image and text modes. Via adversarial learning and attention mechanisms, the proposed mimic method can integrate both global and local information in the semantic space. Experiments on a newly collected dataset demonstrate the superiority of the proposed framework. Our method outperforms other pipelines while containing fewer parameters. Additionally, the MHCMM framework can utilize a large-scale bilingual corpus to further improve the performance efficiently. The visualization of attention scores indicates that the proposed model can read text images in a fashion similar to the machine translation model reading text tokens.

Abstract:
3D object classification is an important task in computer vision. In order to explore the high-order and multi-modal correlations among 3D data, we propose an adaptive multi-hypergraph convolutional network (AMHCN) framework to enhance 3D object classification performance. The proposed network improves current hypergraph neural networks in two aspects. Firstly, existing networks rely on hyperedge-constrained neighborhoods for feature aggregation, which may introduce noise or ignore positive information outside the hyperedges. To address this, we extend partially absorbing random walks (PARW) to hypergraphs in order to capture optimal vertex neighborhoods globally from the hypergraph. Then, based on the PARW on the hypergraph, we design a new hypergraph convolution operator to learn deep embeddings from the optimized high-order correlation, which enables effective information propagation among the most relevant vertices. Secondly, regarding the multi-modal representations used in practice, current multi-modal hypergraph learning models either treat all modalities equally or introduce abundant parameters to learn the weights of different modalities. To overcome these shortcomings, we propose a simple but effective dynamic weighting strategy for combining multi-modal representations, in which the importance of each modality can be adjusted adaptively by the loss function. We apply the proposed model to 3D object classification, and the experimental results on two 3D benchmark datasets demonstrate that our method outperforms the state-of-the-art methods, testifying to the effectiveness of both our convolution method and our multi-modality fusion strategy.

Abstract:
Violence detection in videos can help maintain public order, detect crimes, or provide timely assistance. In this paper, we aim to leverage multimodal information to determine whether successive frames contain violence. Specifically, we propose an audiovisual dependency attention (AVD-attention) module, modified from the co-attention architecture, to fuse visual and audio information, unlike commonly used methods such as feature concatenation, addition, and score fusion. Because the AVD-attention module’s dependency map contains sufficient fusion information, we argue that it should be exploited more fully. A combination pooling method is utilized to convert the dependency map to an attention vector, which can be considered either a new feature that includes fusion information or a mask of the attention feature map. Since some information in the input feature might be lost after processing by attention modules, we employ a multimodal low-rank bilinear method that considers all pairwise interactions between two features in each time step to complement the original information for the output features of the module. AVD-attention outperformed co-attention in experiments on the XD-Violence dataset. Our system outperforms state-of-the-art systems.

Abstract:
Fashion outfit recommendation has attracted considerable attention recently. The problem becomes even more interesting and challenging when considering users’ personalized fashion preferences. Although existing works have successfully improved the recommendation accuracy, the efficiency of computation and storage is still under-investigated and often ignored. In this paper, we propose a discrete content-based tensor factorization model that maps items and users to binary codes for efficient fashion recommendation. We introduce a probabilistic perspective for learning to hash, where the binary codes are sampled from a set of underlying Bernoulli variables. To demonstrate the effectiveness of our model, we collect a large-scale outfit dataset together with user label information from a fashion-focused social website. Extensive experiments on our dataset show that the proposed model outperforms other state-of-the-art methods.
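
To illustrate why binary codes help with computation and storage, the sketch below samples a code from per-bit Bernoulli probabilities, packs it into an integer, and compares codes with a Hamming distance computed by XOR. It is only an illustration of hashing with Bernoulli-sampled codes: the function names and the ranking-by-Hamming-distance rule are assumptions, and the tensor factorization model itself is not reproduced here.

```python
import random


def sample_code(probs, rng):
    """Draw one bit per Bernoulli probability and pack the bits into an int."""
    code = 0
    for p in probs:
        bit = 1 if rng.random() < p else 0
        code = (code << 1) | bit
    return code


def hamming(code_a, code_b):
    """Number of bit positions in which two packed codes differ."""
    return bin(code_a ^ code_b).count("1")


def rank_items(user_code, item_codes):
    """Assumed scoring rule: order item indices by Hamming distance to the user."""
    return sorted(range(len(item_codes)),
                  key=lambda i: hamming(user_code, item_codes[i]))
```

Each code occupies a few machine words, and a comparison is a single XOR followed by a bit count, which is where the efficiency gain over real-valued embeddings comes from.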

Abstract:
Although there are many studies on scene text recognition, few of them focus on the recognition of incomplete text. The performance of existing text recognition algorithms on incomplete text is far from expected, and the recognition of incomplete text remains challenging. In this paper, an end-to-end Two-Stage Inpainting Network for Incomplete Text (TSINIT) is proposed to reconstruct incomplete text into complete text, even when the text appears in various styles and against various backgrounds, so that the reconstructed text can be recognized correctly by existing text recognition algorithms. The proposed TSINIT is divided into a text extraction module (TEM) and a text reconstruction module (TRM) so that the inpainting focuses only on the text. TEM separates the incomplete text from the background and character-like regions at the pixel level, which reduces the ambiguity of text reconstruction caused by the background. TRM reconstructs the incomplete text towards the most likely text by considering the abstract and semantic structures of the text. Furthermore, we build a synthetic incomplete text dataset (SITD), which contains contaminated and abraded text images. SITD is divided into 6 incomplete levels according to the number of pixels in the incomplete regions and the ratio of incomplete characters to all characters. The experimental results show that the proposed method has better inpainting ability for incomplete text than traditional image inpainting algorithms on the proposed SITD and on real images. When the same text recognition method is used, the recognition accuracy of incomplete text on SITD improves much more with the proposed TSINIT than with traditional image inpainting methods.

Abstract:
Cross-modal communications, devoted to collaboratively delivering and processing audio, visual, and haptic signals, have gradually become the supporting technology for emerging multi-modal services. However, the inevitable resource competition among signals of different modalities, as well as unexpected packet loss and latency during transmission, seriously affects the quality of the received signals and the end user's immersive experience (especially the visual experience). To overcome these problems, this paper proposes a cross-modal signal reconstruction strategy from the perspective of human perception. It tries to guarantee visual signal quality by considering potential correlations among modalities when processing audio and haptic signals. On the one hand, a time-frequency masking-based audio-haptic redundancy elimination mechanism is designed by exploiting the similarity of audio-haptic characteristics and human masking effects. On the other hand, based on the fact that non-visual perception can help form and enhance visual perception, an audio-haptic fused visual signal restoration (AHFVR) approach for handling impaired and delayed visual signals is proposed. Experiments on a standard multi-modal database and a constructed practical platform evaluate the performance of the proposed perception-aware cross-modal signal reconstruction strategy.

Abstract:
Scene text detection is still a challenging task, as there may be extremely small or low-resolution strokes and close or arbitrary-shaped texts. In this paper, we propose StrokeNet, which detects texts effectively by capturing the fine-grained strokes and inferring structural relations between the hierarchical representations of each text area in a graph-based network. Different from existing approaches that represent the text area by a series of points or rectangular boxes, we directly localize the strokes of each text instance. We introduce a Stroke Assisted Prediction Network (SAPN), which performs hierarchical representation learning of text areas, effectively capturing extremely small or low-resolution texts. We extract a series of text- and stroke-level rectangular boxes on the predicted text areas, which are treated as graph nodes and grouped to form the corresponding local graphs. A Hierarchical Relation Graph Network (HRGN) then performs relational reasoning and predicts the likelihood of linkages among graph nodes of different levels. It efficiently splits close text instances and groups the node classification results into arbitrary-shaped text areas. We introduce a novel dataset with stroke-level annotations, namely SynthStroke, for offline pre-training of widespread text detectors. Experiments on benchmarks verify the state-of-the-art performance of our method.

Abstract:
Self-supervised monocular depth estimation has succeeded in learning scene geometry from only image pairs or sequences. However, it is still highly ill-posed for self-supervised depth estimation to generate high-quality depth maps with both high global accuracy and fine local details. To address this issue, we propose a novel frequency-based recurrent refinement scheme to improve self-supervised depth estimation. Since the global and local depth representations can be correlated to the low and high frequency coefficients in the frequency domain, respectively, we propose a frequency-based recurrent depth coefficient refinement (RDCR) scheme, which progressively refines both low frequency and high frequency depth coefficients with an RNN-based architecture in a multi-level manner. During the recurrent process, the depth coefficients generated from the previous time step are used as the input to generate the current depth coefficients, yielding progressively optimized depth estimations. Meanwhile, considering that depth details often appear in areas with high image frequency, we further improve depth details during the RDCR process by leveraging the image-based high frequency components. Specifically, in each RDCR module, we enhance the high frequency depth representations by selecting and feeding the informative image-based high frequency features with a learned feature weighting mask. Extensive experiments show that the proposed method achieves globally accurate estimation with fine local details, outperforming other self-supervised methods in both quantitative and qualitative comparisons.

Abstract:
In this work, we propose a new end-to-end optimized two-stream framework called GeometryMotion-Transformer (GMT) for 3D action recognition. We first observe that existing 3D action recognition approaches cannot extract motion representations from point cloud sequences well. Specifically, when extracting motion representations, the existing approaches do not explicitly consider one-to-one correspondence among frames. Besides, the existing methods only extract single-scale motion representations, which cannot adequately model the complex motion patterns of moving objects in point cloud sequences. To address these issues, we first propose a feature extraction module (FEM) to generate one-to-one correspondence among frames without using the voxelization process, and to explicitly extract both geometry and multi-scale motion representations from raw point clouds. Moreover, we also observe that the existing two-stream 3D action recognition approaches simply concatenate or add the geometry and motion features, which cannot fully exploit the relationship between the two-stream features. We therefore also propose an improved transformer-based feature fusion module (FFM) to effectively fuse the two-stream features. Based on the proposed FEM and FFM, we build our GMT for 3D action recognition. Extensive experimental results on four benchmark datasets demonstrate the effectiveness of our backbone GMT.

Abstract:
Recently, deep convolutional neural networks have been applied to image compressive sensing (CS) to improve reconstruction quality while reducing computation cost. Existing deep learning-based CS methods can be divided into two classes: those sampling the image at a single scale and those sampling the image across multiple scales. However, these existing methods treat the low-frequency and high-frequency components of an image equally, which hinders high reconstruction quality. This paper proposes an adaptive multi-scale image CS network in the wavelet domain called AMS-Net, which fully exploits the different importance of the low-frequency and high-frequency components of an image. First, the discrete wavelet transform is used to decompose an image into four sub-bands, namely the low-low (LL), low-high (LH), high-low (HL), and high-high (HH) sub-bands. Considering that the LL sub-band is more important to the final reconstruction quality, AMS-Net allocates it a larger sampling ratio, while allocating the other three sub-bands a smaller one. Since different blocks in each sub-band have different sparsity, the sampling ratio is further allocated block-by-block within the four sub-bands. Then a dual-channel scalable sampling model is developed to adaptively sample the LL sub-band and the other three sub-bands at arbitrary sampling ratios. Finally, by unfolding the iterative reconstruction process of the traditional multi-scale block CS algorithm, we construct a multi-stage reconstruction model that utilizes multi-scale features to further improve the reconstruction quality. Experimental results demonstrate that the proposed model outperforms both traditional and state-of-the-art deep learning-based methods.
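
The four-sub-band decomposition mentioned above can be sketched with a single-level 2D Haar transform, the simplest discrete wavelet transform. The abstract does not state which wavelet AMS-Net uses, so the Haar choice, the orthonormal scaling, and the function name `haar_dwt2` are assumptions; the LH/HL naming also varies between libraries.

```python
def haar_dwt2(image):
    """Single-level 2D Haar transform of an image given as a list of rows.

    The image must have even height and width. Returns four sub-bands
    (LL, LH, HL, HH), each with half the height and half the width.
    """
    rows, cols = len(image), len(image[0])
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, rows, 2):
        ll_row, lh_row, hl_row, hh_row = [], [], [], []
        for c in range(0, cols, 2):
            a, b = image[r][c], image[r][c + 1]          # top-left, top-right
            d, e = image[r + 1][c], image[r + 1][c + 1]  # bottom-left, bottom-right
            ll_row.append((a + b + d + e) / 2.0)  # local average (low-low)
            lh_row.append((a - b + d - e) / 2.0)  # difference along the row direction
            hl_row.append((a + b - d - e) / 2.0)  # difference along the column direction
            hh_row.append((a - b - d + e) / 2.0)  # diagonal difference (high-high)
        ll.append(ll_row)
        lh.append(lh_row)
        hl.append(hl_row)
        hh.append(hh_row)
    return ll, lh, hl, hh
```

For natural images most of the energy ends up in the LL sub-band, which is consistent with the abstract's choice to give LL a larger sampling ratio than the other three sub-bands.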

Abstract:
Few-shot learning (FSL) usually assumes that the query is drawn from the same label space as the support set, while queries from unknown classes may emerge unexpectedly in many open-world application scenarios. Such an open-set issue limits the practical deployment of FSL systems and remains largely unexplored. In this paper, we investigate the problem of few-shot open-set recognition (FSOR) and propose a novel solution, called Relative Feature Displacement Network (RFDNet), which empowers FSL systems to reject queries from unknown classes while accurately classifying those from known classes. First, we suggest a different relative feature displacement learning (RFDL) paradigm for FSOR, i.e., meta-learning a feature displacement relative to a pretrained reference feature embedding. This paradigm is based on our observations of the randomness drift issue of previous meta-learning-based FSOR methods, as well as of the generalization ability of the feature embedding pretrained for general classification. Second, we design the RFDNet framework to implement the RFDL paradigm, which mainly features a task-aware RFD generator and a marginal open-set loss. Comprehensive experiments on three public datasets, i.e., miniImageNet, CIFAR-FS and tieredImageNet, demonstrate that RFDNet consistently outperforms the state-of-the-art methods, achieving improvements of 5.2%, 2.0% and 1.7%, respectively, in terms of AUROC for unknown-class rejection under the 5-way 5-shot setting.

Abstract:
We study and address the multi-view crowd counting (MVCC) problem which poses more realistic challenges than single-view crowd counting for better facilitating crowd management/public safety systems. Its major challenge lies in how to fully distill and aggregate useful, complementary information among multiple camera views to create powerful ground-plane representations for wide-area crowd analysis. In this paper, we present a graph-based, multi-view learning model called Co-Communication Graph Convolutional Network (CoCo-GCN) to jointly investigate intra-view contextual dependencies and inter-view complementary relations. More specifically, CoCo-GCN builds a view-agnostic graph interaction space for each camera view to conduct efficient contextual reasoning, and extends the intra-view reasoning by using a novel Graph Communication Layer (GCL) to also take between-graph (cross-view), complementary information into account. Moreover, CoCo-GCN uses a new Co-Memory Layer (CoML) to jointly coarsen the graphs and close the ‘representational gap’ among them for further exploiting the compositional nature of graphs and learning more consistent representations. Finally, these jointly learned features of multiple views can be easily fused to create ground-plane representations for wide-area crowd counting. Experiments show that the proposed CoCo-GCN achieves state-of-the-art results on three MVCC datasets, i.e., PETS2009, DukeMTMC, and City Street, significantly improving the scene-level accuracy over previous models.

Abstract:
Audio mixture separation is still challenging due to heavy overlaps and interactions. To correctly separate audio mixtures, we propose a novel self-supervised Fine-grained Cycle-Separation Network (FCSN) for vision-guided audio mixture separation. In the proposed approach, we design a two-stage procedure to perform self-supervised separation on audio mixtures. Using visual information as guidance, a primary-stage separation is realized via a U-Net; the residual spectrogram is then calculated by removing the separated spectrograms from the original audio mixture. In the second stage, a cycle-separation module refines the separation using the separated results and the residual spectrogram. Self-supervised learning between the vision and audio modalities drives the cycle separation until the residual spectrogram becomes empty. Extensive experiments are conducted on three large-scale datasets, MUSIC (MUSIC-21), AudioSet, and VGGSound. Experimental results confirm that our approach outperforms the state-of-the-art approaches and demonstrate its effectiveness in separating audio mixtures with overlap and interaction.

Abstract:
Composing and recognizing novel concepts that are combinations of known concepts, i.e., compositional generalization, is one of the greatest powers of human intelligence. With the development of artificial intelligence, it becomes increasingly appealing to build a vision system that can generalize to unknown compositions based on restricted known knowledge, which has so far remained a great challenge to our community. In fact, machines can be easily misled by superficial correlations in the data, disregarding the causal patterns that are crucial to generalization. In this paper, we rethink compositional generalization from a causal perspective, in the context of Compositional Zero-Shot Learning (CZSL). We develop a simple yet strong approach based on our novel Decomposable Causal view (dubbed “DeCa”), by approximating the causal effect with the combination of three easy-to-learn components. Our proposed DeCa is evaluated on two challenging CZSL benchmarks by recognizing unknown compositions of known concepts. Despite its simple design, our approach achieves consistent improvements over state-of-the-art baselines, demonstrating its superiority towards the goal of compositional generalization.

Abstract:
Recently, methods for unsupervised embedding learning have exhibited promising results for extracting desirable representations from unlabeled samples. In general, most methods learn the feature embeddings by handling each sample individually while the structural and semantic relationships between samples are not fully exploited. As a result, the learned embeddings are not sufficiently discriminative. To make use of such inter-sample information for deep embedding learning, this paper proposes an unsupervised method based on the graph convolutional network (GCN). On one hand, our method encodes structural information between the samples corresponding to the nodes in a local neighbourhood of the GCN graph. On the other hand, it leverages the mutual information between the original samples and the augmented ones to ensure that they are globally consistent with each other. Extensive experiments show that our method is not just robust to augmentation perturbations, but also learns discriminative embeddings. Consequently, it achieves the state-of-the-art performance on several challenging datasets.

Abstract:
State-of-the-art image style transfer methods have achieved impressive results by using neural networks. However, neural style transfer (NST) methods either ignore the local details of the style image by using the global statistics for style modeling or cannot fully use shallow features of neural networks, leading to the synthesized image having fewer details. In this study, we proposed a new patch-based style transfer method that directly operates in the image pixel domain without using any neural networks, achieving fascinating style transfer results with rich image details. The proposed method was derived from classic texture synthesis methods. Most previous methods rely on nearest neighbor search (NNS) for patch matching. However, this greedy strategy cannot guarantee the similarity of patch distributions between the synthesized image and the style image, which limits the expressiveness of textures. We solved this problem by proposing an optimal patch matching algorithm formed on the Optimal Transport (OT) theory, which theoretically guarantees the similarity of the patch distributions and gives a flexible style modeling method. Various qualitative and quantitative experiments demonstrated that the proposed method achieves better synthesized results than state-of-the-art style transfer methods, including NST and classic methods based on texture synthesis.
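
The difference between greedy nearest neighbor search and a distribution-preserving match can be seen in a toy sketch. Here each "patch" is reduced to a single number, and the optimal match is found by brute force over one-to-one assignments, which is the special case of discrete optimal transport with uniform weights and equally many patches on both sides. This is an illustration under those assumptions, not the algorithm of the paper; all names are hypothetical, and the brute-force search is only feasible for a handful of patches.

```python
from itertools import permutations


def cost(a, b):
    """Squared difference between two scalar 'patches'."""
    return (a - b) ** 2


def nearest_neighbor_match(content, style):
    """Greedy NNS: every content patch independently picks its closest style patch."""
    return [min(range(len(style)), key=lambda j: cost(c, style[j])) for c in content]


def optimal_one_to_one_match(content, style):
    """Brute-force minimum-cost matching that uses every style patch exactly once.

    Assumes len(content) == len(style).
    """
    best = min(
        permutations(range(len(style))),
        key=lambda perm: sum(cost(c, style[j]) for c, j in zip(content, perm)),
    )
    return list(best)


content_patches = [1, 2, 3]
style_patches = [0, 5, 10]
nns = nearest_neighbor_match(content_patches, style_patches)    # [0, 0, 1]
ot = optimal_one_to_one_match(content_patches, style_patches)   # [0, 1, 2]
```

In this example the greedy match reuses the first style patch and never uses the third, so the synthesized patch distribution drifts from the style distribution, whereas the one-to-one match reproduces it exactly.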

Abstract:
Recent methods for single image super-resolution (SISR) have demonstrated outstanding performance in generating high-resolution (HR) images from low-resolution (LR) images. However, most of these methods show their superiority using synthetically generated LR images, and their generalizability to real-world images is often not satisfactory. In this paper, we pay attention to two well-known strategies developed for robust super-resolution (SR), i.e., reference-based SR (RefSR) and zero-shot SR (ZSSR), and propose an integrated solution, called reference-based zero-shot SR (RZSR). Following the principle of ZSSR, we train an image-specific SR network at test time using training samples extracted only from the input image itself. To advance ZSSR, we obtain reference image patches with rich textures and high-frequency details which are also extracted only from the input image using cross-scale matching. To this end, we construct an internal reference dataset and retrieve reference image patches from the dataset using depth information. Using LR patches and their corresponding HR reference patches, we train a RefSR network that is embodied with a non-local attention module. Experimental results demonstrate the superiority of the proposed RZSR compared to the previous ZSSR methods and robustness to unseen images compared to other fully supervised SISR methods.

Abstract:
In this paper, we focus on the crowd localization task, a crucial topic of crowd analysis. Most regression-based methods utilize convolutional neural networks (CNN) to regress a density map, which cannot accurately locate instances in extremely dense scenes for two crucial reasons: 1) the density map consists of a series of blurry Gaussian blobs, and 2) severe overlaps exist in the dense regions of the density map. To tackle this issue, we propose a novel Focal Inverse Distance Transform (FIDT) map for the crowd localization task. Compared with density maps, the FIDT maps accurately describe the persons' locations without overlapping in dense regions. Based on the FIDT maps, a Local-Maxima-Detection-Strategy (LMDS) is derived to effectively extract the center point of each individual. Furthermore, we introduce an Independent SSIM (I-SSIM) loss that encourages the model to learn local structural information and thus better recognize local maxima. Extensive experiments demonstrate that the proposed method reports state-of-the-art localization performance on six crowd datasets and one vehicle dataset. Additionally, we find that the proposed method shows superior robustness on negative and extremely dense scenes, which further verifies the effectiveness of the FIDT maps.
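
The idea of reading head positions off a map as local maxima can be sketched as follows. The map used here is a plain inverse distance map, 1 / (distance to the nearest annotated point + 1), which serves only as a stand-in for the focal variant named in the abstract; the 3x3 neighbourhood, the threshold parameter, and the function names are likewise assumptions rather than the published LMDS.

```python
import math


def inverse_distance_map(points, height, width):
    """Plain inverse distance map: 1 / (distance to the nearest point + 1).

    points is a list of (row, col) annotations; the map peaks at 1.0 on each point.
    """
    return [
        [1.0 / (min(math.hypot(y - py, x - px) for py, px in points) + 1.0)
         for x in range(width)]
        for y in range(height)
    ]


def local_maxima(score_map, threshold):
    """Return (row, col) positions whose value is at least `threshold` and is
    not exceeded by any value in the surrounding 3x3 neighbourhood."""
    height, width = len(score_map), len(score_map[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = score_map[y][x]
            if value < threshold:
                continue
            neighbours = [
                score_map[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            if value >= max(neighbours):
                peaks.append((y, x))
    return peaks
```

Because each annotated point produces its own sharp peak instead of a blurred blob, two nearby heads remain separable as distinct maxima, which is the property the abstract contrasts with density maps.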

Abstract:
In this paper, we propose a novel calibration-free cross-camera target association algorithm that aims to relate local visual data of the same object across cameras with overlapping FOVs. Unlike other methods using object's own characteristics, our approach makes full use of the interactions between objects and explores their spatiotemporal consistency in projection transformation to associate cameras. It has wider applicability in deployed overlapping multi-camera systems with unknown or rarely available calibration data, especially if there is a large perspective gap between cameras. Specifically, we first extract trajectory intersection which is one of the typical object-object interactive behaviors from each camera for feature vector construction. Then, based on the consistency of object-object interactions, we propose a multi-camera spatiotemporal alignment method via wide-domain cross-correlation analysis. It realizes time synchronization and spatial calibration of the multi-camera system simultaneously. After that, we introduce a cross-camera target association approach using aligned object-object interactions. The local data of the same target are successfully associated across cameras without any additional calibration. Extensive experimental evaluations on different databases verify the effectiveness and robustness of our proposed method.

Abstract:
Fully unsupervised person re-identification (ReID) methods aim to learn discriminative features without using labeled ReID data. Because these methods are easily affected by camera discrepancies, similar studies have typically designed optimization methods to enable the model to learn camera-invariant features. However, they often ignore the impact of camera discrepancies on clustering results. Specifically, camera discrepancies will reduce the intra-class camera diversity and promote the generation of noise labels. To solve the above problems, we propose a unified unsupervised learning framework: camera invariant feature learning (CIFL) framework. First, we designed a novel DBSCAN-NN algorithm in the CIFL framework that improves the intra-class camera diversity by forcibly merging samples from different cameras. Then, we designed feature ensemble clustering that improves the accuracy of the pseudo-labels by clustering feature ensembles. In addition, we designed an optimization method for camera discrepancies: stochastic pulled loss. With the stochastic pulled loss, the ReID model is forced to learn camera-invariant features. We verified the effectiveness and generalization of CIFL on four ReID datasets (Market-1501, DukeMTMC-reID, MSMT17 and CUHK03-NP). The experimental results show that CIFL not only outperforms the existing fully unsupervised methods but also is superior to the unsupervised domain adaptation methods.

Abstract:
The Siamese tracking paradigm has achieved great success, providing effective appearance discrimination and size estimation through classification and regression. However, such a paradigm typically optimizes classification and regression independently, leading to task misalignment (accurate prediction boxes do not receive high target confidence scores). In this paper, to alleviate this misalignment, we propose a novel tracking paradigm, called SiamLA. Within this paradigm, a series of simple yet effective localization-aware components are introduced to generate localization-aware target confidence scores. Specifically, with the proposed localization-aware dynamic label (LADL) loss and localization-aware label smoothing (LALS) strategy, collaborative optimization between classification and regression is achieved, enabling classification scores to be aware of the location state, not just appearance similarity. Besides, we propose a separate localization-aware quality prediction (LAQP) branch to produce location quality scores that further modify the classification scores. To guide a more reliable modification, a novel localization-aware feature aggregation (LAFA) module is designed and embedded into this branch. Consequently, the resulting target confidence scores are more discriminative for the location state, so that accurate prediction boxes tend to receive high scores. Extensive experiments are conducted on six challenging benchmarks, including GOT-10k, TrackingNet, LaSOT, TNL2K, OTB100 and VOT2018. Our SiamLA achieves competitive performance in terms of both accuracy and efficiency. Furthermore, a stability analysis reveals that our tracking paradigm is relatively stable, implying that the paradigm has potential for real-world applications.

Abstract:
Live holographic teleportation is an emerging media application that allows Internet users to communicate in a fully immersive environment. One distinguishing feature of such an application is the ability to teleport multiple objects from different network locations into the receiver's field of view at the same time, mimicking the effect of group-based communications in a common physical space. In this case, live teleportation frames originated from different sources must be precisely synchronised at the receiver side to ensure user experiences with eliminated perception of motion misalignment effect. For the very first time in the literature, we quantify the motion misalignment between remote sources with different network contexts in order to justify the necessity of such frame synchronisation operations. Based on this motivation, we propose HoloSync, a novel edge-computing-based scheme capable of achieving controllable frame synchronisation performances for multi-source holographic teleportation applications. We carry out systematic experiments on a real system with the HoloSync scheme in terms of frame synchronisation performances in specific network scenarios, and their sensitivity to different control parameters.

Abstract:
Human pose estimation has been widely studied with much focus on supervised learning. However, in real applications, a pretrained pose estimation model usually needs to be adapted to a novel domain without labels or with sparse labels. Existing domain adaptation methods cannot handle this well, since poses have flexible topological structures and need fine-grained local features. Targeting these characteristics of human pose, we propose a novel domain adaptation method for multi-person pose estimation (MPPE) to alleviate the human-level shift. Firstly, the training samples of human poses are clustered into groups according to posture similarity. Within the clustered space, we conduct three adaptation modules: Cross-Attentive Feature Alignment (CAFA), Intra-domain Structure Adaptation (ISA) and Adaptive Human-Topology Adaptation (AHTA). CAFA adopts a bidirectional spatial attention mechanism to explore fine-grained local feature correlation between two humans, and thus adaptively aggregates consistent features for adaptation. ISA only works in semi-supervised domain adaptation (SSDA), exploiting the semantic relationship of corresponding keypoints to reduce the intra-domain bias. Importantly, we propose AHTA to enrich human topological knowledge for reducing the inter-domain discrepancy. Specifically, the pose structure and the cross-instance topological relations are modeled via graph networks. This flexible topology learning benefits the inference of occluded or extreme poses. Extensive experiments are conducted on two popular benchmarks and two additional challenging datasets. Results demonstrate the competency of our method, which works in unsupervised or semi-supervised modes, compared with the existing supervised approaches.

Abstract:
Grounding objects described in natural language to visual regions in the video is a crucial capability needed in vision-and-language fields. In this paper, we deal with the weakly-supervised video object grounding (WSVOG) task, where only video-sentence pairs are provided for learning. The essence of this task is to learn the cross-modal associations between words in textual modality and regions in visual modality. Despite the recent progress, we find that most existing methods focus on the association learning for cross-modal samples, while the rich and complementary information within uni-modal samples has not been fully exploited. To this end, we propose to explicitly learn uni-modal associations on both textual and visual sides, so as to fully exploit the useful uni-modal information for accurate video object grounding. Specifically, (1) we learn textual prototypes by considering rich contextual information of the same object in different sentences, and (2) we estimate visual prototypes in an adaptive manner so as to overcome the uncertainties in selecting object-relevant visual regions. Besides, a cross-modal correspondence is learned which not only bridges the visual and textual modalities for WSVOG task, but also tightly cooperates with the uni-modal association learning process. We conduct extensive experiments on three popular datasets, and the favorable results demonstrate the effectiveness of our method.

Abstract:
Object re-identification (re-ID) is one of the core technologies in Multi-Object Tracking (MOT), which requires real-time decision-making. A Neural Processing Unit (NPU) is a low-power device that is dedicated to deploying neural network-based algorithms and has become one of the most important devices in today's mobile onboard systems. However, the current mainstream re-ID methods rarely consider the NPU characteristics, which makes it difficult for these methods to achieve both high onboard frame rates and high accuracies on an NPU. To address this problem, this paper focuses on designing a re-ID algorithm suitable for NPU deployment. An object re-ID model can be divided into two parts: the encoder (backbone) and the decoder. In this article, a Mobile-efficient Pure Part Model (MPPM) is presented for the re-ID task. First, for the backbone, we propose an efficient structure, GogglesNet, which is composed of traditional convolutions. GogglesNet performs well on the re-ID task, is comparable in accuracy to lightweight networks on ImageNet, and is faster on NPU. We then revisit the architectures of the Pure Part Model (PPM) in person re-ID, including PCB and MGN, and propose a mobile-efficient decoder, Dual Pattern Network (DPN), for re-ID. The proposed MPPM achieves performance comparable with MGN on five re-ID datasets (Market-1501, DukeMTMC-reID, MSMT17, VeRi-776, and VehicleID), while its parameter count is only 10.2% of MGN's and its speed on NPU is more than eight times higher.

Abstract:
Class-Incremental Learning (CIL) aims at incrementally learning novel classes without forgetting old ones. This capability becomes more challenging when novel tasks contain one or a few labeled training samples, which leads to a more practical learning scenario, i.e., Few-Shot Class-Incremental Learning (FSCIL). The dilemma of FSCIL lies in serious overfitting and exacerbated catastrophic forgetting caused by the limited training data from novel classes. In this paper, motivated by the easy accessibility of unlabeled data, we conduct a pioneering work and focus on a Semi-Supervised Few-Shot Class-Incremental Learning (Semi-FSCIL) problem, which requires the model to incrementally learn new classes from extremely limited labeled samples and a large number of unlabeled samples. To address this problem, a simple but efficient framework is first constructed based on the knowledge distillation technique to alleviate catastrophic forgetting. To efficiently mitigate the overfitting problem on novel categories with unlabeled data, uncertainty-guided semi-supervised learning is incorporated into this framework to select unlabeled samples for incremental learning sessions according to the model uncertainty. This process provides extra reliable supervision for the distillation process and contributes to better formulating the class means. Our extensive experiments on CIFAR100, miniImageNet and CUB200 datasets demonstrate the promising performance of our proposed method, and define baselines in this new research direction.
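
The abstract above names knowledge distillation as the mechanism against forgetting but does not give the loss. As a point of reference, a minimal sketch of the commonly used temperature-softened distillation loss follows; it assumes the previous-session model acts as teacher and the current model as student, and the function names and default temperature are illustrative choices, not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_logits, new_logits, temperature=2.0):
    """KL(old || new) between softened distributions, scaled by T^2.

    old_logits: outputs of the frozen previous-session model (teacher).
    new_logits: outputs of the model being trained (student).
    """
    p = softmax(old_logits, temperature)
    q = softmax(new_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

The loss is zero when the student reproduces the teacher's outputs on the old classes and grows as the two distributions drift apart, which is what penalizes forgetting.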

Abstract:
Lip reading is the task of decoding text from speakers' mouth movements. Numerous deep learning-based methods have been proposed to address this task. However, these existing deep lip reading models suffer from poor generalization due to overfitting the training data. To resolve this issue, we present a novel learning paradigm that aims to improve the interpretability and generalization of lip reading models. Specifically, a Variational Temporal Mask (VTM) module is customized to automatically analyze the importance of frame-level features. Furthermore, prediction consistency constraints between global information and locally important temporal features are introduced to strengthen the model generalization. We evaluate the novel learning paradigm with multiple lip reading baseline models on the LRW and LRW-1000 datasets. Experiments show that the proposed framework significantly improves the generalization performance and interpretability of lip reading models.

Abstract:
Attributed graph clustering, which learns node representation from node attribute and topological graph for clustering, is a fundamental and challenging task for multimedia network-structured data analysis. Recently, graph contrastive learning (GCL)-based methods have obtained impressive clustering performance on this task. Nevertheless, there still remain some limitations to be solved: 1) most existing methods fail to consider the self-consistency between latent representations and cluster structures; and 2) most methods require a post-processing operation to get clustering labels. Such a two-step learning scheme results in models that cannot handle newly generated data, i.e., out-of-sample (OOS) nodes. To address these issues in a unified framework, a Self-consistent Contrastive Attributed Graph Clustering (SCAGC) network with pseudo-label prompt is proposed in this article. In SCAGC, by clustering labels prompt information, a self-consistent contrastive loss, which aims to maximize the consistencies of intra-cluster representations while minimizing the consistencies of inter-cluster representations, is designed for representation learning. Meanwhile, a clustering module is built to directly output clustering labels by contrasting the representation of different clusters. Thus, for the OOS nodes, SCAGC can directly calculate their clustering labels. Extensive experimental results on seven benchmark datasets have shown that SCAGC consistently outperforms 16 competitive clustering methods.

Abstract:
Weakly-supervised temporal action localization aims to localize actions from untrimmed long videos with only video-level category labels. Most previous methods ignore the incompleteness issue of Class Activation Sequences (CAS), suffering from trivial detection results. To tackle this issue, we propose a novel Adaptive Mutual Supervision (AMS) framework with two branches, where the base branch detects the most discriminative action regions, while the supplementary branch localizes the less discriminative action regions through an adaptive sampler. The sampler dynamically updates the inputs for the supplementary branch using a sampling weight sequence negatively correlated with the CAS from the base branch, thus encouraging the supplementary branch to localize the action regions underestimated by the base branch. To promote mutual enhancement between two branches, we further construct mutual location supervision. Each branch adopts the location pseudo-labels generated from the other branch as the localization supervision. By alternately optimizing two branches for multiple iterations, we progressively complete action regions. Extensive experiments on THUMOS14 and ActivityNet1.2 demonstrate that the proposed AMS method significantly outperforms state-of-the-art methods.

Abstract:
Haze reduces the visibility of image content and leads to failure in handling subsequent computer vision tasks. In this paper, we address the problem of single image dehazing by proposing a dehazing network named T-Net, which consists of a backbone network based on the U-Net architecture and a dual attention module. Multi-scale feature fusion can be achieved by using skip connections with a new fusion strategy. Furthermore, by repeatedly unfolding the plain T-Net, Stack T-Net is proposed to take advantage of the dependence of deep features across stages via a recursive strategy. To reduce network parameters, the intra-stage recursive computation of ResNet is adopted in our Stack T-Net. We take both the stage-wise result and the original hazy image as input to each T-Net and finally output the prediction of the clean image. Experimental results on both synthetic and real-world images demonstrate that our plain T-Net and the advanced Stack T-Net perform favorably against state-of-the-art dehazing algorithms and show that our Stack T-Net could further improve the dehazing effect, demonstrating the effectiveness of the recursive strategy.

Abstract:
Video super-resolution (VSR) is a fundamental and challenging task in computer vision. Many of the existing VSR works focus on how to effectively align neighboring frames to better incorporate temporal information, while little work is devoted to the important subsequent step of inter-frame information fusion, and the existing methods on frame fusion have shortcomings such as not being able to make full use of spatio-temporal information. In this work, we propose a Frame-by-frame Feedback Fusion Network (FFFN) for VSR tasks. By applying the feedback learning mechanism commonly existing in the human cognitive system to the frame fusion stage, FFFN can refine low-level representation of the fused frames with high-level information in a coarse-to-fine manner. Specifically, after the neighboring frames are aligned, we first rearrange them from near to far according to the distance from the reference frame in the temporal space, and then feed them one-by-one into a proposed recurrent structure called Feedback Fusion Module (FFM), which is then able to iteratively generate high-level representation of the fused frames with several Feature Refinement Groups (FRGs) and feedback connections. Finally, we design a Dual-path Residual Reconstruction Module (DRRM) to reconstruct the final high-resolution image. The proposed FFFN comes with a strong frame fusion and reconstruction ability, and extensive experiments on several benchmark data sets show that it achieves favorable performance against state-of-the-art methods.
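
The near-to-far rearrangement step described above can be sketched in a few lines. The tie-breaking rule (the earlier frame first when two frames are equally distant) and the name `near_to_far` are assumptions made for illustration; the abstract does not specify them.

```python
def near_to_far(frame_indices, reference):
    """Order neighboring frames by temporal distance from the reference frame.

    Ties between equally distant frames are broken by taking the earlier
    frame first (an assumption; the abstract does not state a rule).
    """
    neighbors = [i for i in frame_indices if i != reference]
    return sorted(neighbors, key=lambda i: (abs(i - reference), i))


# Seven-frame window centered on frame 3.
order = near_to_far(range(7), reference=3)  # [2, 4, 1, 5, 0, 6]
```

In the described pipeline, the aligned frames would then be fed one by one, in this order, into the recurrent fusion module.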

Abstract:
How to explore the interaction between image aesthetic rules and crops is the key to finding views with good composition. Besides, it is subjective to evaluate candidate crops, which mainly depends on aesthetic knowledge, but it is not an easy task for people without extensive photography experience. However, existing methods mostly find good views by extracting general aesthetic features of crops without fully exploring the aesthetic rules. Motivated by this, we innovatively propose a composition-guided image cropping aesthetic assessment network (CGICAANet) for efficiently finding good crops and optimizing the cropping operation. Specifically, we adopt a direct and comprehensive composition pattern module, which adaptively mines suitable compositions for the images and emphasizes the dominant position of visual elements to contribute to optimizing the best crops in an interpretable way. Moreover, we designed a multi-task loss function to train the model. Particularly, to explore the commonality between predicted crops and labels, the complete intersection-over-union loss is adopted thoroughly considering the overlap area, central point distance and the consistency of aspect ratios for crops concurrently. Therefore, the predicted best crop can preserve the visual elements and have better composition. Experimental results with lightweight MobileNetV2 and ShuffleNetV2 as backbone networks demonstrate that our method can obtain comparable or better performance in terms of efficiency and accuracy.
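
For reference, the complete intersection-over-union (CIoU) loss mentioned above combines the three terms the abstract lists: overlap area, center-point distance and aspect-ratio consistency. The sketch below follows the standard CIoU definition; the corner-format box convention `(x1, y1, x2, y2)` and the small `eps` stabilizer are assumptions of this example, not details from the paper.

```python
import math


def ciou_loss(pred, target, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with positive size."""
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target
    pw, ph = px2 - px1, py2 - py1
    tw, th = tx2 - tx1, ty2 - ty1

    # Overlap term: plain intersection over union.
    iw = max(0.0, min(px2, tx2) - max(px1, tx1))
    ih = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = iw * ih
    iou = inter / (pw * ph + tw * th - inter)

    # Distance term: squared center distance over squared diagonal
    # of the smallest box enclosing both boxes.
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2
    c2 = (max(px2, tx2) - min(px1, tx1)) ** 2 \
       + (max(py2, ty2) - min(py1, ty1)) ** 2

    # Aspect-ratio term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(tw / th) - math.atan(pw / ph)) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v
```

Identical boxes give a loss of zero, and the distance term stays informative even when the boxes do not overlap at all, which plain IoU loss cannot do.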

Abstract:
The aim of operation chain detection for a given manipulated image is to reveal the operations involved and the order in which they were applied, which is significant for image processing and multimedia forensics. Currently, all existing approaches simply treat image operation chain detection as a classification problem and consider only chains of at most two operations. Considering the complex interplay between operations and the exponentially increasing solution space, detecting longer operation chains is extremely challenging. To address this issue, in this work, we devise a new methodology for image operation chain detection. Different from existing approaches based on classification modeling, we strategically conduct operation chain detection within a machine translation framework. Specifically, the chain in our work is modeled as a sentence in a target language, with each possible operation represented by a word in that language. When executing chain detection, we propose first transforming the input image into a sentence in a latent source language from the learned deep features. Then, we propose translating the latent language into the target language within a machine translation framework and finally decoding all operations, arranged in order. Besides, a chain inversion strategy and a bi-directional modeling mechanism are developed to improve the detection performance. We further design a weighted cross-entropy loss to alleviate the problems presented by imbalance among chain lengths and chain categories. Our method can detect operation chains containing up to seven operations and obtains very promising results in various scenarios for the detection of both short and long chains.
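
The abstract above mentions a weighted cross-entropy loss for imbalanced chain lengths and categories but does not say how the weights are chosen. A minimal sketch under one common assumption, weights inversely proportional to class frequency, is given below; both function names and the weighting rule are illustrative, not the authors' formulation.

```python
import math
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    probs: one list of class probabilities per sample.
    labels: integer class index per sample.
    weights: mapping from class index to its weight.
    """
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(weights[y] for y in labels)
    return total / norm


# Hypothetical imbalanced batch: three samples of class 0, one of class 1.
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels)
```

With these weights, the single minority-class sample contributes as much to the loss as the three majority-class samples combined.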

Abstract:
Nowadays, people are accustomed to posting images and associated text for expressing their emotions on social networks. Accordingly, multimodal sentiment analysis has drawn increasingly more attention. Most of the existing image-text multimodal sentiment analysis methods simply predict the sentiment polarity. However, the same sentiment polarity may correspond to quite different emotions, such as happiness vs. excitement and disgust vs. sadness. Therefore, sentiment polarity is ambiguous and may not convey the accurate emotions that people want to express. Psychological research has shown that objects and words are emotional stimuli and that semantic concepts can affect the role of stimuli. Inspired by this observation, this paper presents a new MUlti-Level SEmantic Reasoning network (MULSER) for fine-grained image-text multimodal emotion classification, which not only investigates the semantic relationship among objects and words respectively, but also explores the semantic relationship between regional objects and global concepts. For image modality, we first build graphs to extract objects and global representation, and employ a graph attention module to perform bilevel semantic reasoning. Then, a joint visual graph is built to learn the regional-global semantic relations. For text modality, we build a word graph and further apply graph attention to reinforce the interdependencies among words in a sentence. Finally, a cross-modal attention fusion module is proposed to fuse semantic-enhanced visual and textual features, based on which informative multimodal representations are obtained for fine-grained emotion classification. The experimental results on public datasets demonstrate the superiority of the proposed model over the state-of-the-art methods.

Abstract:
Meta-learning provides a promising way for deep learning models to efficiently learn in few-shot learning. With this capacity, many deep learning systems can be applied in many real applications. However, many existing meta-learning based few-shot learning systems suffer from vulnerable generalization when new tasks are from unseen domains (a.k.a, cross-domain few-shot learning). In this work, we consider this problem from the perspective of designing a model-agnostic meta-training framework to improve the generalization of existing meta-learning methods in cross-domain few-shot learning. In this way, compared with focusing on elaborately designing modules for a specific meta-learning model, our method is endowed with the ability to be compatible with different meta-learning models in various few-shot problems. To achieve this goal, a novel adversarial meta-training framework is proposed. The proposed framework utilizes max-min episodic iteration. In the episode of maximization, our framework focuses on how to dynamically generate appropriate pseudo tasks which benefit learning cross-domain knowledge. In the episode of minimization, our method aims to solve how to help meta-learning model learn cross-task and robust meta-knowledge. To comprehensively evaluate our framework, experiments are conducted on two few-shot learning settings, three meta-learning models, and eight datasets. These results demonstrate that our method is applicable to various meta-learning models in different few-shot learning problems. The superiority of our method is verified compared with existing state-of-the-art methods.

Abstract:
Point clouds are becoming a popular medium to describe 3D scenes, benefitting from their accuracy and completeness in expressing the spatial and geometrical information of objects. However, because point clouds are unordered and unevenly distributed, merely selecting neighbors in Euclidean space is inefficient and ignores positional structure. To fill this gap, we propose a structure-aware graph convolution network (SA-GCN), which consists of an adaptive dilated KNN module (ADKNN), a learnable graph filter (LGF), and a structure-aware feature transformation module (SFT). Specifically, the ADKNN module can dynamically adjust the range of grouping neighbor points, and is general enough to improve the performance of arbitrary KNN-based methods. Moreover, with the localized auxiliary information provided by LGF, our SFT module disentangles the spatial details as a form of coding guidance for better deep feature representations. Extensive experimental results on point cloud classification and segmentation tasks demonstrate the superiority of our proposed network.
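
To make the idea of a dilated neighbor search concrete, the sketch below shows the fixed-dilation variant: take the `k * dilation` nearest points and keep every `dilation`-th one, which widens the receptive range without increasing `k`. The ADKNN module described above adjusts this range dynamically; here the dilation is a fixed hypothetical parameter, and `dilated_knn` is an illustrative name.

```python
def dilated_knn(points, query_idx, k, dilation):
    """Return indices of k neighbors sampled from the k*dilation nearest points.

    points: list of coordinate tuples.
    dilation: stride over the distance-sorted neighbor list (1 = plain KNN).
    """
    q = points[query_idx]
    by_distance = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, q)), i)
        for i, p in enumerate(points)
        if i != query_idx
    )
    return [i for _, i in by_distance[: k * dilation : dilation]]


# Seven points on a line; neighbors of point 0.
line = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,), (6.0,)]
plain = dilated_knn(line, 0, k=3, dilation=1)    # [1, 2, 3]
dilated = dilated_knn(line, 0, k=3, dilation=2)  # [1, 3, 5]
```

The brute-force sort is for clarity only; a practical implementation would use a spatial index or batched tensor operations.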

Abstract:
3D point cloud data formats are used to express three-dimensional (3D) information using numerous points in a 3D space. A key challenge is the delivery of high-quality 3D point cloud for the users under a diverse channel quality and available bandwidth to share the same 3D space across multiple untethered extended reality (XR) users. The existing digital-based schemes suffer from two issues owing to the diversity: cliff and leveling-off effects. This paper proposes a novel soft multicasting scheme of point cloud data for untethered XR users. The key ideas of the proposed scheme are three-fold: 1) integration of graph signal processing and analog modulation to adaptively improve the 3D reconstruction quality according to the channel quality for all individual XR users, 2) integration of Givens rotation and non-uniform adaptive quantization to reduce metadata overhead for the graph Fourier transform, and 3) prioritized transmission of the metadata to realize adaptive quality improvement based on the bandwidth available for each XR user. This paper reveals that the proposed scheme prevents cliff and leveling-off effects even when the XR users experience different channel qualities. Furthermore, the proposed transmission exhibits better 3D reconstruction quality compared with the state-of-the-art graph-based delivery scheme in band-limited environments.
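
The abstract does not spell out how Givens rotations are applied to the graph Fourier transform metadata. As background only, the sketch below shows the basic building block: a single Givens rotation is fully described by one angle (or its cosine and sine) and rotates a pair of values so that the second becomes zero. Orthogonal matrices can be factored into such rotations, which is presumably why they are attractive for compact metadata, but that link is an interpretation and not a statement from the paper. The function names are illustrative.

```python
import math


def givens(a, b):
    """Return (c, s) such that the rotation maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0  # nothing to rotate
    return a / r, b / r


def apply_givens(c, s, a, b):
    """Apply the 2x2 rotation [[c, s], [-s, c]] to the pair (a, b)."""
    return c * a + s * b, -s * a + c * b


c, s = givens(3.0, 4.0)
rotated = apply_givens(c, s, 3.0, 4.0)  # approximately (5.0, 0.0)
angle = math.atan2(s, c)                # the single number that encodes the rotation
```

Quantizing `angle` instead of the matrix entries keeps the rotation exactly orthogonal after reconstruction, whatever the quantization step.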

Abstract:
In the multi-view multi-label (MVML) classification problem, multiple views are simultaneously associated with multiple semantic representations. Multi-view multi-label learning inevitably faces the problems of consistency, diversity, and non-alignment among views, as well as correlation among labels. Most of the existing multi-view multi-label methods for non-aligned views assume that each view has a common or shared label set, but because a single view cannot contain the entire label information, they often yield suboptimal results. Based on this, this paper proposes a non-aligned multi-view multi-label classification method that learns view-specific labels (LVSL), aiming to explicitly mine the information of view-specific labels and low-rank label structures in non-aligned views in a unified model framework. Furthermore, to alleviate insufficient available label information, we thoroughly explore the global and local structural information among labels. Specifically, first, we assume that there is structural consistency between the view and the label space and then construct the view-specific label model in turn. Second, to enrich the original label space information, we mine the consistent information of multiple views and the low-rank correlation information hidden among multiple labels. Finally, the contribution weight of each view is combined with learning the complementary information among the views in the decision-making stage, and the model is extended to handle nonlinear data. Comparisons with existing state-of-the-art algorithms on several datasets validate the effectiveness of the proposed method.

Abstract:
The recent emergence of light field technology has led to new opportunities for immersive visual communication, which requires high spatial and angular resolution, both of which contribute to a large image storage footprint and high-latency transmission. Task-driven downsampling methods have been proposed as a solution, and have shown improvements in single-image restoration. However, they inevitably disregard the light field's intrinsic properties in the corresponding tasks. In this paper, we propose a light-field-specific task-driven downsampling framework, called LFCrNet. LFCrNet performs learning-based resolution reduction and restoration in an end-to-end manner in order to utilize a cross-view asymmetric sampling technique. In detail, it separates raw data into disparity and non-disparity patterns by measuring pixel-wise residuals between the sub-aperture central view and auxiliary views. Then, a chain of 3-D deformable residual blocks (DRBs) is used to extract disparity features and handle these features individually according to their intrinsic properties. Afterwards, they are compacted into spatio-angular domains through a 3-D deformable downsampler (3-DDS). The non-disparity information is integrated into a separate pipeline that leverages spatial similarity across multiple light field views. This technique is capable of preserving specific occlusion components and subsequently restoring them using a learning-based upscaling method to generate a high-quality reconstruction. Overall, our method shows superior performance on multiple open-source datasets by a significant margin.

Abstract:
Recent literature has developed two advanced tools for image inpainting: appearance propagation and attention matching. However, given the ineffective feature reorganization and vulnerable attention maps, existing works yield suboptimal results with distorted structures and inconsistent contents. Furthermore, we observe that deep sampling layers (DSL) and shallow skip connections (SSC) in U-Net separately promote image structure inference and texture synthesis. To address the above two issues, we devise a W-shaped network (W-Net), which consists of two key components: a texture spatial attention (TSA) module in SSC and a structure channel excitation (SCE) module in DSL. W-Net is a two-stage network, with coarse and refined structures derived at each stage. Meanwhile, the TSA module fills incomplete textures with reliable attention scores under the guidance of coarse structures, which effectively diminishes inconsistency from appearance to semantics. The SCE module rectifies structures according to the difference between coarse structures and refined structures enhanced by texture features. Then the module motivates them to produce more reasonable shapes. Complete textures and refined structures constitute desired inpainted images, as the output of W-Net. Experiments on multiple datasets demonstrate the superior performance of W-Net.

Abstract:
Visual entailment (VE) is the task of recognizing whether the semantics of a hypothesis text can be inferred from a given premise image, and is one special task among recently emerged vision and language understanding tasks. Currently, most of the existing VE approaches are derived from the methods of visual question answering. They recognize visual entailment by quantifying the similarity between the hypothesis and premise in the content semantic features from multiple modalities. Such approaches, however, ignore VE's unique nature of relation inference between the premise and hypothesis. Therefore, in this paper, a new architecture called AlignVE is proposed to solve the visual entailment problem with a relation interaction method. It models the relation between the premise and hypothesis as an alignment matrix. Then it introduces a pooling operation to get feature vectors with a fixed size. Finally, it goes through a fully-connected layer and a normalization layer to complete the classification. Experiments show that our alignment-based architecture reaches 72.45% accuracy on the SNLI-VE dataset, outperforming previous content-based models under the same settings.

Abstract:
Center of pressure (CoP) metrics, including CoP path length and sway area, have been used as gold standard measurements of postural and balance control in biomechanical studies. A recent study of computer-vision-based CoP metrics estimation from 3D body landmark sequences offers a more portable and comprehensive solution than conventional force plate methods to obtain these important metrics for real-time evaluation of balance control. However, obtaining accurate 3D body landmarks requires a calibrated motion capture system or on-body markers, which involves lengthy data collection and processing time and limits their implementation in home and clinical environments. Existing methods that instead use 2D body landmarks fail to adapt to different camera positions. To overcome these challenges, we propose a view-invariant deep learning framework for video-level CoP metrics estimation, including CoP path length and sway area, using pose dimension lifting and graph convolutional network (GCN). This work is the first step toward obtaining gold-standard CoP metrics with an accessible, monocular RGB camera. We propose to use a dimension lifting convolutional neural network (CNN) to obtain view-invariant 3D body landmark features from 2D body landmarks. We also propose a two-stream regression model using GCN and discrete cosine transform (DCT) for a robust CoP metrics estimation. To facilitate the line of research, we release a novel multi-view body landmark dataset containing 2D body landmarks of a wide variety of action patterns from four different camera views with synchronized CoP labels and corresponding 3D body landmarks, which enables cross-view evaluation with different camera angles. We subsequently validate the proposed method through a cross-dataset training by training the dimension lifting model on an existing balance dataset and evaluating the CoP metrics estimation on the multi-view body landmark dataset. 
The experiments validate that our framework achieves state-of-the-art accuracy for both CoP path length and CoP sway area using a monocular RGB camera input for unseen views.
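
For readers unfamiliar with the two target metrics, the following sketch computes them from a recorded CoP trajectory. It is an illustration under stated assumptions: path length is taken as the summed distance between consecutive samples, and sway area is taken as the area of the 95% confidence ellipse, which is one common convention. The abstract does not state which sway-area definition is used, and the function names are hypothetical.

```python
import math


def cop_path_length(points):
    """Total distance travelled by the CoP over consecutive (x, y) samples."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def cop_sway_area_95(points):
    """Area of the 95% confidence ellipse fitted to the CoP samples."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Sample covariance matrix entries.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = max(sxx * syy - sxy ** 2, 0.0)
    # Chi-square quantile with 2 degrees of freedom at 0.95 is -2 * ln(0.05).
    chi2_95 = -2.0 * math.log(0.05)
    return math.pi * chi2_95 * math.sqrt(det)


trajectory = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
length = cop_path_length(trajectory)
area = cop_sway_area_95(trajectory)
```

Both functions assume the samples share one length unit, so the path length is in that unit and the area in its square.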

Abstract:
Temporal sentence grounding in videos is a crucial task in vision-language learning. Its goal is to retrieve, from an untrimmed video, a video segment that semantically corresponds to a natural language query. A video usually contains multiple semantic events, which are rarely isolated. They tend to be temporally ordered and semantically correlated (e.g., some event is often the precursor of another event). To precisely localize a semantic moment in a video, it is critical to effectively extract and aggregate multi-granularity contextual information, including the fine-grained local context around the moment-related video segment (at the snippet level) and coarse-grained semantic correlation (at the segment level). A second main insight of this work is that the above context aggregation should be guided by the queries rather than being fully query-agnostic. Putting the above ideas together, we present a new network that performs language-guided multi-granularity context aggregation. It comprises two major modules. The core of the first module is a novel language-guided temporal adaptive convolution (LTAC) devised to extract fine-grained information over video snippets around the ground-truth video segment. It decomposes a convolution into a channel-oriented one and a temporal-oriented one. In particular, the convolutional channels are supposed to be more susceptible to queries; thus, we learn to generate a dynamic channel-oriented kernel with respect to the querying sentence. As a second module, we propose a language-guided global relation block (LGRB) that extracts video-level context. It augments the contextual feature by using a multi-scale temporal attention that tackles the scale variation of ground-truth video segments, and a multi-modal semantic attention that relies on the syntax of the query.
For the validation purpose, we have conducted comprehensive experiments on two popularly-adopted video benchmarks (i.e., ActivityNet Captions and Charades-STA). All experimental results and ablation studies have clearly corroborated the effectiveness of our model designs, outstripping prior state-of-the-art methods in terms of major performance metrics for the task.

Abstract:
Deep hashing has shown promising performance in large-scale image retrieval. The hashing process utilizes Deep Neural Networks (DNNs) to embed images into compact continuous latent codes, then maps them into binary codes with a hashing function for efficient retrieval. Recent approaches apply a metric loss and a quantization loss in an end-to-end training framework to supervise the two procedures, which respectively cluster samples of the same category and alleviate semantic information loss after binarization. However, we observe a conflict: the optimal cluster positions are not identical to the ideal hash positions because of the different objectives of the two loss terms, which leads to severe ambiguity and erroneous hashing after the binarization process. To address the problem, we borrow the Theory of Minimum-Distance Bounds for Binary Linear Codes to design an inflection point that depends on the hash bit length and the number of categories, and thereby propose the Hashing-guided Hinge Function (HHF), which explicitly enforces the termination of the metric loss to prevent negative pairs from being pushed apart without limit. Such a modification is proven effective and essential for training; it contributes simultaneously to proper intra- and inter-cluster distances and to better hash positions for accurate image retrieval. Extensive experiments on CIFAR-10, CIFAR-100, ImageNet, and MS-COCO demonstrate that HHF consistently outperforms existing techniques and is robust and flexible enough to be transplanted into other methods. Code is available at https://github.com/JerryXu0129/HHF.
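
The idea of terminating the metric loss for negative pairs can be illustrated with a plain hinge on the pair distance. This is only a sketch of the general mechanism. The `inflection` value is passed in as a hypothetical parameter, because the abstract does not give the formula that derives it from the bit length and the number of categories, and the function names are not taken from the released code.

```python
def binarize(code):
    """Map a continuous latent code to a {-1, +1} hash code by its sign."""
    return [1 if v >= 0 else -1 for v in code]


def hamming_distance(a, b):
    """Number of positions at which two hash codes differ."""
    return sum(x != y for x, y in zip(a, b))


def hinged_negative_loss(distance, inflection):
    """Penalize a negative pair only while it is closer than the inflection point.

    Once the pair is at least `inflection` apart the loss is zero, so the
    optimizer stops pushing the pair further away.
    """
    return max(0.0, inflection - distance)


code_a = binarize([0.9, -0.2, 0.4, -0.7])
code_b = binarize([0.8, 0.3, -0.1, -0.6])
d = hamming_distance(code_a, code_b)
loss_near = hinged_negative_loss(d, inflection=3.0)    # pair still too close
loss_far = hinged_negative_loss(4.0, inflection=3.0)   # pair already far enough
```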

Abstract:
Maintaining spatial and temporal consistency in the inpainted area of a video is a challenging problem. Recent research focuses on flow information for synthesizing temporally smooth pixels while neglecting semantic structural coherence across video frames. As a result, it suffers from over-smoothing and shadowy outlines that significantly degrade the inpainted video quality. To overcome this problem, we propose an end-to-end consistent video inpainting model that substantially improves the inpainted video region. The model employs a deep encoder (DE), an axial attention block (AAB), a style transformer, and a decoder to enhance video inpainting with a realistic structure. The DE encodes features effectively, while the AAB recreates all retrieved attributes by merging recoverable multi-scale characteristics with local spatial structures. Then, a novel style transformer with a style manipulation block (SMB) fills the missing area with rich visual elements and temporal coherence. We use two publicly available benchmark datasets to assess the model's performance. Experimental results demonstrate that our method outperforms state-of-the-art methods by a large margin. In addition, an extensive ablation study validates the model's performance.

Abstract:
As an increasingly popular task in multimedia information retrieval, video moment retrieval (VMR) aims to localize the target moment from an untrimmed video according to a given language query. Most previous methods depend heavily on numerous manual annotations (i.e., moment boundaries), which are extremely expensive to acquire in practice. In addition, due to the domain gap between different datasets, directly applying these pre-trained models to an unseen domain leads to a significant performance drop. In this paper, we focus on a novel task: cross-domain VMR, where fully-annotated datasets are available in one domain (“source domain”), but the domain of interest (“target domain”) only contains unannotated datasets. As far as we know, we present the first study on cross-domain VMR. To address this new task, we propose a novel Multi-Modal Cross-Domain Alignment (MMCDA) network to transfer the annotation knowledge from the source domain to the target domain. However, due to the domain discrepancy between the source and target domains and the semantic gap between videos and queries, directly applying trained models to the target domain generally leads to a performance drop. To solve this problem, we develop three novel modules: (i) a domain alignment module is designed to align the feature distributions between different domains of each modality; (ii) a cross-modal alignment module aims to map both video and query features into a joint embedding space and to align the feature distributions between different modalities in the target domain; and (iii) a specific alignment module tries to obtain the fine-grained similarity between a specific frame and the given query for optimal localization. By jointly training these three modules, our MMCDA can learn domain-invariant and semantic-aligned cross-modal representations. 
Extensive experiments on three challenging benchmarks (ActivityNet Captions, Charades-STA and TACoS) illustrate that our cross-domain method MMCDA outperforms all state-of-the-art single-domain methods. Impressively, MMCDA raises the performance by more than 7% in representative cases, which demonstrates its effectiveness.

Abstract:
We present a novel network to transfer the image-language pre-trained model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill the spatio-temporal video features and multi-modal interaction between videos and language from a large-scale video-text dataset. Differently, we leverage the pre-trained image-language model, and simplify it as a two-stage framework including co-learning of image and text, and enhancing temporal relations between video frames and video-text respectively. Specifically, based on the spatial semantics captured by Contrastive Language-Image Pre-training (CLIP) model, our model involves a Temporal Difference Block (TDB) to capture motions at fine temporal video frames, and a Temporal Alignment Block (TAB) to re-align the tokens of video clips and phrases and enhance the cross-modal correlation. These two temporal blocks efficiently realize video-language learning and enable the proposed model to scale well on comparatively small datasets. We conduct extensive experimental studies including ablation studies and comparisons with existing SOTA methods, and our proposed approach outperforms them on the popularly-employed text-to-video and video-to-text retrieval benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.

Abstract:
Data augmentation has become one of the keys to alleviating the over-fitting of models on training data and improving the generalization capabilities on testing data. Most existing data augmentation methods only focus on one modality, which is incapable when facing multiple data modalities. Some prior works try to interpolate with random coefficients in the latent space to generate new samples, which can generically work for any data modality. However, these works ignore the extra information conveyed by multimodality data. In fact, the extra information in one modality can provide semantic directions to generate more meaningful samples in another modality. This paper proposes Cross-modal Data Augmentation (CMDA), a simple yet effective data augmentation method to alleviate the over-fitting issue and improve the generalization performance. We evaluate CMDA on unsupervised and supervised tasks of different modalities, on which CMDA consistently and significantly outperforms baselines. For instance, CMDA improves the unsupervised anomaly detection baseline in vision modality from the AUROC 76.46%, 73.07% and 64.36% to 83.25%, 76.22% and 70.57% on three different datasets, respectively. Besides, extensive experiments demonstrate that CMDA is applicable to various neural network architectures. Furthermore, prior methods that interpolate in the latent space need to work with downstream tasks to construct the latent space. In contrast, CMDA can work with or without downstream tasks, which makes the applicability of CMDA more extensive. The source code is publicly available for non-commercial or research use at https://github.com/Anfeather/CMDA
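
The latent-space interpolation used by the prior works mentioned above can be sketched as follows. This shows the baseline idea only, not CMDA itself, whose cross-modal mechanism is not specified in the abstract. The function name and the uniform sampling of the coefficient are assumptions.

```python
import random


def interpolate_latents(z_a, z_b, rng):
    """Create a new latent sample as a random convex combination of two others."""
    lam = rng.random()  # random coefficient in [0, 1)
    mixed = [lam * a + (1.0 - lam) * b for a, b in zip(z_a, z_b)]
    return mixed, lam


rng = random.Random(0)  # seeded only to make the sketch reproducible
z_new, lam = interpolate_latents([0.0, 2.0, -1.0], [1.0, 4.0, 3.0], rng)
```

Each coordinate of `z_new` lies between the corresponding coordinates of the two inputs, which is why this kind of augmentation applies to any modality that can be encoded into a latent vector.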

Abstract:
In recent years, the Transformer architecture has shown its superiority in the video-based person re-identification task. Inspired by video representation learning, these methods mainly focus on designing modules to extract informative spatial and temporal features. However, they are still limited in extracting local attributes and global identity information, which are critical for the person re-identification task. In this paper, we propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two novel designed proxy embedding modules to address the above issue. Specifically, MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips, respectively, achieving the holistic perception of the input person. We combine the outputs of all the stages for the final identification. In practice, to save the computational cost, the Spatial-Temporal Aggregation (STA) modules are first adopted in each stage to conduct the self-attention operations along the spatial and temporal dimensions separately. We further introduce the Attribute-Aware and Identity-Aware Proxy embedding modules (AAP and IAP) to extract the informative and discriminative feature representations at different stages. All of them are realized by employing newly designed self-attention operations with specific meanings. Moreover, temporal patch shuffling is also introduced to further improve the robustness of the model. Extensive experimental results demonstrate the effectiveness of the proposed modules in extracting the informative and discriminative information from the videos, and illustrate the MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.

Abstract:
Most face-inpainting methods perform well in face repair. However, these methods can only complete a single face image per input. Although existing various image-inpainting methods can achieve pluralistic image inpainting, they typically produce faces with distorted structures or the same texture. To resolve these shortcomings and achieve high-quality diverse face inpainting, we propose PFTANet, a two-stage pluralistic face-inpainting network that transforms attribute information. In the first stage, the face-parsing network is fine-tuned to obtain semantic facial region information. In the second stage, a generator consisting of SNBlock, CF_ShiftBlocks, and CF_MergeBlock, which ensures that high-quality pluralistic face results are generated, is used. Specifically, CF_ShiftBlocks completes pluralistic face generation by transforming the attribute information from the conditional face extracted by the attribute extractor and ensuring the consistency of the attribute information between the conditional and generated faces. CF_MergeBlock ensures structural consistency between the masked and background regions of the generated face using facial region semantic information. A multi-patch discriminator is used to enhance facial detail generation. Experimental results for the CelebA and CelebA-HQ datasets indicated that PFTANet achieved pluralistic and visually realistic face inpainting.

Abstract:
There has been significant success in recent image-to-image translation (I2I) approaches in translating the source image into the style of the target image. Existing techniques rely on the disentanglement of content and style representations, requiring a two-stage style mapping process: Reference images are used to extract style vectors, which are subsequently remapped into the translated images. However, when the target domain contains a variety of styles, such a two-stage style mapping cannot guarantee the translated image be style consistent with its guided reference image. In this work, we propose to explicitly employ metric learning to enhance the two-stage style mapping in style-guided image translation. The distance between deep features Gram matrices is utilized to construct the visual style metric as self-supervised similarity labels, guiding the embedding of style vectors using triplet loss with adaptive margins in the first stage. Furthermore, in the second stage, we consider generated images and their corresponding reference images as positive samples and anchors for each other, while the nearest negative sample is used to construct the triplet loss in the proposed metric space. The proposed learning algorithms can be applied to any I2I framework that uses disentangled representations without modifying the original network architectures. We evaluate the proposed method on three representative I2I translation baselines. Both qualitative and quantitative results demonstrate that the proposed approach enhances style alignment in style-guided translation compared to the baselines.
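
The Gram-matrix style distance and the triplet loss mentioned above can be sketched as follows. This is an illustration under assumptions: the Gram matrix is normalized by the number of spatial positions, the distance is the Frobenius norm of the difference, and the margin is passed in as a plain argument because the abstract does not say how the adaptive margin is computed. All function names are hypothetical.

```python
import math


def gram_matrix(features):
    """Channel-by-channel correlations of a feature map.

    `features` is a list of C channels, each a list of N spatial activations.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def style_distance(feat_a, feat_b):
    """Frobenius distance between the Gram matrices of two feature maps."""
    ga, gb = gram_matrix(feat_a), gram_matrix(feat_b)
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(ga, gb)
                         for x, y in zip(row_a, row_b)))


def triplet_loss(anchor, positive, negative, margin):
    """Standard triplet loss on style vectors with a caller-supplied margin."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


feat = [[1.0, 0.0], [0.0, 1.0]]
g = gram_matrix(feat)
violated = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=1.0)
satisfied = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0], margin=1.0)
```

In a self-supervised setup of this kind, `style_distance` between reference images would decide which samples count as positives or negatives for a given anchor.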

Abstract:
Skeleton-based action recognition has been substantially driven by the development of artificial intelligence technology and depth sensors. Recently, graph convolutional networks (GCNs) have achieved excellent performance in skeleton-based action recognition. However, the performance of GCN-based methods is impaired by an inappropriate node partitioning strategy and obstructed long-range information flow. To solve these issues, a novel Select-Assemble-Normalize Graph Convolution Network (SAN-GCN) is proposed to model the spatio-temporal features of the skeleton. First, all skeleton joints are selected as root nodes, and the neighborhoods of the root joints are assembled and normalized according to the body structure, which explicitly and interpretably expresses the spatial geometric relations of the skeleton joints. Second, we propose an attention-based assembly and normalization strategy to adaptively capture non-local joints. The adaptive assembly and normalization can avoid the dilution of key long-range features. Moreover, a bi-level aggregation strategy is introduced to learn spatio-temporal dependencies of joints, where the low-level aggregation aligns the normalized neighborhood graphs, and the high-level aggregation aggregates the features of neighbor nodes by a standard convolution kernel. In high-level aggregation, it is convenient to realize factorized spatio-temporal aggregation or unified spatio-temporal aggregation. Extensive experiments on four datasets with different numbers of action patterns demonstrate that our model achieves performance comparable to that of state-of-the-art works.

Abstract:
Massive Multiple-Input Multiple-Output (MIMO), with its spatial multiplexing and channel hardening, has the potential to provide high capacity and reliability for massive video services. In spite of this, massive MIMO in the physical layer does not fully showcase its abilities in video applications unless it is specifically designed to do so. In this paper, we consider a cell-free massive MIMO system as an edge node for scheduling massive streams, and thus the standard server-to-client transmission is split into server-to-edge and edge-to-client. When the server-to-edge transmission is ideal, edge schedules massive streams to alleviate user conflicts in cell-free massive MIMO. Moreover, we propose a novel edge-to-client grouping algorithm for assigning the streams with severe interference to different time slots. The proposed algorithm achieves an innovative. 7-approximation ratio while keeping the group sizes within a certain range. When the server-to-edge transmission is non-ideal, we design a special transmission framework called Aggressive, wherein the server sends the next video chunk when the current chunk reaches the edge rather than the client. Thus, the proposed framework saves considerable server-to-edge latency compared with the traditional framework. Simulation results show that the proposed user grouping algorithm improves the achievable rate by around 18% and the Aggressive framework improves the average bitrate.

Abstract:
As real-world data become increasingly heterogeneous, multi-view semi-supervised learning has garnered widespread attention. Although existing studies have made efforts towards this and achieved decent performance, they are restricted to shallow models and how to mine deeper information from multiple views remains to be investigated. As a recently emerged neural network, Graph Convolutional Network (GCN) exploits graph structure to propagate label signals and has achieved encouraging performance, and it has been widely employed in various fields. Nonetheless, research on solving multi-view learning problems via GCN is limited and lacks interpretability. To address this gap, in this paper we propose a framework termed Interpretable Multi-view Graph Convolutional Network (IMvGCN). We first combine the reconstruction error and Laplacian embedding to formulate a multi-view learning problem that explores the original space from feature and topology perspectives. In light of a series of derivations, we establish a potential connection between GCN and multi-view learning, which holds significance for both domains. Furthermore, we propose an orthogonal normalization method to guarantee the mathematical connection, which solves the intractable problem of orthogonal constraints in deep learning. In addition, the proposed framework is applied to the multi-view semi-supervised learning task. Comprehensive experiments demonstrate the superiority of our proposed method over other state-of-the-art methods.

Abstract:
For pursuing accurate skeleton-based action recognition, most prior methods combine Graph Convolution Networks (GCNs) with attention-based methods in a serial way. However, they regard the human skeleton as a complete graph, resulting in less variations between different actions (e.g., the connection between the elbow and head in action “clapping hands”). For this, we propose a novel Contrastive GCN-Transformer Network (ConGT) which fuses the spatial and temporal modules in a parallel way. The ConGT involves two parallel streams: Spatial-Temporal Graph Convolution stream (STG) and Spatial-Temporal Transformer stream (STT). STG is designed to obtain action representations maintaining the natural topology structure of the human skeleton. STT is devised to acquire action representations containing the global relationships among joints. Since the action representations produced from these two streams contain different characteristics, and each of them knows little information of the other, we introduce the contrastive learning paradigm to guide their output representations of the same sample to be as close as possible in a self-supervised manner. Through the contrastive learning, they can learn information from each other to enrich the action features by maximizing the mutual information between the two types of action representations. To further improve action recognition accuracy, we introduce the Cyclical Focal Loss (CFL) which can focus on confident training samples in early training epochs, with an increasing focus on hard samples during the middle epochs. We conduct experiments on three benchmark datasets, which demonstrate that our model achieves state-of-the-art performance in action recognition.

Abstract:
Target visual navigation aims at controlling the agent to find a target object based on a monocular visual RGB image in each step. It is crucial for the agent to adapt to new environments. As target visual navigation is a complex task, understanding the behavior of the agent is beneficial for analyzing the reasons for failure. This work focuses on improving the readability and success rate of navigation policies. In this paper, we propose a framework named Skill-based Hierarchical Reinforcement Learning (SHRL) for target visual navigation. SHRL contains a high-level policy and three low-level skills. The high-level policy accomplishes the task by utilizing or stopping low-level skills at each step. Low-level skills are designed to separately solve three sub-tasks, i.e., Search, Adjustment, and Exploration. In addition, we propose an Abstract Representation and two penalty items to feed robust features to the high-level policy. Abstract Representation is designed to focus on selecting low-level skills rather than the details of navigation. Experimental results in the artificial environment AI2-Thor indicate that the proposed method outperforms state-of-the-art by a large margin in unseen indoor environments. Moreover, we also provide case studies to illustrate the advantages of SHRL.

Abstract:
Deep learning-based models have achieved remarkable performance in video super-resolution (VSR) in recent years, but most of these models are less applicable to online video applications. These methods solely consider the distortion quality and ignore crucial requirements for online applications, e.g., low latency and low model complexity. In this paper, we focus on online video transmission in which VSR algorithms are required to generate high-resolution video sequences frame by frame in real time. To address such challenges, we propose an extremely low-latency VSR algorithm based on a novel kernel knowledge transfer method, named the convolutional kernel bypass graft (CKBG). First, we design a lightweight network structure that does not require future frames as inputs and saves extra time for caching these frames. Then, our proposed CKBG method enhances this lightweight base model by bypassing the original network with “kernel grafts,” which are extra convolutional kernels containing the prior knowledge of the external pretrained image SR models. During the testing phase, we further accelerate the grafted multibranch network by converting it into a simple single-path structure. The experimental results show that our proposed method can process online video sequences up to 110 FPS with very low model complexity and competitive SR performance.
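
Converting a multibranch network into a single path at test time typically relies on the linearity of convolution: parallel branches that see the same input and are summed can be replaced by one branch whose kernel is the sum of the branch kernels. The 1-D sketch below illustrates that general property. Whether CKBG merges its kernel grafts in exactly this way is an assumption, and the function names are hypothetical.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation of a signal with a kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def multibranch(signal, kernels):
    """Run every branch separately and sum the outputs (training-time form)."""
    outputs = [conv1d(signal, kernel) for kernel in kernels]
    return [sum(values) for values in zip(*outputs)]


def merge_branches(kernels):
    """Fold same-size parallel kernels into one kernel (test-time form)."""
    return [sum(weights) for weights in zip(*kernels)]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
base_kernel = [1.0, 0.0, -1.0]
graft_kernel = [0.5, 0.5, 0.5]  # stands in for a hypothetical "kernel graft"

two_path = multibranch(signal, [base_kernel, graft_kernel])
single_path = conv1d(signal, merge_branches([base_kernel, graft_kernel]))
```

The two outputs agree, but the merged form performs one convolution instead of two, which is the source of the test-time speed-up.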

Abstract:
Multimodal image fusion is becoming an urgent need in multi-sensor information utilization. However, existing end-to-end image fusion frameworks ignore the integration of a priori knowledge and long-distance dependencies across domains, which poses challenges for network convergence and global image perception in complex scenes. In this article, a conditional generative adversarial network with transformer (TCGAN) is proposed for multimodal image fusion. The generator produces a fused image containing the content of the source images. The discriminators are adopted to distinguish the differences between the fused image and the source images. Adversarial training makes the final fused image simultaneously maintain the structural and textural details of the cross-modal images. In particular, a wavelet fusion module makes the inputs contain as much image content from different domains as possible. The extracted convolutional features interact in the multiscale cross-modal transformer fusion module to fully complement the associated information, which makes the generator focus on both local and global context. TCGAN fully considers the training efficiency of the adversarial process and the integrated retention of redundant information. Experimental results on public datasets show that TCGAN yields highlighted targets, rich details, and fast convergence.

Abstract:
The central idea of contrastive learning is to discriminate between different instances and force different views from the same instance to share the same representation. To avoid trivial solutions, augmentation plays an important role in generating different views, among which random cropping is shown to be effective for the model to learn a generalized and robust representation. Commonly used random crop operation keeps the distribution of the difference between two views unchanged along the training process. In this work, we show that adaptively controlling the disparity between two augmented views along the training process enhances the quality of the learned representations. Specifically, we present a parametric cubic cropping operation, ParamCrop, for video contrastive learning, which automatically crops a 3D cubic by differentiable 3D affine transformations. ParamCrop is trained simultaneously with the video backbone using an adversarial objective, so that it learns to increase the contrastive loss and thus gradually reduces the shared contents between two cropped views. Experiments show that this adaptive and gradual increase in the disparity yielded by ParamCrop is beneficial to learning a strong and generalized representation for downstream tasks, which is shown to be effective on multiple contrastive learning frameworks and video backbones.

Abstract:
Recently, vision transformers (ViTs) have been investigated in fine-grained visual recognition (FGVC) and are now considered state of the art. However, most ViT-based works ignore the different learning performances of the heads in the multi-head self-attention (MHSA) mechanism and its layers. To address these issues, in this paper, we propose a novel internal ensemble learning transformer (IELT) for FGVC. The proposed IELT involves three main modules: multi-head voting (MHV) module, cross-layer refinement (CLR) module, and dynamic selection (DS) module. To solve the problem of the inconsistent performances of multiple heads, we propose the MHV module, which considers all of the heads in each layer as weak learners and votes for tokens of discriminative regions as cross-layer feature based on the attention maps and spatial relationships. To effectively mine the cross-layer feature and suppress the noise, the CLR module is proposed, where the refined feature is extracted and the assist logits operation is developed for the final prediction. In addition, a newly designed DS module adjusts the token selection number at each layer by weighting their contributions of the refined feature. In this way, the idea of ensemble learning is combined with the ViT to improve fine-grained feature representation. The experiments demonstrate that our method achieves competitive results compared with the state of the art on five popular FGVC datasets.

Abstract:
Recently, unsupervised domain adaptation, which can transfer knowledge from source domains to target ones, has been introduced to text image recognition tasks to address the serious domain shift problem. However, in unsupervised domain adaptation for text recognition, there is no label information in the target domain to supervise the domain adaptation, especially at the character level. Several existing methods regard a text image as a whole and perform only global feature adaptation, neglecting local-level feature adaptation, i.e., characters. Other methods focus only on word-level feature alignment while ignoring the categories of local-level characters. To address these issues, we propose a text recognition model via Dual adaptatiOn and Clustering, DOC for short. At the word level, we construct a Global Discriminator for global feature adaptation to reduce text layout bias between source and target domains. At the character level, we propose an Adaptive Feature Clustering (AFC) module, which can extract invariant character features through a local-level discriminator for adaptation. Moreover, it enhances the local-feature adaptation by a clustering scheme, which evaluates the feature adaptation by leveraging the knowledge from the source domain as much as possible. In this way, it can pay more attention to the differences in fine-grained characters. Extensive experiments on benchmark datasets demonstrate that our framework can achieve state-of-the-art performance.

Abstract:
In recent years, Cross-Modal Hashing (CMH) has attracted much attention due to its fast query speed and efficient storage. Previous studies have achieved promising results for Cross-Modal Retrieval (CMR) by discovering discriminative hash codes and modality-specific hash functions. Nonetheless, most existing CMR works are subjected to some restrictions: 1) It is assumed that data of different modalities are fully paired, which is impractical in real applications due to sample missing and false data alignment, and 2) binary regression targets including the label matrix and binary codes are too rigid to effectively learn semantic-preserving hash codes and hash functions. To address these problems, this paper proposes an Adaptive Marginalized Semantic Hashing (AMSH) method which not only enhances the discrimination of latent representations and hash codes by adaptive margins, but can also be used for both paired and unpaired CMR. As a two-step method, in the first step, AMSH generates semantic-aware modality-specific latent representations with adaptively marginalized labels, thereby enlarging the distances between different classes, and exploiting the labels to preserve the inter-modal and intra-modal semantic similarities into latent representations and hash codes. In the second step, adaptive margin matrices are embedded into the hash codes, and enlarge the gaps between positive and negative bits, which improves the discrimination and robustness of hash functions. On this basis, AMSH generates similarity-preserving hash codes and robust hash functions without the strict one-to-one data correspondence requirement. Experiments are conducted on several benchmark datasets to demonstrate the superiority and flexibility of AMSH over some state-of-the-art CMR methods.

Abstract:
Fashion compatibility modeling, which is used to estimate the matching degree of a given set of fashion items, has received increasing attention in recent years. However, existing studies often fail to fully leverage multimodal information or ignore the semantic guidance of clothing categories in elevating the reliability of multimodal information. In this paper, we propose a fashion compatibility modeling approach with a category-aware multimodal attention network, termed as FCM-CMAN. In FCM-CMAN, we focus on enriching and aggregating multimodal representations of fashion items by means of the dynamic representations of categories and a contextual attention mechanism simultaneously. Specifically, considering that category correlations are always dynamic and varied for different fashion items, we design a categorical dynamic graph convolutional network to adaptively learn the semantic correlations between categories. When combined with the multi-layered visual outputs of a convolutional neural network and the surrounding contextual information, multiple content-aware category representations and context-aware attention weights are obtained to better characterize fashion items from different aspects. On this basis, two pieces of aware information are integrated by a multimodal factorized bilinear pooling strategy to generate visual-semantic embeddings, which are further improved by a multi-head self-attention mechanism to capture significant elements related to fashion compatibility. Extensive experiments conducted on the FashionVC and ExpFashion datasets demonstrate the superiority of FCM-CMAN over state-of-the-art methods.

Abstract:
Due to the usage of global similarity, the hashing methods based on predefined hash centers have achieved more accurate retrieval results than the pairwise/triplet-based methods. Nevertheless, the fixed hash centers lack the perception of data distribution and are limited by the pre-determined Hadamard matrix, which consider neither the label semantic information nor the object scale size, resulting in sub-optimal retrieval performance and weak generalization ability. In this article, we (1) adopt the label semantic information to generate self-adaptive hash centers and (2) propose the label-affinity coefficient (lac) that considers the scale size of each label/object appearing in the given image to calculate the real hash centroid for this image. Based on this, we propose Label-affinity Self-adaptive Central Similarity Hashing (LSCSH) for image retrieval. LSCSH consists of a hash code generator module and a hash center adapter module. First, we obtain the label word vector (i.e., the word vector representation of each class label) via the Word2Vector technique to generate and update the hash centers that adapt to the distribution of both label word vectors and generated hash codes. Second, we learn lac to indicate the dominance of different labels corresponding to objects in each given image, which considers the unequal scales of each object (corresponding to a label) to calculate a more accurate hash centroid for each image. Last but not least, we design an asynchronous learning mechanism to enable each hash code and its corresponding hash centroid to adapt to each other dynamically. We conduct extensive experiments on 5 image datasets including CIFAR-10, ImageNet, VOC2012, MS-COCO and NUS-WIDE. The experimental results demonstrate that LSCSH can achieve the state-of-the-art visual retrieval performance on both single-label and multi-label image datasets.
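
As a rough illustration of the hash-center idea discussed above, the following minimal Python sketch builds fixed hash centers from the rows of a Sylvester-type Hadamard matrix (the baseline the abstract criticizes) and then forms a per-image centroid as a weighted combination of the centers of the labels present in the image. The function names and the constant weights are hypothetical: in LSCSH the label-affinity coefficients are learned and the centers are self-adaptive, and the abstract does not give the exact centroid formula, so this is only a sketch under those assumptions.

```python
from typing import List, Sequence


def hadamard(n: int) -> List[List[int]]:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of 2)."""
    if n < 1 or n & (n - 1) != 0:
        raise ValueError("n must be a positive power of 2")
    h = [[1]]
    while len(h) < n:
        # [[H, H], [H, -H]] doubles the size and keeps the rows orthogonal.
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def weighted_centroid(centers: Sequence[Sequence[int]],
                      weights: Sequence[float]) -> List[int]:
    """Binarize a weighted sum of label centers (weights are assumed, not learned)."""
    length = len(centers[0])
    sums = [sum(w * c[i] for c, w in zip(centers, weights)) for i in range(length)]
    return [1 if s >= 0 else -1 for s in sums]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where two +/-1 codes differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


centers = hadamard(4)
# An image whose dominant object has label 1 and whose minor object has label 2.
centroid = weighted_centroid([centers[1], centers[2]], [0.8, 0.2])
```

With a large weight on the dominant label, the centroid stays close to that label's center, which is the intuition behind weighting labels by object scale.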

Abstract:
Point clouds are densely distributed 3D (three-dimensional) data, and annotating them is time-consuming and labor-intensive. Existing semantic segmentation work adopts few-shot learning to reduce the dependence on labeled samples while improving the generalization of the model to new categories. Since point clouds are 3D structures with rich geometric features, even objects of the same category have feature differences that cannot be ignored. Therefore, the few samples (support set) used to train the model do not cover all the features of a category, and there is a distribution difference between the support samples and the samples used to verify the model performance (query set). In this paper, we propose an efficient point cloud few-shot segmentation method based on prototypes for bias rectification. A prototype is a vector representation of a category in the metric space. To make the prototype representation of the support set closer to the query set features, we define a feature bias term and reduce the distribution distance between the two sets by fusing the support set features and the bias term. On this basis, we design a feature cross-reference module. By mining the co-occurring features of the support and query sets, it can generate a more representative prototype which captures the overall features of the point cloud. Extensive experiments on two challenging datasets demonstrate that our method outperforms the state-of-the-art method by an average of 3.31% in several N-way K-shot tasks, and achieves approximately 200 times faster inference speed.
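
To make the notion of a prototype concrete, the sketch below shows plain prototype matching in Python: each class prototype is the mean of its support features, and a query point is assigned to the class whose prototype has the highest cosine similarity. This is only the generic baseline, with hypothetical function names and toy 2D features; the bias term and the cross-reference module proposed in the paper are not reproduced here because the abstract does not specify them.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def build_prototypes(support: Dict[str, List[List[float]]]) -> Dict[str, List[float]]:
    """One prototype per class: the mean of that class's support features."""
    return {label: mean_vector(features) for label, features in support.items()}


def classify(point: Sequence[float], prototypes: Dict[str, List[float]]) -> str:
    """Assign a query point feature to the most similar prototype."""
    return max(prototypes, key=lambda label: cosine(point, prototypes[label]))


# Toy support set with two classes and two support features each.
prototypes = build_prototypes({
    "chair": [[1.0, 0.0], [0.8, 0.2]],
    "table": [[0.0, 1.0], [0.2, 0.8]],
})
```

If the query features are distributed differently from the support features, these mean prototypes can sit far from the query points they should match, which is the gap the bias rectification aims to reduce.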

Abstract:
Image-text retrieval (ITR) is a challenging task in the field of multimodal information processing due to the semantic gap between different modalities. In recent years, researchers have made great progress in exploring the accurate alignment between image and text. However, existing works mainly focus on the fine-grained alignment between image regions and sentence fragments, which ignores the guiding significance of context background information. Actually, integrating the local fine-grained information and global context background information can provide more semantic clues for retrieval. In this paper, we propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval. First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively. Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module, which enhances the semantic corresponding relations between the local and global information, and obtains more accurate feature representations for the image and text modalities. Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment. To justify the proposed model, we perform extensive experiments on MS-COCO and Flickr30K datasets. Experimental results show that the proposed HGAN outperforms the state-of-the-art methods on both datasets, which demonstrates the effectiveness and superiority of our model.

Abstract:
Deep multi-view representation learning focuses on training a unified low-dimensional representation for data with multiple sources or modalities. With the rapidly growing attention of graph neural networks, more and more researchers have introduced various graph models into multi-view learning. Although considerable achievements have been made, most existing methods usually propagate information in a single view and fuse multiple information only from the perspective of attributes or relationships. To solve the aforementioned problems, we propose an efficient model termed Dual Fusion-Propagation Graph Neural Network (DFP-GNN) and apply it to deep multi-view clustering tasks. The proposed method is designed with three submodules and has the following merits: a) The proposed view-specific and cross-view propagation modules can capture the consistency and complementarity information among multiple views; b) The designed fusion module performs multi-view information fusion with the attributes of nodes and the relationships among them simultaneously. Experiments on popular databases show that DFP-GNN achieves significant results compared with several state-of-the-art algorithms.

Abstract:
It is valuable and promising to remove post-processing non-maximum suppression (NMS) for object detectors, making detectors simpler and purely end-to-end. Removing NMS is possible if the object detector can identify only one positive sample for prediction for each ground-truth object instance in an image. In this work, we propose a compact and plug-in head, named PSS head, which can be attached to any one-stage detector to make it NMS-free. Specifically, the PSS head works by automatically selecting a positive sample for each instance to be detected, so that the detectors with our PSS head can directly remove NMS. The success of our PSS head lies in three aspects, namely one-to-one label assignment, stop-gradient operation for eliminating optimization conflicts, and the pss loss and ranking loss specifically designed for the PSS head. Experiments on the COCO dataset demonstrate the effectiveness of our method. In particular, when compared with state-of-the-art NMS-free methods, our VFNETPSS (attaching PSS head to VFNET) achieves 44.0% mAP, which exceeds the 41.5% mAP of DeFCN by a large margin. When taking Res2Net-101-DCN as backbone network, our VFNETPSS achieves 50.3% mAP on the COCO test set, which is a promising performance even among NMS-based methods.
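
For context, the post-processing step that an NMS-free detector removes can be sketched as the standard greedy NMS below. This is a generic textbook version in plain Python, not code from the paper; the function names and the default threshold of 0.5 are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes: Sequence[Box], scores: Sequence[float],
               iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 1.0, 11.0, 11.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

A detector that emits exactly one confident prediction per object has no near-duplicate boxes for this loop to suppress, which is why one-to-one label assignment makes the step unnecessary.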

Abstract:
Session-based recommendation (SBR) aims to predict the next item at a certain time point based on anonymous user behavior sequences. Existing methods typically model session representation based on simple item transition information. However, since session-based data consists of users' limited short-term interactions, modeling session representation by capturing fixed item transition information from a single dimension suffers from data sparsity. In this paper, we propose a novel contrastive multi-level graph neural network (CM-GNN) to better exploit complex and high-order item transition information. Specifically, CM-GNN applies a local-level graph convolutional network (L-GCN) and a global-level graph convolutional network (G-GCN) on the current session and all the sessions respectively, to effectively capture pairwise relations over all the sessions by an aggregation strategy. Meanwhile, CM-GNN applies a hyper-level graph convolutional network (H-GCN) to capture high-order information among all the item transitions. CM-GNN further introduces an attention-based fusion module to learn pairwise relation-based session representation by fusing the item representations generated by L-GCN and G-GCN. CM-GNN averages the item representations obtained by H-GCN to obtain high-order relation-based session representation. Moreover, to convert the high-order item transition information into the pairwise relation-based session representation, CM-GNN maximizes the mutual information between the representations derived from the fusion module and the average pool layer via a contrastive learning paradigm. We conduct extensive experiments on several widely used benchmark datasets to validate the efficacy of the proposed method. The encouraging results demonstrate that our proposed method outperforms the state-of-the-art SBR techniques.

Abstract:
Referring Expression Comprehension (REC) aims to locate the target object in the image according to a referring expression. This is a challenging task owing to the need for understanding both natural language and visual information and interpretable reasoning between them. Most existing implicit reasoning-based REC methods lack interpretability, while explicit reasoning-based REC methods have lower accuracy. To achieve competitive accuracy while providing adequate interpretability, in this work, we propose a novel explicit reasoning-based method named InterREC. First, in order to address the challenge of multi-modal understanding, we design two neural network modules based on text-image representation learning: a Text-Region Matching Module to align objects in the image and noun phrases in the expression, and a Text-Relation Matching Module to align relations between objects in the image and relational phrases in the expression. Additionally, we design a Reasoning Order Tree for handling complex expressions, which can reduce complex expressions to multiple object-relation-object triplets and therefore identify the inference order and reduce the difficulty of reasoning. At the same time, to achieve an interpretable reasoning step, we design a Bayesian Network-based explicit reasoning method. Based on the comparative evaluation on various datasets, our method achieves higher accuracy than existing explicit reasoning-based REC methods, and the visualization results demonstrate the method's high interpretability.

Abstract:
The amount of multi-modal data available on the Internet is enormous. Cross-modal hash retrieval maps heterogeneous cross-modal data into a single Hamming space to offer fast and flexible retrieval services. However, existing cross-modal methods mainly rely on the feature-level similarity between multi-modal data and ignore the relationship between relative rankings and label-level fine-grained similarity of neighboring instances. To overcome these issues, we propose a novel Deep Cross-modal Hashing based on Semantic Consistent Ranking (DCH-SCR) that comprehensively investigates the intra-modal semantic similarity relationship. Firstly, to the best of our knowledge, it is an early attempt to preserve semantic similarity for cross-modal hashing retrieval by combining label-level and feature-level information. Secondly, the inherent gap between modalities is narrowed by developing a ranking alignment loss function. Thirdly, the compact and efficient hash codes are optimized based on the common semantic space. Finally, we use the gradient to specify the optimization direction and introduce the Normalized Discounted Cumulative Gain (NDCG) to achieve varying optimization strengths for data pairs with different similarities. Extensive experiments on three real-world image-text retrieval datasets demonstrate the superiority of DCH-SCR over several state-of-the-art cross-modal retrieval methods.
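
Since the method relies on the Normalized Discounted Cumulative Gain (NDCG), a minimal Python sketch of that measure may be helpful. It uses the common exponential-gain variant with a log2 position discount; the function names are hypothetical, and treating graded relevance as, for example, the number of labels shared with the query is an assumption for illustration. How DCH-SCR ties NDCG to its gradients is not described in the abstract and is not shown here.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with exponential gain and log2 discount."""
    return sum((2.0 ** rel - 1.0) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg(relevances: Sequence[float]) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Graded relevance of retrieved items, in ranked order.
perfect = ndcg([3, 2, 1])   # items already sorted by relevance
reversed_ = ndcg([1, 2, 3])  # most relevant item ranked last
```

Because the discount grows with rank, misplacing a highly relevant item costs more than misplacing a weakly relevant one, which is what allows pairs with different similarities to receive different optimization strengths.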

Abstract:
Given a 2D image query and a pool of 3D objects, the goal of image-object retrieval is to rank the 3D objects according to how well their content fits the query. Previous methods usually project 2D images and 3D objects into a joint embedding space and minimize the distance metric to complete the retrieval task. Since 2D images and 3D objects come from two different domains with large discrepancy, even when 3D objects and 2D images are mapped to a shared space, the gap in feature distribution remains significant, which always leads to domain misalignment. In this work, we propose a novel image-object retrieval method by leveraging optimal transport theory. Specifically, to tackle the dimensionality gap between 2D images and 3D objects, we first represent a 3D object via a sequence of its 2D projections. We then design a Cross-Domain View Attention module (CDVA) to automatically compute the optimal combination of 3D object projections given a 2D query image. Next, we exploit Weighted Optimal Transport (WOT)-based distance to depict the discrepancy between 2D images and 3D objects, and reduce the discrepancy to achieve instance-level alignment. Through this scheme, the transported 2D images and 3D objects with the same label are enforced to follow similar distributions. Finally, we design an explicit Category Centroid Alignment module (CCA) to achieve class-level alignment to improve the retrieval performance. Extensive experiments show that our method can achieve competitive performance on the MI3DOR and MI3DOR-2 benchmarks.

Abstract:
Most image retrieval works aim at learning discriminative visual features, while little attention is paid to the retrieval efficiency. The speed of feature extraction is key to real-world systems. Therefore, in this article, we focus on network pruning for image retrieval acceleration. Different from the classification models predicting discrete categories, image retrieval models usually extract continuous features for retrieval, which are more sensitive to network pruning. Such different characteristics of the retrieval and classification models make the traditional pruning method sub-optimal for image retrieval acceleration. Two points are critical for pruning image retrieval models: preserving the local geometry structure of filters and maintaining the model capacity during pruning. In view of the above considerations, we propose a Progressive Local Filter Pruning (PLFP) method. Specifically, we analyze the local geometry of filter distribution in every layer and select redundant filters according to one new criterion that the filter can be replaced locally by other similar filters. Furthermore, to preserve the model capacity of the original model, the proposed method progressively prunes the filters by decreasing the scale of filter weights gradually. We evaluate our method on four scene retrieval datasets, i.e., Oxford5K, Oxford105K, Paris6K, and Paris106K, and one person re-identification dataset, i.e., Market-1501. Extensive experiments show that the proposed method (1) preserves the original model capacity while pruning and (2) achieves superior performance to other widely-used pruning methods.

Abstract:
Recently, we have observed an exponential increase of user-generated content (UGC) videos. The distinguished characteristic of UGC videos originates from the video production and delivery chain, as they are usually acquired and processed by non-professional users before uploading to the hosting platforms for sharing. As such, these videos usually undergo multiple distortion stages that may affect visual quality before ultimately being viewed. Inspired by the increasing consensus that the optimization of the video coding and processing shall be fully driven by the perceptual quality, in this paper, we propose to study the quality of the UGC videos from both objective and subjective perspectives. We first construct a UGC video quality assessment (VQA) database, aiming to provide useful guidance for the UGC video coding and processing in the hosting platform. The database contains source UGC videos uploaded to the platform and their transcoded versions that are ultimately enjoyed by end-users, along with their subjective scores. Furthermore, we develop an objective quality assessment algorithm that automatically evaluates the quality of the transcoded videos based on the corrupted reference, which is in accordance with the application scenarios of UGC video sharing in the hosting platforms. The information from the corrupted reference is well leveraged and the quality is predicted based on the inferred quality maps with deep neural networks (DNN). Experimental results show that the proposed method yields superior performance. Both subjective and objective evaluations of the UGC videos also shed light on the design of perceptual UGC video coding.

Abstract:
Due to the widespread popularity of social media, researchers have developed a strong interest in learning the personalized image aesthetics of online users. Personalized image aesthetics assessment (PIAA) aims to study the aesthetic preferences of individual users for images, which should be affected by the properties of both users and images. Existing PIAA approaches usually use the generic aesthetics learned from images as a prior model and adapt it to PIAA models through a small number of data annotated by individual users. However, the prior model merely learns the objective attributes of images, which is agnostic to the subjective attributes of users, complicating efficient learning of the personalized image aesthetics of individual users. Therefore, we propose a personalized image aesthetics assessment method that integrates the subjective attributes of users and objective attributes of images simultaneously. To characterize these two attributes jointly, an attribute extraction module is introduced to learn users’ personality traits and image aesthetic attributes. Then, an aesthetic prior model is built from numerous individual users’ annotated data, which leverages the personality traits of users and the aesthetic attributes of rated images as prior knowledge to model both the image aesthetic distribution and users’ residual scores relative to generic aesthetics simultaneously. Finally, a PIAA model is obtained by fine-tuning the aesthetic prior model with an individual user’s annotated data. Experiments demonstrate that the proposed method is superior to existing PIAA methods in learning individual users’ personalized image aesthetics.

Abstract:
Recent adversarial attack works attempt to improve the transferability by applying various differentiable transformations on input images. Considering the differentiable transformations and the original model together as a new model, these methods can be regarded as model augmentation that effectively derives an ensemble of models from the single original model. Despite their impressive performance, the model augmentation policies used in these methods are manually designed by experimental attempts, leaving the design of model augmentation policy an open question. In this paper, we propose an Automatic Model Augmentation (AutoMA) approach to find a strong model augmentation policy for transferable adversarial attacks. Specifically, we design a discrete search space that contains various differentiable transformations with different parameters and adopt reinforcement learning to search for the strong augmentation policy. The sampled augmentation policies together with the rewards they obtain during the searching process reveal several valuable observations for designing more powerful attacks using model augmentation policy: 1) augmentation transformations on color space are less effective; 2) the transformation type diversity matters; and 3) small distortion should be used for geometric transformations and larger distortion for intensity transformations. Extensive experiments show that the augmentation policy found by AutoMA achieves better performance than existing manually designed policies in a wide range of cases.

Abstract:
The low-rank matrix completion has gained rapidly increasing attention from researchers in recent years for its efficient recovery of the matrix in various fields. Numerous studies have exploited the popular neural networks to yield low-rank outputs under the framework of low-rank matrix factorization. However, due to the discontinuity and nonconvexity of rank function, it is difficult to directly optimize the rank function via back propagation. Although a large number of studies have attempted to find relaxations of the rank function, e.g., Schatten-p norm, they still face the following issues when updating parameters via back propagation: 1) These methods or surrogate functions are still non-differentiable, bringing obstacles to deriving the gradients of trainable variables. 2) Most of these surrogate functions perform singular value decomposition upon the original matrix at each iteration, which is time-consuming and blocks the propagation of gradients. To address these problems, in this paper, we develop an efficient block-wise model dubbed differentiable low-rank learning (DLRL) framework that adopts back propagation to optimize the Multi-Schatten-p norm Surrogate (MSS) function. Distinct from the original optimization of this surrogate function, the proposed framework avoids singular value decomposition to admit the gradient propagation and builds a block-wise learning scheme to minimize values of Schatten-p norms. Accordingly, it speeds up the computation and makes all parameters in the proposed framework learnable according to a predefined loss function. Finally, we conduct substantial experiments in terms of image recovery and collaborative filtering. The experimental results verify the superiority of the proposed framework in terms of both runtimes and learning performance compared with other state-of-the-art low-rank optimization methods. Our codes are available at https://github.com/chenzl23/DLRL.

Abstract:
Light field (LF) cameras, which can record real-world scenes from multiple viewpoints in a single shot, are widely used in 3D reconstruction, re-focusing, virtual reality, etc. However, the inherent trade-off between spatial resolution and angular resolution of LF images hinders their applications for scenarios requiring high resolutions. In this paper, we propose a novel intra-inter view interaction network for LF image super-resolution, termed LF-IINet, to exploit the correlations among all views and simultaneously preserve the parallax structure of LF views. The proposed LF-IINet consists of two parallel branches. Specifically, the top branch extracts global inter-view information, and the bottom branch first independently maps each view to deep representations and then models the correlations among all intra-view features via the proposed multi-view context block (MCB). The two branches interact with each other through the proposed inter-assist-intra feature updating module (IntraFUM, where the intra features are updated with the assistance of the inter features) and intra-assist-inter feature updating module (InterFUM, where the inter features are updated with the assistance of the intra features). In this way, our LF-IINet incorporates rich angular and spatial information for LF image super-resolution. Extensive comparison with state-of-the-art methods demonstrates that our method achieves superior performance visually and quantitatively. Furthermore, quantitative results also show that our method is effective for LF images with either small or large disparities. Our code is shared at https://github.com/GaoshengLiu/LF-IINet.

Abstract:
Weakly supervised instance segmentation with image-level class supervision is a challenging task as it associates the highest-level instances to the lowest-level appearance. Previous approaches for the task utilize classification networks to obtain rough discriminative parts as seed regions and use distance as a metric to cluster pixels of the same instances. Unlike previous approaches, we provide a novel self-supervised joint learning framework as the basic network and consider the clustering problem as calculating the probability that pixels belong to each instance. To this end, we propose our self-supervised joint learning two-stream network (SJLT Net) to accomplish this task. In the first stream, we leverage a joint learning framework to implement image-level supervised semantic segmentation with self-supervised saliency detection. In the second stream, we propose a Center Detection Network to detect different instances’ centers with the Gaussian loss function to cluster instance pixels. Besides, an integration module is utilized to combine information of both streams and get precise pseudo instance labels. Our approach generates pseudo instance segmentation labels of training images, which are used to train a fully supervised model. Our model achieves excellent performance on the PASCAL VOC 2012 dataset, surpassing the best baseline trained with the same labels by 4.6% AP^r_50 on the train set and 2.6% AP^r_50 on the validation set.

Abstract:
In this paper, we propose a simple yet effective self-supervised method called spatio-temporal contrastive learning (ST-CL) for 3D skeleton-based action recognition. ST-CL acquires action-specific features by regarding the spatio-temporal continuity of motion tendency as the supervisory signal. To yield effective representations, ST-CL first designs some novel contrastive proxy tasks by providing different spatio-temporal observation scenes for the same 3D action and pulling them together in the embedding space. Second, three key components are devised in the action encoding to efficiently extract representations in contrastive tasks: (1) Information Representation introduces the awareness of joint type when analyzing motion dynamics. (2) Non-local GCN learns a data-driven graph topology structure and promotes spatial message passing among long-range joints in each frame. (3) Multi-Scale TCN provides larger receptive fields for capturing richer long-range temporal dynamics among adjacent frames. In ST-CL, these effective proxy tasks yield useful representations and efficient action encoding further enhances the representation capacity. As validated on four large-scale datasets, ST-CL is a strong baseline with high performance and efficiency for the contrastive learning study of the skeleton data. Compared to previous self-supervised methods, the proposed ST-CL achieves significant improvement consistently with a smaller model size and better training efficiency.

Abstract:
This work aims to temporally localize events that are both audible and visible in video. Previous methods mainly focused on temporal modeling of events with simple fusion of audio and visual features. In natural scenes, a video records not only the events of interest but also ambient acoustic noise and visual background, resulting in redundant information in the raw audio and visual features. Thus, direct fusion of the two features often causes false localization of the events. In this paper, we propose a co-attention model to exploit the spatial and semantic correlations between the audio and visual features, which helps guide the extraction of discriminative features for better event localization. Our assumption is that in an audio-visual event, shared semantic information between audio and visual features exists and can be extracted by attention learning. Specifically, the proposed co-attention model is composed of a co-spatial attention module and a co-semantic attention module that are used to model the spatial and semantic correlations, respectively. The proposed co-attention model can be applied to various event localization tasks, such as cross-modality localization and multimodal event localization. Experiments on the public audio-visual event (AVE) dataset demonstrate that the proposed method achieves state-of-the-art performance by learning spatial and semantic co-attention.

Abstract:
Achieving reliable acoustic wireless video transmissions in the extreme and uncertain underwater environment is a challenge due to the limited bandwidth and the error-prone nature of the channel. Aiming at optimizing the received video quality and the user’s experience, an adaptive solution for underwater video transmissions is proposed that is specifically designed for Multi-Input Multi-Output (MIMO)-based Software-Defined Acoustic Modems (SDAMs). To keep the video distortion under an acceptable threshold and to keep the Physical-Layer Throughput (PLT) high, cross-layer techniques utilizing diversity-spatial multiplexing and Unequal Error Protection (UEP) are presented along with the scalable video compression at the application layer. Specifically, the scalability of the utilized SDAM with high processing capabilities is exploited in the proposed structure along with the temporal, spatial, and quality scalability of the Scalable Video Coding (SVC) H.264/MPEG-4 AVC compression standard. The transmitter broadcasts one video stream and realizes multicasting at different users. Experimental results at the Sonny Werblin Recreation Center, Rutgers University-NJ, are presented. Several scenarios for unknown channels at the transmitter are experimentally considered when the hydrophones are placed in different locations in the pool to achieve the required SVC-based video Quality of Service (QoS) and Quality of Experience (QoE) given the channel state information and the robustness of different SVC scalability. The video quality level is determined by the best communication link while the transmission scheme is decided based on the worst communication link, which guarantees that each user is able to receive the video with appropriate quality.

Abstract:
In this paper, we propose a cross-modal variational auto-encoder (CMVAE) for content-based micro-video background music recommendation. CMVAE is a hierarchical Bayesian generative model that matches relevant background music to a micro-video by projecting these two multimodal inputs into a shared low-dimensional latent space, where the alignment of two corresponding embeddings of a matched video-music pair is achieved by cross-generation. Moreover, the multimodal information is fused by the product-of-experts (PoE) principle, where the semantic information in the visual and textual modalities of the micro-video is weighted according to their variance estimations such that the modality with a lower noise level is given more weight. Therefore, the micro-video latent variables contain less irrelevant information, which results in a more robust model generalization. Furthermore, we establish a large-scale content-based micro-video background music recommendation dataset, TT-150k, composed of extracted features from approximately 3,000 different background music clips associated with 150,000 micro-videos from different users. Extensive experiments on the established TT-150k dataset demonstrate the effectiveness of the proposed method. A qualitative assessment of CMVAE by visualizing some recommendation results is also included.
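
The variance-based weighting in product-of-experts fusion can be illustrated with the standard closed form for a product of univariate Gaussians, sketched below in Python. Assuming Gaussian experts is an assumption made here for illustration (the abstract does not state the distribution family), and the function name and toy numbers are hypothetical.

```python
from typing import Sequence, Tuple


def poe_gaussian(means: Sequence[float],
                 variances: Sequence[float]) -> Tuple[float, float]:
    """Fuse univariate Gaussian experts by multiplying their densities.

    The fused precision is the sum of the expert precisions, and the fused
    mean is the precision-weighted average of the expert means.
    """
    if len(means) != len(variances) or not means:
        raise ValueError("means and variances must be non-empty and equal in length")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")
    precisions = [1.0 / v for v in variances]
    total_precision = sum(precisions)
    fused_mean = sum(m * p for m, p in zip(means, precisions)) / total_precision
    return fused_mean, 1.0 / total_precision


# Hypothetical visual expert (low noise) and textual expert (higher noise).
fused_mean, fused_variance = poe_gaussian([0.0, 4.0], [1.0, 3.0])
```

In this toy case the fused mean lies closer to the low-variance expert, and the fused variance is smaller than either input variance, which matches the statement that the less noisy modality receives more weight.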

Abstract:
Detecting salient objects in videos is a very challenging task. Current state-of-the-art methods are dominated by motion based deep neural networks, among which optical flow is often leveraged as motion representation. Though with robust performance, these optical flow-based video salient object detection methods face at least two problems that may hinder their generalization and application. First, computing optical flow as a pre-processing step does not support direct end-to-end learning; second, little attention has been given to the quality of visual features due to high computational cost of spatiotemporal feature encoding. In this paper we propose a novel self-sufficient feature enhancing network (SFENet) for video salient object detection, which leverages optical flow estimation as an auxiliary task while being end-to-end trainable. With a joint training scheme of both salient object detection and optical flow estimation, its multi-task architecture can be totally self-sufficient for achieving good performance without any pre-processing. Furthermore, for improving feature quality, we design four lightweight modules in spatial and temporal domains, including cross-layer fusion, multi-level warping, spatial-channel attention and boundary-aware refinement. The proposed method is evaluated through extensive experiments on five video salient object detection datasets. Experimental results show that our SFENet can be easily trained with fast convergence speed. It significantly outperforms previous methods in terms of various evaluation metrics. Moreover, with optical flow estimation and unsupervised video object segmentation as example applications, our method also yields state-of-the-art results on standard datasets.

Abstract:
We study the task of single person dense pose estimation. Specifically, given a human-centric image, we learn to map all human pixels onto a 3D, surface-based human body model. Existing methods approach this problem by fitting deep convolutional networks on sparse annotated points where the regression on both surface coordinate components for each body part is uncorrelated and optimized separately. In this work, we devise a novel, unified loss function that explicitly characterizes the correlation for surface coordinates regression, achieving significant improvements in both accuracy and efficiency. Furthermore, based on an observation that the image-to-surface correspondence is intrinsically invariant to geometric transformations from input images, we propose to enforce a geometric equivariance consistency on the target mapping, thereby allowing us to enable reliable supervision on large amounts of unlabeled pixels. We conduct comprehensive studies on the effectiveness of our approach using a quite simple network. Extensive experiments on the DensePose-COCO dataset show that our model achieves superior performance against previous state-of-the-art methods with much less computation complexity. We hope that our work would serve as a solid baseline for future study in the field. The code will be available at https://github.com/Johnqczhang/densepose.pytorch.

Abstract:
Instead of being observed by humans, multimedia data are now increasingly fed into machines to perform different kinds of semantic analysis. One image may be analyzed multiple times by different machine vision algorithms for different purposes. While machine vision-oriented image compression has been studied, the existing methods are usually driven by a specific machine vision task, and may not be applicable to other tasks. We address task-generic image compression, in the hope that an image is compressed once but used multiple times for different tasks, all with satisfactory performance. Our study is based on end-to-end learned image compression. We focus on the distortion metric, i.e., finding a task-agnostic metric to estimate the quality of reconstructed images. On the one hand, we study deep feature distance as the metric, which transforms images into a latent space by a pretrained convolutional network (the latent space is believed to be more aligned to semantics) and calculates distance in the latent space. On the other hand, inspired by the saliency mechanism, we study an importance-weighted pixel distance as the metric, where the weights are generated to reflect the importance of the pixels to semantics. Moreover, we combine the two distances into one metric to investigate their complementary nature. An extensive set of experiments is performed to evaluate these metrics. Experimental results show that the combined metric performs the best, and leads to 20.79% to 42.69% bit savings under the same semantic analysis performance, compared to using signal fidelity metrics. Interestingly, we observe that using the combined metric also improves the visual quality of the reconstructed images.
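
As a rough illustration of the two ingredients named above, the sketch below computes an importance-weighted pixel distance and mixes it with a feature distance. The weighted mean squared error, the convex combination, the `alpha` parameter, and all numbers are assumptions for illustration; the abstract does not specify how the weights are generated or how the two distances are combined.

```python
def weighted_pixel_distance(original, reconstructed, weights):
    """Importance-weighted mean squared error over flattened pixel lists."""
    if not (len(original) == len(reconstructed) == len(weights)):
        raise ValueError("all inputs must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    squared = [
        w * (a - b) ** 2 for a, b, w in zip(original, reconstructed, weights)
    ]
    return sum(squared) / total_weight


def combined_distance(feature_distance, pixel_distance, alpha=0.5):
    """Convex combination of the two distances; alpha is a hypothetical knob."""
    return alpha * feature_distance + (1.0 - alpha) * pixel_distance


# Hypothetical flattened pixels and importance weights.
original = [0.0, 1.0, 0.5, 0.2]
reconstructed = [0.0, 0.8, 0.5, 0.6]
importance = [0.1, 0.7, 0.1, 0.1]

pixel_term = weighted_pixel_distance(original, reconstructed, importance)
# The feature distance would come from a pretrained network; 0.1 is a stand-in.
total = combined_distance(feature_distance=0.1, pixel_distance=pixel_term)
```

Errors on pixels with larger importance weights contribute more to the pixel term than equally large errors elsewhere.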

Abstract:
The rapid growth of rich multimedia data in today’s Internet, especially video traffic, has challenged the content delivery networks (CDNs). Caching serves as an important means to reduce user access latency so as to enable faster content downloads. Motivated by the dynamic nature of the real-world edge traces, this paper introduces a provably good online caching policy for dynamic environments where: 1) the popularity is highly dynamic; 2) no regular stochastic pattern can model this dynamic evaluation process. First, we design an online optimization framework, which aims to minimize the dynamic regret, i.e., the distance between an online caching policy and the best dynamic policy in hindsight. Second, we propose a dynamic online learning method to solve the non-stationary caching problem formulated in the previous framework. Compared to the linear dynamic regret of previous methods, our proposal is proved to achieve a sublinear dynamic regret, which guarantees that it is nearly optimal. We verify the design using both synthetic and real-world traces: the proposed policy achieves the best performance on the synthetic traces with different levels of dynamicity, which verifies the dynamic adaptation; our proposal consistently achieves at least 9.4% improvement over the baselines, including LRU, LFU, Static Online Learning based replacement, and Deep Reinforcement Learning based replacement, in random edge areas from real-world traces (from iQIYI), further verifying the effectiveness and robustness on the edge.
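
For readers unfamiliar with the term, one common way to write dynamic regret is sketched below. The notation is an assumption and is not taken from the paper: $f_t$ denotes the caching loss at round $t$, $x_t$ the cache configuration chosen by the online policy, and $\mathcal{X}$ the set of feasible configurations.

```latex
R_T^{\mathrm{dyn}}
  = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_t^{*}),
\qquad
x_t^{*} \in \arg\min_{x \in \mathcal{X}} f_t(x),
\qquad
\text{sublinear: } \lim_{T \to \infty} \frac{R_T^{\mathrm{dyn}}}{T} = 0 .
```

Under this reading, a sublinear dynamic regret means the average per-round gap to the best time-varying policy vanishes as the horizon grows, whereas a linear regret leaves a constant gap per round.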

Abstract:
Human pose transfer aims at transferring the appearance of the source person to the target pose. Existing methods utilizing flow-based warping for non-rigid human image generation have achieved great success. However, they fail to preserve the appearance details in synthesized images since the spatial correlation between the source and target is not fully exploited. To this end, we propose the Flow-based Dual Attention GAN (FDA-GAN) to apply occlusion- and deformation-aware feature fusion for higher generation quality. Specifically, deformable local attention and flow similarity attention, constituting the dual attention mechanism, can derive the output features responsible for deformable- and occlusion-aware fusion, respectively. Besides, to maintain the pose and global position consistency in transferring, we design a pose normalization network for learning adaptive normalization from the target pose to the source person. Both qualitative and quantitative results show that our method outperforms state-of-the-art models in public iPER and DeepFashion datasets.

Abstract:
Learning human 2D-3D correspondences aims to map all human 2D pixels to a 3D human template, namely human densepose estimation, involving surface patch recognition (i.e., Index-to-Patch (I)) and regression of patch-specific UV coordinates. Despite recent progress, it remains challenging especially under the condition of “in the wild”, where RGB images capture real-world scenes with backgrounds, occlusions, scale variations, and postural diversity. In this paper, we address three vital problems in this task: 1) how to perceive multi-scale visual information for instances “in the wild”; 2) how to design learning objectives to address the precise instance representation harassed by “multiple instances in one bounding box” phenomenon; and 3) how to boost the performance of index-to-patch prediction faced by limited supervision. To tackle problems above, we propose an end-to-end deep Adaptive Multi-path Aggregation network (AMA-net) for Human DensePose Estimation. First, we introduce an adaptive multi-path aggregation algorithm to extract varying-sized instance-level features, which capture multi-scale information of a bounding-box and are then utilized for parsing different instances. Second, we adopt an instance augmentation learning objective to further distinguish the target instance from other interference instances. Third, taking advantage of 2D human parsers that are trained from sufficient annotations, we introduce a task transformer that bridges the “gap” between 2D human parsing and densepose estimation, thus benefiting the performance of densepose estimator. Experimental results on the challenging DensePose-COCO dataset demonstrate that our approach sets a new record, and it significantly outperforms the state-of-the-art methods. Codes and models are publicly available.

Abstract:
To leverage the strong cross-frame relations of videos, many video semantic segmentation methods tend to explore feature reuse and feature warping based on motion clues. However, since the video dynamics are too complex to model accurately, some warped feature values may be invalid. Moreover, the warping errors can accumulate across frames, thereby resulting in degraded segmentation performance. To tackle this problem, we present an efficient distortion map-guided feature rectification method for video semantic segmentation, specifically targeting the feature updating and correction on the distorted regions with unreliable optical flow. The updated features for the distorted regions are extracted from a light correction network (CoNet). A distortion map serves as the weighted attention to guide the feature rectification by aggregating the warped features and the updated features. The generation of the distortion map is simple yet effective in predicting the distorted areas in the warped features, i.e., moving boundaries, thin objects, and occlusions. In addition, we propose an auxiliary edge-semantics loss to implement the distorted region supervision with classes. Our network is trained in an end-to-end manner and highly modular. Comprehensive experiments on Cityscapes and CamVid datasets demonstrate that the proposed method has achieved state-of-the-art performance by weighing accuracy, inference speed, and temporal consistency on video semantic segmentation.

Abstract:
In the real world, some views of samples are often missing from the collected multiview data. Faced with incomplete multiview data, most existing clustering methods tend to learn a common graph from the available views, ignoring the hidden information of the absent views. Furthermore, some methods fill the absent instances with the average vector of the available samples for each view, which cannot reflect the real distribution of the data. To solve these problems, in this paper an intrinsic and complete structure learning based incomplete multiview clustering method (ICSL_IMC) is proposed. Firstly, we calculate the initial complete graphs for all views by exploring the available incomplete graphs, which are further taken as the constraints for the reconstruction of the absent data integrating the self-representation method. Afterwards, encouraged by the complete multiview data, a complete structure inferring strategy is proposed to learn the intrinsic and complete structures for all views, such that the real distribution of the absent instances can be reflected in the completed structure of each view. We integrate these three learning phases into a joint optimization model, in which the phases promote each other during the iterative learning procedure. Compared with other state-of-the-art methods, the proposed ICSL_IMC achieves the best performance on different databases.

Abstract:
Image-text matching, as a fundamental cross-modal task, bridges the gap between vision and language. The core is to accurately learn semantic alignment to find relevant shared semantics in image and text. Existing methods typically attend to all fragments with word-region similarity greater than an empirical threshold of zero as relevant shared semantics, e.g., via a ReLU operation that forces the negative to zero and maintains the positive. However, this fixed threshold is totally isolated from feature learning and cannot adaptively and accurately distinguish the varying distributions of relevant and irrelevant word-region similarity in training, inevitably limiting the semantic alignment learning. To solve this issue, we propose a novel Unified Adaptive Relevance Distinguishable Attention (UARDA) mechanism, incorporating the relevance threshold into a unified learning framework, to maximally distinguish the relevant and irrelevant distributions to obtain better semantic alignment. Specifically, our method adaptively learns the optimal relevance boundary between these two distributions to help the model learn more discriminative features. The explicit relevance threshold is well integrated into similarity matching, which kills two birds with one stone: (1) excluding the disturbances of irrelevant fragment contents to aggregate precisely relevant shared semantics for boosting matching accuracy, and (2) avoiding the calculation of irrelevant fragment queries for reducing retrieval time. Experimental results on benchmarks show that UARDA can substantially and consistently outperform state-of-the-art methods, with relative rSum improvements of 2%−4% (16.9%−35.3% for baseline SCAN), and reduce the retrieval time by 50%−73%.
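
The difference between a fixed zero threshold and an explicit relevance threshold can be seen in a small sketch. This is only a toy filter: the function name, the similarity values, and the threshold of 0.10 are hypothetical, and the way UARDA actually learns its threshold is not reproduced here.

```python
def relevant_similarities(similarities, threshold=0.0):
    """Keep word-region similarities above the threshold, zero out the rest.

    With threshold=0.0 this matches the ReLU-style filtering of existing
    methods: negatives become zero and positives are kept unchanged.
    """
    return [s if s > threshold else 0.0 for s in similarities]


# Hypothetical word-region similarities for one word against four regions.
sims = [0.30, -0.10, 0.05, 0.20]

fixed = relevant_similarities(sims)             # fixed threshold of zero
adaptive = relevant_similarities(sims, 0.10)    # stand-in for a learned value

# Fragments filtered out need no further attention computation.
kept_fixed = sum(1 for s in fixed if s != 0.0)
kept_adaptive = sum(1 for s in adaptive if s != 0.0)
```

In this toy case the weakly positive similarity of 0.05 survives the zero threshold but is discarded by the higher one, so fewer fragments remain to be aggregated.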

Abstract:
Recent advances in unsupervised domain adaptation (UDA) techniques have witnessed great success in cross-domain computer vision tasks, enhancing the generalization ability of data-driven deep learning architectures by bridging the domain distribution gaps. For the UDA-based cross-domain object detection methods, the majority of them alleviate the domain bias by inducing the domain-invariant feature generation via adversarial learning strategy. However, their domain discriminators have limited classification ability due to the unstable adversarial training process. Therefore, the extracted features induced by them cannot be perfectly domain-invariant and still contain domain-private factors, bringing obstacles to further alleviate the cross-domain discrepancy. To tackle this issue, we design a Domain Disentanglement Faster-RCNN (DDF) to eliminate the source-specific information in the features for detection task learning. Our DDF method facilitates the feature disentanglement at the global and local stages, with a Global Triplet Disentanglement (GTD) module and an Instance Similarity Disentanglement (ISD) module, respectively. By outperforming state-of-the-art methods on four benchmark UDA object detection tasks, our DDF method is demonstrated to be effective with wide applicability.

Abstract:
Cross-domain Facial Expression Recognition (FER) aims to safely transfer the learned knowledge from labeled source data to unlabeled target data, which is challenging due to the subtle difference between various expressions and the large discrepancy between domains. Existing methods mainly focus on reducing the domain shift for transferable features but fail to learn discriminative representations for recognizing facial expression, which may result in negative transfer under cross-domain settings. To this end, we propose a novel Deep Margin-Sensitive Representation Learning (DMSRL) framework, which can extract multi-level discriminative features during semantic-aware domain adaptation. Specifically, we design a semantic metric learning module based on the category prior of source data and generated pseudo labels of target data, which can facilitate discriminative intra-domain representation learning and transferable inter-domain knowledge discovery by enlarging the category margin. Moreover, we develop a mutual information minimization module by simultaneously distilling the domain-invariant components and eliminating the domain-sensitive ones, which benefits discriminative transferable feature learning by generating accurate pseudo target labels. Furthermore, instead of only utilizing the global features, we formulate a multi-level feature extracting module to concurrently get the local ones, which contain detailed information to distinguish the small changes among different expressions. These modules are jointly utilized in our DMSRL in an end-to-end manner to ensure the positive transfer of source knowledge. Extensive experimental results on seven databases demonstrate that our DMSRL can achieve superior performance against state-of-the-art baselines.

Abstract:
We introduce a Gaussian Mixture Model (GMM) framework for 3D holoscopic image compression in this paper. The elemental-images of the 3D holoscopic image are predicted using GMM and the parameters of GMM are estimated using the common Expectation-Maximization (EM) algorithm. GMM Model Optimization (GMO) is used in this framework to select the optimal number of distributions and avoid local optimum of EM at the same time. A three-dimensional distribution-rotation based decomposition is proposed to change covariance parameters to meaningful features and improve the coding efficiency. The features and the remaining parameters of the GMM are encoded using fixed-length bits. A feature-based dictionary is proposed in this framework to match the similar Gaussian distributions utilizing the similar GMM features. And the offsets of the matched distributions are recorded as motion vectors to replace the similar areas in the elemental-images of the 3D holoscopic image. The residual between the original image and the prediction is encoded using Screen Content Coding Extension of High Efficiency Video Coding (HEVC-SCC). Experimental results show that our method performs better than HEVC-SCC, two coding methods based on pseudo-sequences and a state-of-the-art content-based compression method with Gaussian process regression.

Abstract:
Estimating absolute 3D poses of multiple people from a monocular image is challenging due to the presence of occlusions and the scale variation among different persons. Among the existing methods, the top-down paradigms are highly dependent on human detection which is prone to the influence from inter-person occlusions, while the bottom-up paradigms suffer from the difficulties in keypoint feature extraction caused by scale variation and unreliable joint grouping caused by occlusions. To address these challenges, we introduce a novel multi-person 3D pose estimation framework, aided by multi-scale feature representations and human depth perceiving. Firstly, a waterfall-based architecture is incorporated for multi-scale feature representations to achieve a more accurate estimation of occluded joints with a better detection of human shapes. Then the global and local representations are fused for handling the effects of inter-person occlusion and scale variation in depth perceiving and keypoint feature extraction. Finally, with the guidance of the fused multi-scale representations, a depth-aware model is exploited for better 2D joint grouping and 3D pose recovery. Quantitative and qualitative evaluations on benchmark datasets of MuCo-3DHP and MuPoTS-3D prove the effectiveness of our proposed method. Furthermore, we produce an occluded MuPoTS-3D dataset and the experiments on it validate the superiority of our method for overcoming the occlusions.

Abstract:
Due to the expensive and laborious annotations of labeled data required by fully-supervised learning in the crowd counting task, it is desirable to explore a method to reduce the labeling burden. There exist a large number of unlabeled images in the wild that can be easily obtained compared to labeled datasets. Based on the characteristics of consistent spatial transformation with the annotations of heads and image, this paper proposes a self-supervised learning framework with unlabeled and limited labeled data for pre-training and fine-tuning a crowd counting model (SSL-FT). It includes an online network and a target network that receive the same images but are randomly processed by two defined augmentation transformations. We leverage unlabeled data to pre-train the online network based on a self-supervised loss and small-scale labeled data to transfer the model to a specific domain based on a fully-supervised loss. We demonstrate the effectiveness of the SSL-FT on four public datasets including ShanghaiTech PartA, PartB, UCF-QNRF and WorldExpo'10 utilizing a classical counting model. Experimental results show that our approach performs better than state-of-the-art semi-supervised methods.

Abstract:
Image segmentation is a fundamental building block of automatic medical applications. It has been greatly improved since the emergence of deep neural networks. However, deep-learning based models often require a large number of manual annotations, which has seriously hindered their practical usage. To alleviate this problem, numerous works have been proposed that utilize unlabeled data based on semi-supervised frameworks. Recently, the Mean-Teacher (MT) model has been successfully applied in many scenarios due to its effective learning strategy. Nevertheless, the existing MT model still has certain limitations. Firstly, various sorts of perturbations are often added to the training data to gain extra generalization ability through consistency training. However, if the variation is too weak, it may cause the Lazy Student Phenomenon, and bring large fluctuations to the learning model. On the contrary, large image perturbations may enlarge the performance gap between the teacher and student. In this case, the student may lose its learning momentum, and more seriously, drag down the overall performance of the whole system. In order to address these issues, we introduce a novel semi-supervised medical image segmentation framework, in which a Cross-Mix Teaching paradigm is proposed to provide extra data flexibility, thus effectively avoiding the Lazy Student Phenomenon. Moreover, a lightweight Transductive Monitor is applied to serve as the bridge that connects the teacher and student for active knowledge distillation. In the light of this cross-network information mixing and transfer mechanism, our method is able to continuously explore the discriminative information contained in unlabeled data. Extensive experiments on challenging medical image data sets demonstrate that our method is able to outperform current state-of-the-art semi-supervised segmentation methods under severe lack of supervision.
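
As background for the Mean-Teacher model mentioned above, the sketch below shows the standard teacher update, in which the teacher's weights track an exponential moving average of the student's weights. It does not implement Cross-Mix Teaching or the Transductive Monitor; the flat weight lists and the decay value are assumptions chosen for illustration.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight toward the matching student weight.

    teacher and student are flat lists of floats standing in for network
    parameters; decay close to 1.0 makes the teacher change slowly.
    """
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same size")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Hypothetical two-parameter networks.
teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, decay=0.9)
```

Because the teacher is a smoothed copy of the student, the strength of the input perturbations largely determines how far their predictions differ, which is the trade-off the abstract discusses.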

Abstract:
Unsupervised cross-domain object counting, which generalizes the model from the source domain to the unlabeled target domain, has recently received great attention in computer vision. However, it is an extremely challenging task because only unlabeled data is available from the target domain and the domain gap between two domains is implicit in object counting. In this paper, we propose a latent domain generation method to improve the generalization ability of unsupervised domain adaptation object counting by generating a latent domain. To this end, we propose a domain generator with random perturbations to learn a new latent distribution derived from the original source distribution. The latent domain generator can extract target information sampled in its stochastic latent representation, which preserves the original target information and enhances diversity. Meanwhile, to ensure that the generated latent domain is consistent with the source domain in counting performance, we introduce a consistency loss to encourage similar output from latent and source domains. Moreover, to enhance the adaptation ability of the generated latent domain, we apply the adversarial loss to achieve alignment between the latent and target domains. The domain generator with the adversarial loss and consistency loss ensures that the generated domain is aligned to the target while also improving the robustness of the original source domain model. Experiments indicate that our framework can effortlessly extend to scenarios with different objects (crowd, cars). The experiments also demonstrate the effectiveness of our method on unsupervised realistic-to-realistic crowd counting problems.

Abstract:
Action Recognition (AR) is facing a scalability problem, since collecting and annotating data for the ever-growing action categories is exhausting and impractical. As an alternative to AR, Zero-Shot Action Recognition (ZSAR) is attracting increasing attention in the community, as it can utilize a shared semantic/attribute space to recognize novel categories without annotated data. Unlike AR, which focuses on learning the correlation between actions, ZSAR needs to consider the action-action, label-label and action-label correlations at the same time. However, as far as we know, no existing work provides structural guidance for the framework design of ZSAR according to its task characteristics. In this paper, we demonstrate the rationality of using the Energy-Based Model (EBM) to guide the framework design of ZSAR based on their inference mechanism. Furthermore, under the guidance of EBM, we propose an Energy-based Temporal Summarized Attentive Network (ETSAN) to achieve ZSAR. Specifically, to ensure the effectiveness of cross-modal matching, EBM needs to capture the correlations of input-input, output-output and input-output, based on discriminative and focused input and output space. To this end, we first design the Temporal Summarized Attentive Mechanism (TSAM) to capture the correlation of action-action by constructing discriminative and focused input space. Then, a Label Semantic Adaptive Mechanism (LSAM) is proposed to learn the correlation of label-label by adjusting the semantic structure according to the target task. Finally, we devise an Energy Score Estimation Mechanism (ESEM) to measure the compatibility (i.e. energy score) between video representation and label semantic embedding. With end-to-end training, our framework can capture all three of the correlations mentioned above simultaneously by minimizing the energy score of the correct action-label pair. Experiments on the HMDB51 and UCF101 datasets show that the proposed architecture achieves comparable results among methods based on the spatial-temporal visual feature of sequence-level, which demonstrates the efficiency of the EBM in guiding the framework design of ZSAR.

Abstract:
Salient instance segmentation (SIS) can be considered as the next generation task for the saliency detection community. Most of the existing state-of-the-art methods used for this novel challenging task are built on the mainstream Mask R-CNN architecture. However, this mechanism relies heavily on hand-designed anchors and NMS post-processing. In this paper, we provide a one stage SIS framework with transformers, termed Orientative Query Transformer (OQTR). To leverage the long-range dependencies of transformers, a cross fusion module is designed to efficiently fuse the global features in the encoder and salient query features for salient mask prediction. Furthermore, derived from the center prior in traditional saliency models, we propose an orientative query that is considered as the initial salient object query to accelerate convergence. In addition, to mitigate the issue of the lack of a large-scale dataset with salient instance labels, we collect a new SIS dataset (SIS10K) containing over 10K images elaborately annotated with both object- and instance-level labels to promote the community. Without any post-processing, our end-to-end OQTR framework significantly surpasses the top-1 RDPNet by an average of 13.1% AP scores across all three challenging datasets, demonstrating the strong performance of the proposed OQTR. The code and the dataset proposed in this work are available at: https://github.com/ssecv/OQTR.

Abstract:
In this paper, we present a dynamic convolution kernel (DCK) strategy for convolutional neural networks. Using a fully convolutional network with the proposed DCKs, high-quality talking-face video can be generated from multi-modal sources (i.e., unmatched audio and video) in real time, and our trained model is robust to different identities, head postures, and input audios. Our proposed DCKs are specially designed for audio-driven talking face video generation, leading to a simple yet effective end-to-end system. We also provide a theoretical analysis to interpret why DCKs work. Experimental results show that our method can generate high-quality talking-face video with background at 60 fps. Comparison and evaluation between our method and the state-of-the-art methods demonstrate the superiority of our method.

Abstract:
We aim to address a new task named few-shot early action prediction (FS-EAP) that learns classifiers for novel actions from only a few partially observed videos. We argue that the task is extremely challenging since the partially observed videos do not contain enough action information in a few-shot environment. To tackle this task, in this paper, we propose a scene-aware spatio-temporal graph neural network (SA-STGNN) by leveraging the fine-grained spatio-temporal interactions in the video scenes. Specifically, we first generate a spatio-temporal graph corresponding to the partially observed video to capture comprehensive spatio-temporal correlations. Then we utilize the spatio-temporal graph as the input of our SA-STGNN and predict the augmented video features corresponding to the complete video. The architecture uses several scene-aware learning blocks, which are a combination of edge fusion graph neural layers and temporal gated convolutional layers to jointly model spatial and temporal dependencies. Finally, we employ an early action predictor to exploit the learned video features for predicting actions in the few-shot setting. Extensive experimental results on two widely adopted video datasets demonstrate the effectiveness of our approach and its superior performance over the state-of-the-art approaches.

Abstract:
Multimodal sequence learning aims to utilize information from different modalities to enhance overall performance. Mainstream works often follow an intermediate-fusion pipeline, which explores both modality-specific and modality-supplementary information for fusion. However, the unaligned and heterogeneously distributed multimodal sequences pose significant challenges to the fusion task: 1) to extract both effective unimodal and crossmodal representations and 2) to overcome the overfitting issue in joint multimodal sequence optimization. In this work, we propose regularized expressive representation distillation (RERD) that aims to seek effective multimodal representations and to enhance the generalization of fusion. First, to improve unimodal representation learning, unimodal representations are assigned to multi-head distillation encoders, where the unimodal representations are iteratively updated through distillation attention layers. Second, to alleviate the overfitting issue in joint crossmodal optimization, a multimodal sinkhorn distance regularizer is proposed to reinforce the expressive representation extraction and to reduce the modality gap before fusion adaptively. These representations produce a comprehensive view of the multimodal sequences, which are utilized for downstream fusion tasks. Experimental results on several popular benchmarks demonstrate that the proposed method achieves state-of-the-art performance, compared with widely used baselines for deep multimodal sequence fusion, as shown in https://github.com/Redaimao/RERD.

Abstract:
Zero-shot Learning (ZSL) aims to transfer knowledge from seen image categories to unseen ones by leveraging semantic information. It is generally assumed that the seen and unseen objects share a common semantic space. Most of existing ZSL methods focus on how to connect the visual space and the semantic space. However, since there are some visual distribution differences between seen and unseen objects, the projection function learned by those seen classes is biased when transferring knowledge to unseen classes. We argue that, although the unseen objects are class-agnostic, the visual distribution information of unseen samples can be generated by exploiting semantic features. In this paper, we propose a Compound Projection Learning (CPL) model to transfer knowledge from seen to unseen objects by exploiting the information of both seen and class-agnostic samples. With the projected semantic representation by CPL, effective constraints such as projection loss and semantic reconstruction loss can be explored for seen and unseen objects, respectively, such that the semantic ambiguity across seen and unseen objects is reduced. Additionally, we utilize a similarity network to further explore the inter-class relationship by employing the labels and the similarities between seen and unseen classes. Extensive experiments on ZSL benchmark datasets show the effectiveness of our proposed approach.

Abstract:
Unsupervised domain adaptation is an appealing technique to learn robust classifiers for unlabeled target domain by borrowing knowledge from well-established source domain. However, previous works mainly suffer from two limitations: 1) the classifier trained on labeled source data may be prone to overfitting the source distribution, lowering its performance on the target domain; 2) the adaptation process will be misled by conditional distribution matching using hard pseudo labels of target samples. This paper presents a Dual-Level Adaptive and Discriminative (DLAD) classifier learning framework, in which transfer classifier and distribution adaptation can be mutually beneficial for effective knowledge transfer. Specifically, we aim to achieve a domain-level adaptive classifier by considering structural risk minimization (SRM) on both domains and performing weighted distribution adaptation, which facilitates joint classifier learning in a semi-supervised manner. To further achieve a class-level discriminative classifier, we explicitly leverage unlabeled target data to promote classifier learning based on class probabilities, which refines the decision boundary to be more discriminative for unlabeled target data. To the best of our knowledge, DLAD is the first attempt to consider the principle of SRM on the target domain, which significantly boosts the discriminative power of transfer classifier and yields a tighter generalization bound. Experimental evaluations on several standard cross-domain datasets show that DLAD significantly outperforms other competitive methods.

Abstract:
Video prediction has always been a very challenging problem in video representation learning due to the complexity in spatial structure and temporal variation. However, existing methods mainly predict videos by employing language-based memory structures from the traditional Long Short-Term Memories (LSTMs) or Gated Recurrent Units (GRUs), which may not be powerful enough to model the long-term dependencies in videos, which consist of much more complex spatiotemporal dynamics than sentences. In this paper, we propose a SpatioTemporal Attention based Memory (STAM), which can efficiently improve the long-term spatiotemporal memorizing capacity by incorporating the global spatiotemporal information in videos. In the temporal domain, the proposed STAM aims to observe temporal states from a wider temporal receptive field to capture accurate global motion information. In the spatial domain, the proposed STAM aims to jointly utilize both the high-level semantic spatial state and the low-level texture spatial states to model a more reliable global spatial representation for videos. In particular, the global spatiotemporal information is extracted with the help of an Efficient SpatioTemporal Attention Gate (ESTAG), which can adaptively apply different levels of attention scores to different spatiotemporal states according to their importance. Moreover, the proposed STAM is built with 3D convolutional layers due to their advantages in modeling spatiotemporal dynamics for videos. Experimental results show that the proposed STAM can achieve state-of-the-art performance on widely used datasets by leveraging the proposed spatiotemporal representations for videos.

Abstract:
Scene text recognition is a challenging task in the computer vision field due to the diversity of text styles and the complexity of the image backgrounds. In recent decades, numerous text rectification and recognition methods have been proposed to solve these problems. However, most of these methods rectify texts at the geometry level or pixel level. The former is limited by geometric constraints, and the latter is prone to blurring the text. In this paper, we propose a two-level rectification attention network (TRAN) to rectify and recognize texts. This network consists of two parts: a two-level rectification network (TORN) and an attention-based recognition network (ABRN). Specifically, the TORN first rectifies texts at the geometry level and then performs a pixel-level adjustment, which not only eliminates the geometric constraints but also renders clear texts. The ABRN’s role is to recognize text in the rectified images. To improve the feature extraction ability of our model, we design a new channel-wise and kernel-wise attention unit, which enables the network to handle significant variations of character size and channel interdependencies. Furthermore, we propose a skip training strategy to make our model converge smoothly. We conduct experiments on various benchmarks, including regular and irregular datasets. The experimental results show that our method achieves a state-of-the-art performance.

Abstract:
The unpaired image-to-image translation aims to translate input images from one source domain to some desired outputs in a target domain by learning from unpaired training data. Cycle-consistency constraint provides a general principle to estimate and measure forward and backward mapping functions between two domains. In many cases, the information entropy of images from the two domains is not equal, resulting in an information-rich domain and an information-poor domain. However, existing solutions based on cycle-consistency either completely discard the information asymmetry between the two domains (a common choice), which leads to inferior translation performance for the asymmetric unpaired image-to-image translation, or have to rely on special task-specific designs and introduce extra loss components. These elaborative designs especially for the relatively harder translation direction from the information-poor domain to the information-rich domain (“poor-to-rich” translation) require extra labor and are limited to some specific tasks. In this paper, we propose a novel asynchronous generative adversarial network named Async-GAN, which provides a model-agnostic framework for easily turning symmetrical models into powerful asymmetric counterparts that can handle asymmetric unpaired image-to-image translation much better. The key innovation is to iteratively build gradually-improving intermediate domains for generating pseudo paired training samples, which provide stronger full supervision for assisting the poor-to-rich translation. Extensive experiments on various asymmetric unpaired translation tasks demonstrate the superiority of the proposal. Furthermore, the proposed training framework could be extended to various Cycle-GAN solutions and achieve a performance gain.

Abstract:
While most of the HTTP adaptive streaming (HAS) traffic continues to be video-on-demand (VoD), more users have started generating and delivering live streams with high quality through popular online streaming platforms. Typically, the video contents are generated by streamers and watched by large audiences that are geographically distributed far away from the streamers’ locations. The locations of streamers and audiences create a significant challenge in delivering HAS-based live streams with low latency and high quality. Any problem in the delivery paths will result in a reduced viewer experience. In this paper, we propose HxL3, a novel architecture for low-latency live streaming. HxL3 is agnostic to the protocol and codecs and can work equally with existing HAS-based approaches. By holding the minimum number of live media segments through efficient caching and prefetching policies at the edge, improved transmissions, as well as transcoding capabilities, HxL3 is able to achieve high viewer experiences across the Internet by alleviating rebuffering and substantially reducing initial startup delay and live stream latency. HxL3 can be easily deployed and used. Its performance has been evaluated using real live stream sources and entities that are distributed worldwide. Experimental results show the superiority of the proposed architecture and give good insights into how low-latency live streaming works.

Abstract:
The objectives of this work are cross-modal text-audio and audio-text retrieval, in which the goal is to retrieve the audio content from a pool of candidates that best matches a given written description and vice versa. Text-audio retrieval enables users to search large databases through an intuitive interface: they simply issue free-form natural language descriptions of the sound they would like to hear. To study the tasks of text-audio and audio-text retrieval, which have received limited attention in the existing literature, we introduce three challenging new benchmarks. We first construct text-audio and audio-text retrieval benchmarks from the AudioCaps and Clotho audio captioning datasets. Additionally, we introduce the SoundDescs benchmark, which consists of paired audio and natural language descriptions for a diverse collection of sounds that are complementary to those found in AudioCaps and Clotho. We employ these three benchmarks to establish baselines for cross-modal text-audio and audio-text retrieval, where we demonstrate the benefits of pre-training on diverse audio tasks. We hope that our benchmarks will inspire further research into audio retrieval with free-form text queries.

Abstract:
Personalized micro-video recommendation has attracted a lot of research attention with the growing popularity of micro-video sharing platforms. Many efforts have been made to consider micro-video recommendation as a matching task and shown promising performance, while they only focus on simple features or multi-modal attribute information. Recently, Graph Neural Networks (GNNs) have been employed in many recommendation tasks and achieved impressive success. However, these GNN-based methods may suffer from the following limitations: (1) fail to capture the heterogeneity of nodes in user-video bipartite graphs; (2) ignore the non-local (global) semantic correlation information remained in heterogeneous graphs. In this paper, we present a novel approach, Heterogeneous Graph Contrastive Learning Network (HGCL), for personalized micro-video recommendation. To consider heterogeneity in user-video bipartite graphs, we first introduce a heterogeneous graph encoder network for a high-quality representation learning of users and micro-videos. Specifically, we design a random surfing model to generate node-type specific homogeneous graphs to preserve the heterogeneity. Then we propose a graph contrastive learning framework to achieve representation learning on each node-type specific homogeneous graph by maximizing the mutual information between local patches of a graph and the global representation of the entire graph. Finally, a type-crossing objective function is proposed to jointly integrate the node embeddings from different node types to facilitate high-quality representation learning. Experimental results on real-world datasets in the micro-video recommendation task validate the performance of our method, compared with state-of-the-art baseline algorithms.

Abstract:
Weakly supervised salient object detection (WSOD) aims at training saliency detection models with weak supervision. Normally, the WSOD methods use pseudo labels converted from image-level classification labels to train the saliency network. However, the converted pseudo labels always contain noise information compared to ground truth. Previous methods are directly affected by pseudo label noise to generate error-prone predictions. To mitigate this problem, we design a noise-robust adversarial learning framework and propose a noise-sensitive training strategy for the framework. The framework consists of a saliency network and a noise-robust discriminator network. With the guidance of noise-robust discriminator network, our saliency network is robust to noise information in pseudo labels. The proposed noise-sensitive training strategy can make good use of both superior and inferior samples in the pseudo label dataset. With the noise-sensitive training strategy, our framework can further balance the learning of saliency information and the robustness of noise information. Comprehensive experiments on five public datasets demonstrate that our method outperforms the existing image-level classification label based WSOD methods.

Abstract:
Volumetric video enables a six-degree-of-freedom (6DoF) immersive viewing experience and has a wide range of applications in entertainment and education, among others. Most existing approaches to volumetric video streaming are extensions of VR video streaming solutions that do not take into account user behavior and the properties of the video during the tiling process, and the complexity of decoding is high. To this end, we study volumetric video streaming in this paper and address the research questions mentioned above. In particular, we first propose a hybrid visual saliency and hierarchical clustering empowered 3D tiling scheme that better matches the user’s field of view (FoV). Then, we build a quality of experience (QoE) model considering the volumetric video features as the optimization objective. In addition to the usual encoded version, we introduce the reconstructed version (i.e., decoded version, which allows the user to skip the decoding process and thus reduces the decoding overhead) and propose a joint computational and communication resource allocation scheme to achieve a trade-off between communication and computational resources to maximize the QoE. We perform exhaustive simulations and build a prototype system to verify the performance of the proposed tiling and transmission scheme. The results show that the proposed tiling and transmission scheme performs significantly better than the comparison schemes.

Abstract:
A high-quality image description requires not only the logic and fluency of language but also the richness and accuracy of content. However, due to the semantic gap between vision and language, most existing image captioning approaches that directly learn the cross-modal mapping from vision to language are difficult to meet these two requirements simultaneously. Inspired by the progressive learning mechanism, we trace the “generating + refining” route and propose a novel Text-Guided Generation and Refinement (dubbed as TGGAR) model with assistance from the guide text to improve the quality of captions. The guide text is selected from the training set according to content similarity, then utilized to explore salient objects and extend candidate words. Specifically, we follow the encoder-decoder architecture, and design a Text-Guided Relation Encoder (TGRE) to learn the visual representation that is more consistent with human visual cognition. Besides, we divide the decoder part into two sub-modules: a Generator for the primary sentence generation and a Refiner for the sentence refinement. Generator, consisting of a standard LSTM and a Gate on Attention (GOA) module, aims to generate the primary sentence logically and fluently. Refiner contains a caption encoder module, an attention-based LSTM and a GOA module, which iteratively modifies the details in the primary caption to make captions rich and accurate. Extensive experiments on the MSCOCO captioning dataset demonstrate our framework with fewer parameters remains comparable to transformer-based methods, and achieves state-of-the-art performance compared with other relevant approaches.

Abstract:
Point cloud is a major representation format of 3D objects and scenes. It has been increasingly applied in various applications due to the rapid advances in 3D sensing and rendering technologies. In the field of autonomous driving, point clouds captured by spinning Light Detection And Ranging (LiDAR) devices have become an informative data source for road environment perception and intelligent vehicle control. On the other hand, the massive data volume of point clouds also brings huge challenges to point cloud transmission and storage. Therefore, establishing compression frameworks and algorithms that conform to the characteristics of point cloud data has become an important research topic for both academia and industry. In this paper, a geometry compression method dedicated to spinning LiDAR point cloud was proposed taking advantage of the prior information of the LiDAR acquisition procedure. Rate-distortion optimizations were further integrated into the coding pipeline according to the characteristics of the prediction residuals. Experimental results obtained on different datasets show that the proposed method consistently outperforms the state-of-the-art G-PCC predictive geometry coding method with reduced runtime at both the encoder and decoder sides.

Abstract:
Existing methods detect the keypoints in a non-differentiable way; therefore, they cannot directly optimize the position of keypoints through back-propagation. To address this issue, we present a partially differentiable keypoint detection module, which outputs accurate sub-pixel keypoints. The reprojection loss is then proposed to directly optimize these sub-pixel keypoints, and the dispersity peak loss is presented for accurate keypoints regularization. We also extract the descriptors in a sub-pixel way, and they are trained with the stable neural reprojection error loss. Moreover, a lightweight network is designed for keypoint detection and descriptor extraction, which can run at 95 frames per second for 640×480 images on a commercial GPU. On homography estimation, camera pose estimation, and visual (re-)localization tasks, the proposed method achieves performance equivalent to the state-of-the-art approaches, while greatly reducing the inference time.
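To make the idea of a differentiable sub-pixel keypoint concrete, the following minimal sketch refines an integer peak location with a soft-argmax over its local score patch. Soft-argmax is one common way to obtain sub-pixel positions and is assumed here for illustration; the abstract does not state which operator the authors use, and the function name, patch size, and `temperature` parameter are hypothetical.

```python
import math

def soft_argmax_offset(patch, temperature=0.1):
    """Return the (dx, dy) sub-pixel offset of a peak from its score patch.

    `patch` is a square list of lists (odd side length) of detector scores
    centered on the integer peak. The offset is the softmax-weighted mean of
    the pixel coordinates, so it is a smooth function of the scores.
    """
    r = len(patch) // 2
    peak = max(max(row) for row in patch)  # subtract the max for numerical stability
    weights = [[math.exp((v - peak) / temperature) for v in row] for row in patch]
    total = sum(sum(row) for row in weights)
    dx = sum(w * (j - r) for row in weights for j, w in enumerate(row)) / total
    dy = sum(w * (i - r) for i, row in enumerate(weights) for w in row) / total
    return dx, dy

# A peak whose right-hand neighbor also scores highly is pulled to the right.
skewed_patch = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.9],
    [0.0, 0.0, 0.0],
]
dx, dy = soft_argmax_offset(skewed_patch)
```

Because the offset is a weighted average rather than a hard maximum, a loss defined on the refined position can propagate gradients back to the score map in an autodiff framework.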

Abstract:
Recently, there has been an increasing interest in image editing methods that employ pre-trained unconditional image generators (e.g., StyleGAN). However, applying these methods to translate images to multiple visual domains remains challenging. Existing works do not often preserve the domain-invariant part of the image (e.g., the identity in human face translations), or they do not usually handle multiple domains or allow for multi-modal translations. This work proposes an implicit style function (ISF) to straightforwardly achieve multi-modal and multi-domain image-to-image translation from pre-trained unconditional generators. The ISF manipulates the semantics of a latent code to ensure that the image generated from the manipulated code lies in the desired visual domain. Our human faces and animal image manipulations show significantly improved results over the baselines. Our model enables cost-effective multi-modal unsupervised image-to-image translations at high resolution using pre-trained unconditional GANs. The code and data are available at: https://github.com/yhlleo/stylegan-mmuit.

Abstract:
Multispectral pedestrian detection is an important and valuable task in many applications, which could provide a more accurate and reliable pedestrian detection result by using the complementary visual information from color and thermal images. However, it faces two open and difficult challenges: 1) how to effectively and dynamically integrate multispectral information according to the confidence of different modalities, and 2) how to produce a reliable prediction result. In this paper, we propose a novel confidence-aware multispectral pedestrian detection (CMPD) method, which flexibly learns the multispectral representation while simultaneously producing a reliable result with confidence estimation. Specifically, a dense fusion strategy is first proposed to extract the multilevel multispectral representation at the feature level. Then, an additional confidence subnetwork is utilized to dynamically estimate the detection confidence for each modality. Finally, Dempster's combination rule is introduced to fuse the results of different branches according to the rectified confidence. Our proposed CMPD method not only effectively integrates multimodal information but also provides a reliable prediction. Extensive experimental results demonstrate the efficiency of our algorithm compared with state-of-the-art methods.
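For readers unfamiliar with Dempster's combination rule mentioned above, here is a minimal sketch of the rule itself applied to two hypothetical branches. The frame of discernment (`pedestrian`, `background`), the branch names, and the mass values are assumptions made for illustration; they are not taken from the paper, and the confidence rectification step is not modeled.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass that falls on contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    # Renormalize by the non-conflicting mass.
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

PED = frozenset({"pedestrian"})
BG = frozenset({"background"})
THETA = PED | BG  # mass on the whole frame expresses uncertainty

# Hypothetical per-modality beliefs for one candidate box.
color = {PED: 0.6, BG: 0.1, THETA: 0.3}
thermal = {PED: 0.7, BG: 0.2, THETA: 0.1}
fused = dempster_combine(color, thermal)
```

In this example the fused belief in `pedestrian` is higher than either branch's individual belief, because both branches agree and the residual uncertainty mass is redistributed.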

Abstract:
This paper focuses on a new problem of estimating human pose and shape from single polarization images. A polarization camera can capture the polarization of reflected light, which preserves rich geometric cues of an object's surface. Inspired by the recent applications in surface normal reconstruction from polarization images, in this paper, we attempt to estimate human pose and shape from single polarization images by leveraging the polarization-induced geometric cues. A dedicated two-stage pipeline is proposed: given a single polarization image, stage one (Polar2Normal) focuses on the fine-detailed human body surface normal estimation; stage two (Polar2Shape) then reconstructs clothed human shape from the polarization image and the estimated surface normal. To empirically validate our approach, a dedicated dataset (PHSPD) is constructed, consisting of over 500K frames with accurate pose and parametric shape annotations. Empirical evaluations on this real-world dataset as well as a synthetic dataset, SURREAL, demonstrate the effectiveness of our approach. The results suggest that the polarization camera is a promising alternative to the more conventional RGB camera for human pose and shape estimation.

Abstract:
Captured images of outdoor scenes usually exhibit low visibility in cases of severe haze, which interferes with optical imaging and degrades image quality. Most of the existing methods solve the single-image dehazing problem by applying supervised training on paired images; however, in practice, the pairing of real-world images is not viable. Additionally, the processing speed of individual dehazing models is important in practical applications. In this study, a novel unsupervised single image dehazing network (USID-Net) based on disentangled representations without paired training images is explored. Furthermore, considering the trade-off between performance and memory storage, a compact multi-scale feature attention (MFA) module is developed, integrating multi-scale feature representation and attention mechanism to facilitate feature representation. To effectively extract haze information, a mechanism referred to as OctEncoder is designed to include multi-frequency representations that can capture more global information. Extensive experiments show that USID-Net achieves competitive dehazing results and a relatively high processing speed compared to state-of-the-art methods. The source code is available at https://github.com/dehazing/USID-Net.

Abstract:
The weakly supervised Temporal Action Detection (TAD) by using the video-level annotations can lighten the burden of labor consumption. However, the current methods for weakly supervised TAD do not take full advantage of the short-term consistency between consecutive frames and the long-term continuity inside an action, resulting in less accurate detecting boundaries of actions in untrimmed videos. In this paper, the SuperFrame-based Temporal Proposal (SFTP) is proposed, in which superframes are formed for representing a series of consecutive frames with high temporal consistency and their features are pooled from the features of frames through the integration function. Then, the temporal proposal is built based on the multiple consecutive superframes and the features of all proposals are generated from a pyramidal feature hierarchy. This hierarchy consists of the designed Structured Outer-Inner Context (SOIC) features formed from superframe features and is able to explicitly characterize the temporal continuity inside a proposal. Furthermore, a novel Scale-Wise Normalization Strategy (SWNS) is proposed to identify proposals, which can effectively detect multiple actions with different duration in one untrimmed video. Extensive experiments are conducted on two public datasets: THUMOS14 and ActivityNet1.2 for performance evaluation. Our experimental results have demonstrated that the proposed approach is able to detect the boundaries of actions more effectively and obtain competitive mAP (mean average precision) compared with other approaches.

Abstract:
To deal with the challenges in video object detection (VOD), such as occlusion and motion blur, many state-of-the-art video object detectors adopt a feature aggregation module to encode the long-range contextual information to support the current frame. The main drawbacks of these detectors are threefold: first, the frame-wise detection slows down the detection speed; second, the frame-wise detection usually ignores the local continuity of the objects in a video, resulting in temporally inconsistent detection; third, the feature aggregation module usually encodes temporal features either from a local video clip or a single video, without exploiting the features in other videos. In this work, we develop an online VOD algorithm, aiming at a balance of high speed and high accuracy, by exploiting the global memory and local continuity. In the algorithm, an effective and efficient global memory bank (GMB) is designed to deposit and update object class features, which enables us to exploit the support features in other videos to enhance object features in the current video frames. Besides, to further speed up the detection, we design an object tracker to perform object detection for non-key frames based on the detection results of the key frame by leveraging the local continuity property of the video. Considering the trade-off between detection accuracy and speed, the proposed framework achieves superior performance on the ImageNet VID dataset. Source codes will be released to the public via our GitHub website.

Abstract:
Multisensory systems provide complementary information that aids many machine learning approaches in perceiving the environment comprehensively. These systems consist of heterogeneous modalities, which have disparate characteristics and feature distributions. Thus, extracting, aligning, and fusing complementary representations from heterogeneous modalities (e.g., visual, skeleton, and physical sensors) remains challenging. To address these challenges, we have used the insights from several neuroscience studies of animal multisensory systems to develop MAVEN, a memory-augmented recurrent approach for multimodal fusion. MAVEN generates unimodal memory banks comprised of spatial-temporal features and uses our proposed recurrent representation alignment approach to align and refine unimodal representations iteratively. MAVEN then utilizes a multimodal variational attention-based fusion approach to produce a robust multimodal representation from the aligned unimodal features. Our extensive experimental evaluations on three multimodal datasets suggest that MAVEN outperforms state-of-the-art multimodal learning approaches in the challenging human activity recognition task across all evaluation conditions (cross-subject, leave-one-subject-out, and cross-session). Additionally, our extensive ablation studies suggest that MAVEN significantly outperforms the feed-forward fusion-based learning models (p< 0.05). Finally, the robust performance of MAVEN in extracting complementary multimodal representation from occluded and noisy data suggests its applicability on real-world datasets.

Abstract:
The appearances of children are inherited from their parents, which makes it feasible to predict them. Predicting realistic children's faces may help settle many social problems, such as age-invariant face recognition, kinship verification, and missing child identification. It can be regarded as an image-to-image translation task. Existing approaches usually assume domain information in the image-to-image translation can be interpreted by “style”, i.e., the separation of image content and style. However, such separation is improper for the child face prediction, because the facial contours between children and parents are not the same. To address this issue, we propose a new disentangled learning strategy for children's face prediction. We assume that children's faces are determined by genetic factors (compact family features, e.g., face contour), external factors (facial attributes irrelevant to prediction, such as moustaches and glasses), and variety factors (individual properties for each child). On this basis, we formulate predictions as a mapping from parents’ genetic factors to children's genetic factors, and disentangle them from external and variety factors. In order to obtain accurate genetic factors and perform the mapping, we propose a ChildPredictor framework. It transfers human faces to genetic factors by encoders and back by generators. Then, it learns the relationship between the genetic factors of parents and children through a mapping function. To ensure the generated faces are realistic, we collect a large Family Face Database to train ChildPredictor and evaluate it on the FF-Database validation set. Experimental results demonstrate that ChildPredictor is superior to other well-known image-to-image translation methods in predicting realistic and diverse child faces. Implementation codes can be found at https://github.com/zhaoyuzhi/ChildPredictor.

Abstract:
Music generation is commonly treated as a note-by-note prediction problem. However, prediction models that generate one musical note at a time may ignore the overall coherence, because the music phrase is incomplete and cannot demonstrate musicality. To address this issue, in this study, we propose a feasible monophonic music generation framework that can simulate subsequent trends for each predicted musical note. The framework generates a musical note mainly in three steps: 1) a sequence prediction model is used to predict the most promising candidates, 2) the subsequent trends for each candidate are modeled and evaluated, and 3) the best candidate is selected as the final result. We use the Monte-Carlo tree search algorithm because of its great capability of discovering near-optimal results. We establish a method of training a value network that can assess musical coherence to evaluate the simulated sequences. Further, we use a smoothed polynomial upper confidence trees algorithm to improve the accuracy and efficiency of the search process. An accurately labeled dataset of our own, which contains 36 transcribed samples from real-world pop songs, was used to validate our framework. Compared with the note-by-note sequence prediction model, our framework exhibits a better sense of musicality. Our framework can be applied to generate symbolic monophonic music, particularly the main melody track in pop music.
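The three-step selection loop described above can be sketched as follows. This is a deliberately simplified stand-in: it uses flat Monte-Carlo rollouts instead of the tree search and smoothed polynomial upper confidence trees named in the abstract, and the candidate proposer, rollout policy, and value function are toy, hypothetical placeholders for the sequence prediction model and the trained value network.

```python
import random

def select_next_note(history, propose, simulate, value, k=3, n_rollouts=8, seed=0):
    """Pick the next note by simulating what could follow each candidate."""
    rng = random.Random(seed)
    best_note, best_score = None, float("-inf")
    for note in propose(history, k):                  # step 1: candidate notes
        scores = []
        for _ in range(n_rollouts):                   # step 2: simulate and evaluate
            continuation = simulate(history + [note], rng)
            scores.append(value(continuation))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:                   # step 3: keep the best candidate
            best_note, best_score = note, mean_score
    return best_note

def toy_propose(history, k):
    # Hypothetical candidates: MIDI pitches at fixed offsets from the last note.
    return [history[-1] + step for step in (9, 1, -8)][:k]

def toy_simulate(sequence, rng, length=4):
    # Hypothetical rollout policy: a random walk in small steps.
    seq = list(sequence)
    for _ in range(length):
        seq.append(seq[-1] + rng.choice([-2, -1, 1, 2]))
    return seq

def toy_value(sequence):
    # Hypothetical coherence score: smaller average interval is better.
    leaps = [abs(b - a) for a, b in zip(sequence, sequence[1:])]
    return -sum(leaps) / len(leaps)

next_note = select_next_note([60, 62], toy_propose, toy_simulate, toy_value)
```

With these toy components the stepwise candidate (62 + 1) always wins, since the large leaps of the other two candidates outweigh any variation in the random rollouts.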

Abstract:
Fine-grained emotion recognition can model the temporal dynamics of emotions, which is more precise than predicting one emotion retrospectively for an activity (e.g., video clip watching). Previous works require large amounts of continuously annotated data to train an accurate recognition model; however, experiments to collect such large amounts of continuously annotated physiological signals are costly and time-consuming. To overcome this challenge, we propose an Emotion recognition algorithm based on Deep Siamese Networks (EmoDSN) which can rapidly converge on a small amount of training data, typically less than 10 samples per class (i.e., <10 shot). EmoDSN recognizes fine-grained valence and arousal (V-A) labels by maximizing the distance metric between signal segments with different V-A labels. We tested EmoDSN on three different datasets collected in three different environments: desktop, mobile and HMD-based virtual reality, respectively. The results from our experiments show that EmoDSN achieves promising results for both one-dimensional binary (high/low V-A, 1D-2C) and two-dimensional 5-class (four quadrants of V-A space + neutral, 2D-5C) classification. We get an averaged accuracy of 76.04%, 76.62%, and 57.62% for 1D-2C valence, 1D-2C arousal, and 2D-5C, respectively, by using only 5 shots of training data. Our experiments show that EmoDSN can achieve better results if we select training samples from the changing points of emotion or the ending moments of video watching.
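The distance-based objective described above can be illustrated with a standard margin-based contrastive loss on a pair of embeddings. This is a generic Siamese-network loss assumed for illustration; the abstract does not specify the exact loss used by EmoDSN, and the function names and `margin` parameter are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same_label, margin=1.0):
    """Margin-based contrastive loss for one pair of segment embeddings.

    Pairs with the same V-A label are pulled together (loss = d^2), while
    pairs with different labels are pushed apart until their distance
    reaches `margin` (loss = max(0, margin - d)^2).
    """
    d = euclidean(emb_a, emb_b)
    if same_label:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Two segments with different labels that are already far apart incur no loss.
far_apart = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_label=False)
# The same two embeddings are penalized if they are supposed to share a label.
same_class = contrastive_loss([0.0, 0.0], [3.0, 4.0], same_label=True)
```

At inference time, a few-shot classifier built on such embeddings would typically assign a new segment the label of its nearest labeled support segment.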

Abstract:
Automatic Photo Selection (APS) is a fundamental and important task for further photo cropping and photo enhancement. As the images in a photo series normally have subtle differences, it remains challenging to surface the best photos among highly similar photos. In this work, we propose a Recursive Multi-Relational Graph Convolutional Network (RMGCN) for APS. Specifically, we explore and devise inner-relation and inter-relation graphs to learn informative representations in a hierarchical manner. 1) Patch-aware Intra Graph Module (PIGM) captures visual and spatial relations between different patches to characterize the representations in an image. 2) Context-aware Inter Graph Module (CIGM) explicitly exploits mutual comparative relation between different images in a photo series. These two graphs recursively refine each other by reasoning over the graph representations. Then, our model aggregates the output of CIGM with multi-scale local features via the proposed Cross-domain Fusing Gate (CFG) to boost the discriminative ability. Besides, we formulate four companion objectives as soft constraints to improve convergence rate during training. Extensive experiments are conducted on the photo-triage dataset, and superior results are reported on different metrics compared with the state-of-the-art methods. We also perform rigorous ablations and analysis to validate our approach.

Abstract:
Over the past years, semantic segmentation, similar to many other tasks in computer vision, has benefited from the progress in deep neural networks, resulting in significantly improved performance. However, deep architectures trained with gradient-based techniques suffer from catastrophic forgetting, which is the tendency to forget previously learned knowledge while learning new tasks. Aiming at devising strategies to counteract this effect, incremental learning approaches have gained popularity over the past years. However, the first incremental learning methods for semantic segmentation appeared only recently. While effective, these approaches do not account for a crucial aspect in pixel-level dense prediction problems, i.e., the role of attention mechanisms. To fill this gap, in this paper, we introduce a novel attentive feature distillation approach to mitigate catastrophic forgetting while accounting for semantic spatial- and channel-level dependencies. Furthermore, we propose a continual attentive fusion structure, which takes advantage of the attention learned from the new and the old tasks while learning features for the new task. Finally, we also introduce a novel strategy to account for the background class in the distillation loss, thus preventing biased predictions. We demonstrate the effectiveness of our approach with an extensive evaluation on Pascal-VOC 2012 and ADE20K, setting a new state of the art.

Abstract:
The state-of-the-art compression method for Light Detection And Ranging (LiDAR) point clouds is the geometry-based point cloud compression (G-PCC) standard developed by the Moving Picture Experts Group immersive media working group (MPEG-I). However, there are currently no rate control algorithms designed specifically for Geometry-based LiDAR point cloud compression (G-LPCC). In this paper, we propose the first frame-level rate control algorithm for G-LPCC. The proposed rate control algorithm makes three main contributions. First, we model the rate-distortion (R-D) relationship for both the geometry and attribute. As the geometry bitrate is mainly determined by the frame-level geometry quantizer Q_G, we propose a relationship between the geometry bitrate and Q_G. In addition, as the attribute bitrate can be influenced by both the attribute quantizer Q_A and Q_G, we build a relationship among the attribute bitrate, Q_G, and Q_A. Second, we propose a bit allocation algorithm between the geometry and attribute based on the R-D modeling. Q_G and Q_A are modeled into a proper relationship to obtain geometry and attribute bits that achieve good R-D performance. Third, we propose using the point density of LiDAR point clouds to estimate the geometry model parameters. The point density is calculated using the average distance between each point and its nearest neighbor after excluding some noisy points. The proposed rate control algorithm is implemented in the G-PCC reference software. The experimental results show that the proposed rate control algorithm can control the bitrate accurately with satisfactory R-D performance.
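
The point-density measure described above can be illustrated with a minimal sketch. The abstract does not say how noisy points are excluded, so the rule used here (drop points whose nearest neighbor is farther than a hypothetical threshold `max_nn_dist`) and all function names are assumptions, and the brute-force search is chosen only for brevity.

```python
import math

def nearest_neighbor_distances(points):
    """Brute-force distance from each 3D point to its nearest other point."""
    dists = []
    for i, p in enumerate(points):
        best = math.inf
        for j, q in enumerate(points):
            if i != j:
                best = min(best, math.dist(p, q))
        dists.append(best)
    return dists

def average_nn_distance(points, max_nn_dist):
    """Mean nearest-neighbor distance over the points that are kept.

    Assumption: a point counts as noise when its nearest neighbor is
    farther away than max_nn_dist (hypothetical rejection rule).
    """
    kept = [d for d in nearest_neighbor_distances(points) if d <= max_nn_dist]
    if not kept:
        raise ValueError("all points were rejected as noise")
    return sum(kept) / len(kept)

# Three clustered points and one isolated outlier.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (50.0, 50.0, 50.0)]
avg = average_nn_distance(points, max_nn_dist=5.0)
print(avg)
```

A smaller average distance indicates a denser cloud. For real LiDAR frames a spatial index such as a k-d tree would replace the quadratic loop.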

Abstract:
Although deep learning methods have drastically improved the performance on visual recognition tasks in which large inter-class variances exist, similar-class recognition continues to pose significant challenges, mainly due to the close resemblance between similar classes. The challenge is further compounded in the case of few-shot learning because only a very small amount of training data is available; accordingly, a certain performance degradation has been observed when some few-shot methods are applied for classification tasks. To address the aforementioned issue, we propose a novel Relation Separation Network (RSNet) in this paper, aiming to boost few-shot learning by improving similar-class recognition performance. We assume that image features consist of common and private features, where the common features capture the basic attributes shared among similar classes and their private counterparts capture the unique attributes of each class. Our RSNet learns to decouple the common and private features of an image. As a result, the feature representation of an image is composed of two weakly associated but easily aligned components, and better classification performance is achieved by giving more attention to subtle features. Experimental results on the publicly available datasets miniImageNet, CUB, and CIFAR-FS show that the proposed model outperforms existing state-of-the-art methods. Specifically, compared to PT+MAP, RSNet improves the accuracy of classification on the CUB dataset by approximately 5% and that of similar-class classification by more than 10%.

Abstract:
Video moment retrieval, i.e., localizing the specific video moments within a video given a description query, has attracted substantial attention over the past several years. Although great progress has been achieved thus far, most existing methods are supervised and require moment-level temporal annotation information. In contrast, weakly-supervised methods, which only need video-level annotations, remain largely unexplored. In this paper, we propose a novel end-to-end Siamese alignment network for weakly-supervised video moment retrieval. To be specific, we design a multi-scale Siamese module, which progressively reduces the semantic gap between the visual and textual modalities with the Siamese structure. In addition, we present a context-aware multiple instance learning module that considers the influence of adjacent contexts, enhancing the moment-query and video-query alignment simultaneously. By promoting matching at both the moment level and the video level, our model can effectively improve the retrieval performance, even with only weak video-level annotations. Extensive experiments on two benchmark datasets, i.e., ActivityNet-Captions and Charades-STA, verify the superiority of our model compared with several state-of-the-art baselines.

Abstract:
In online games, user profiling plays a vital role in a variety of personalized services. Current solutions typically treat different dimensions or labels (e.g., willing to pay or not, high, medium, or low appetite for some gameplays) of the full user profiles as independent multi-class/binary classification tasks. However, such a one-by-one profiling strategy clearly overlooks the implicit correlations among profiling tasks, which results in degraded performance. To cope with this issue, we make the first attempt to formalize this problem as a multi-label learning task. Accordingly, we develop a unified Multi-Source Multi-Label learning framework (MSML) that well utilizes semantically rich features and labels for boosted user profiling in online games. Specifically, we first introduce a multi-source user representation network that exploits multi-source data in online games to obtain informative user representations. Subsequently, to handle multiple labels, we propose a novel embedding-based multi-label network that consists of two variational autoencoders with disentangled latent spaces. Note that our framework can guarantee the consistency of the training and testing phases by a novel dual-tower design to overcome the limitation of existing approaches that use one coupled decoder for both features and labels. Extensive experiments on six public multi-label datasets and one real-world online game dataset from Justice demonstrate that the proposed framework outperforms the state-of-the-art baseline methods. Moreover, our proposed framework has been successfully deployed in several online games, yielding a significant boost in multi-label user profiling.

Abstract:
Salient object detection (SOD) in complex scenes and environments is a challenging research topic. Most works focus on RGB-based SOD, which limits performance in real-life applications when confronted with adverse conditions such as dark environments and complex backgrounds. Since the thermal infrared spectrum provides complementary information, RGBT SOD has become a new research direction. However, current research on RGBT SOD is limited by the lack of a large-scale dataset and comprehensive benchmark. This work contributes such an RGBT image dataset named VT5000, including 5000 spatially aligned RGBT image pairs with ground truth annotations. VT5000 has 11 challenges collected in different scenes and environments for exploring the robustness of algorithms. With this dataset, we propose a powerful baseline approach, which extracts multilevel features of each modality and aggregates these features of all modalities with the attention mechanism for accurate RGBT SOD. To further address the blurred boundaries of salient objects, we also use an edge loss to refine the boundaries. Extensive experiments show that the proposed baseline approach outperforms the state-of-the-art methods on the VT5000 dataset and two other public datasets. In addition, we carry out a comprehensive analysis of different RGBT SOD algorithms on the VT5000 dataset, draw several valuable conclusions, and provide some potential research directions for RGBT SOD.

Abstract:
Bottom-up text detection methods play an important role in arbitrary-shape scene text detection but there are two restrictions preventing them from achieving their great potential, i.e., 1) the accumulation of false text segment detections, which affects subsequent processing, and 2) the difficulty of building reliable connections between text segments. Targeting these two problems, we propose a novel approach, named “MorphText,” to capture the regularity of texts by embedding deep morphology for arbitrary-shape text detection. Towards this end, two deep morphological modules are designed to regularize text segments and determine the linkage between them. First, a Deep Morphological Opening (DMOP) module is constructed to remove false text segment detections generated in the feature extraction process. Then, a Deep Morphological Closing (DMCL) module is proposed to allow text instances of various shapes to stretch their morphology along their most significant orientation while deriving their connections. Extensive experiments conducted on four challenging benchmark datasets (CTW1500, Total-Text, MSRA-TD500 and ICDAR2017) demonstrate that our proposed MorphText outperforms both top-down and bottom-up state-of-the-art arbitrary-shape scene text detection approaches.

Abstract:
Few-shot semantic segmentation aims to segment novel-class objects in a given query image with only a few labeled support images. Most advanced solutions exploit a metric learning framework that performs segmentation through matching each query feature to a learned class-specific prototype. However, this framework suffers from biased classification due to incomplete feature comparisons. To address this issue, we present an adaptive prototype representation by introducing class-specific and class-agnostic prototypes and thus construct complete sample pairs for learning semantic alignment with query features. The complementary features learning manner effectively enriches feature comparison and helps yield an unbiased segmentation model in the few-shot setting. It is implemented with a two-branch end-to-end network (i.e., a class-specific branch and a class-agnostic branch), which generates prototypes and then combines query features to perform comparisons. In addition, the proposed class-agnostic branch is simple yet effective. In practice, it can adaptively generate multiple class-agnostic prototypes for query images and learn feature alignment in a self-contrastive manner. Extensive experiments on PASCAL-5^i and COCO-20^i demonstrate the superiority of our method. At no expense of inference efficiency, our model achieves state-of-the-art results in both 1-shot and 5-shot settings for semantic segmentation.

Abstract:
Since deep convolutional neural networks (CNNs) have achieved excellent results in single image super-resolution (SISR), an increasing number of CNN-based methods have been proposed. Most CNN-based methods are devoted to finding a mapping based on pixel intensity while ignoring the importance of frequency information, which can reflect semantic information of images on different bands. This leads to less effective reconstruction of high-frequency details. To address this problem, we propose a novel CNN-based super-resolution method named joint wavelet sub-bands guided network (JWSGN). We separate the different frequency information of the image by the wavelet transform (WT) and then recover this information by a multi-branch network. To recover finer edge details, we propose an edge extraction module, which estimates an edge feature map by using the similarity of all high-frequency sub-bands and then corrects the high-frequency features recovered from each branch by exploiting the edge feature map. Furthermore, we use the complementary relationship between different frequencies to calibrate the high-frequency sub-bands. Finally, the high-resolution image is obtained by the inverse wavelet transform. Both qualitative and quantitative experiments show that our method achieves excellent performance with the guidance of the edge extraction module.
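
To make the sub-band separation concrete, here is a minimal sketch of a one-level 2D wavelet decomposition and its inverse. The abstract does not name the wavelet, so the Haar wavelet is an assumption, as are the function names `haar_dwt2` and `haar_idwt2`; the learned multi-branch network is not modeled.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of an image with even height and width.

    Returns four sub-bands, each half the input size: one low-frequency
    band (ll) and three high-frequency detail bands (lh, hl, hh).
    """
    h, w = len(img), len(img[0])
    ll, lh, hl, hh = [[[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)]
    for i in range(h // 2):
        for j in range(w // 2):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2.0  # local average
            lh[i][j] = (a + b - c - d) / 2.0  # top row minus bottom row
            hl[i][j] = (a - b + c - d) / 2.0  # left column minus right column
            hh[i][j] = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution image."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, v, z, d = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            img[2 * i][2 * j] = (s + v + z + d) / 2.0
            img[2 * i][2 * j + 1] = (s + v - z - d) / 2.0
            img[2 * i + 1][2 * j] = (s - v + z - d) / 2.0
            img[2 * i + 1][2 * j + 1] = (s - v - z + d) / 2.0
    return img

image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
bands = haar_dwt2(image)
restored = haar_idwt2(*bands)
```

Because the transform is invertible, any correction applied to the high-frequency bands carries through to the reconstructed image.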

Abstract:
Photo retouching aims at improving the aesthetic visual quality of images that suffer from photographic defects, especially poor contrast, over/under exposure, and inharmonious saturation. In practice, photo retouching can be accomplished by a series of image processing operations. As most commonly-used retouching operations are pixel-independent, i.e., the manipulation on one pixel is uncorrelated with its neighboring pixels, we can take advantage of this property and design a specialized algorithm for efficient global photo retouching. We analyze these global operations and find that they can be mathematically formulated by a Multi-Layer Perceptron (MLP). Based on this observation, we propose an extremely lightweight framework, the Conditional Sequential Retouching Network (CSRNet). Benefiting from the utilization of 1×1 convolution, CSRNet contains fewer than 37K trainable parameters, orders of magnitude fewer than existing learning-based methods. Experiments show that our method achieves state-of-the-art performance on the benchmark MIT-Adobe FiveK dataset quantitatively and qualitatively. In addition to achieving global photo retouching, the proposed framework can be easily extended to learn local enhancement effects. The extended model, namely CSRNet-L, also achieves competitive results in various local enhancement tasks.
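
The link between pixel-independent operations and 1×1 convolution can be sketched as follows. This is not CSRNet itself: the function `conv1x1` and the brightness example (gain 1.5, offset 0.1) are illustrative assumptions that only show how one global adjustment becomes the same affine map applied to every pixel.

```python
def conv1x1(image, weight, bias):
    """Apply the same affine map to every pixel: out = W @ pixel + b.

    image:  H x W nested list of per-pixel channel lists
    weight: C_out x C_in matrix, bias: C_out vector
    """
    return [[[sum(w * x for w, x in zip(w_row, pixel)) + b
              for w_row, b in zip(weight, bias)]
             for pixel in row]
            for row in image]

# Hypothetical global brightness adjustment on an RGB image.
gain, offset = 1.5, 0.1
weight = [[gain, 0.0, 0.0], [0.0, gain, 0.0], [0.0, 0.0, gain]]
bias = [offset, offset, offset]

image = [[[0.2, 0.4, 0.6], [0.0, 0.5, 1.0]]]
retouched = conv1x1(image, weight, bias)

# Changing one pixel leaves the output of its neighbor untouched.
edited = [[[0.2, 0.4, 0.6], [0.9, 0.9, 0.9]]]
retouched_edited = conv1x1(edited, weight, bias)
```

Stacking several such layers with a nonlinearity in between gives a per-pixel MLP, which is the sense in which global retouching operations can be expressed by 1×1 convolutions.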

Abstract:
One-stage space-time video super-resolution (STVSR) aims to directly reconstruct high-resolution (HR) and high frame rate (HFR) video from its low-resolution (LR) and low frame rate (LFR) counterpart. Due to the wide application, one-stage STVSR has drawn much attention recently. However, existing one-stage methods suffer from ineffective exploration of the auxiliary information from adjacent time steps that may be useful to STVSR at the current time step. To address this issue, we propose a novel Bidirectional Recurrent Space-Time Upsampling network called Bi-RSTU for one-stage STVSR to utilize auxiliary information at various time steps. Specifically, an efficient channel attention feature interpolation (ECAFI) module is devised to synthesize the intermediate frame’s LR feature by exploiting its two neighboring LR video frame features. Subsequently, we fuse the information from the previous time step into these intermediate and neighboring features. Finally, second-order attention spindle (SOAS) blocks are stacked to form the feature reconstruction module that learns a mapping from LR fused feature space to HR feature space. Experimental results on public datasets demonstrate that our Bi-RSTU shows competitive performance compared with current two-stage and one-stage state-of-the-art STVSR methods.

Abstract:
Person re-identification (Re-ID) has achieved great success in the supervised scenario. However, it is difficult to directly transfer the supervised model to arbitrary unseen domains because the model overfits to the seen source domains. In this paper, we aim to tackle the generalizable multi-source person Re-ID task (i.e., there are multiple available source domains, and the testing domain is unseen during training) from the data augmentation perspective, and put forward a novel method, termed MixNorm. It consists of domain-aware mix-normalization (DMN) and domain-aware center regularization (DCR). Different from conventional data augmentation, the proposed domain-aware mix-normalization enhances the diversity of features during training from the normalization perspective of the neural network, which can effectively alleviate overfitting to the source domains and thus boost the generalization capability of the model in the unseen domain. To further promote the efficacy of the proposed DMN, we exploit the domain-aware center regularization to better map the diversely generated features into the same space. Extensive experiments on multiple benchmark datasets validate the effectiveness of the proposed method and show that it can outperform the state-of-the-art methods. Besides, further analysis also reveals the superiority of the proposed method.

Abstract:
Arbitrary-shaped scene text detection is a challenging task due to the variety of text changes in font, size, color, and orientation. Most existing regression-based methods resort to regressing the masks or contour points of text regions to model the text instances. However, regressing the complete masks requires high training complexity, and contour points are not sufficient to capture the details of highly curved texts. To tackle the above limitations, we propose a novel light-weight anchor-free text detection framework called TextDCT, which adopts the discrete cosine transform (DCT) to encode the text masks as compact vectors. Further, considering the imbalanced number of training samples among pyramid layers, we only employ a single-level head for top-down prediction. To model the multi-scale texts in a single-level head, we introduce a novel positive sampling strategy by treating the shrunk text region as positive samples, and design a feature awareness module (FAM) for spatial-awareness and scale-awareness by fusing rich contextual information and focusing on more significant features. Moreover, we propose a segmented non-maximum suppression (S-NMS) method that can filter low-quality mask regressions. Extensive experiments are conducted on four challenging datasets, which demonstrate that our TextDCT obtains competitive performance in both accuracy and efficiency. Specifically, TextDCT achieves an F-measure of 85.1 at 17.2 frames per second (FPS) and an F-measure of 84.9 at 15.1 FPS on the CTW1500 and Total-Text datasets, respectively.
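
The idea of encoding a mask as a compact DCT vector can be sketched as below. The abstract does not state which coefficients are kept or how masks are resized, so the choice here (an orthonormal DCT-II on a square mask, keeping the top-left `keep` x `keep` low-frequency block) and all function names are assumptions.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis: row k holds the k-th cosine basis vector."""
    return [[math.sqrt((1 if k == 0 else 2) / n)
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)]
            for k in range(n)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def encode_mask(mask, keep):
    """2D DCT of a square mask, flattened low-frequency block as the code."""
    d = dct_matrix(len(mask))
    coeffs = matmul(matmul(d, mask), transpose(d))
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]

def decode_mask(vector, keep, n):
    """Zero-fill the dropped coefficients, invert the DCT, threshold at 0.5."""
    coeffs = [[0.0] * n for _ in range(n)]
    for idx, value in enumerate(vector):
        coeffs[idx // keep][idx % keep] = value
    d = dct_matrix(n)
    recon = matmul(matmul(transpose(d), coeffs), d)
    return [[1 if v >= 0.5 else 0 for v in row] for row in recon]

# An 8 x 8 mask with a 4 x 4 foreground square in the middle.
mask = [[1 if 2 <= r <= 5 and 2 <= c <= 5 else 0 for c in range(8)] for r in range(8)]
compact = encode_mask(mask, keep=4)        # 16 numbers instead of 64
approx = decode_mask(compact, keep=4, n=8) # lossy reconstruction
lossless = decode_mask(encode_mask(mask, keep=8), keep=8, n=8)
```

Keeping all coefficients reproduces the mask exactly, while a smaller `keep` trades boundary detail for a shorter regression target.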

Abstract:
Multimedia has achieved a dominant position in both local storage and internet bandwidth, which inevitably promotes the compression of audio, image and video information. Nowadays, the emerging haptic technology, which enhances the immersion in virtual reality and remote control, has also brought new challenges in its codec design. It is thus imperative to develop haptic codecs, including kinesthetic and vibrotactile codecs, with high efficiency and low delay. In this paper, we exploit statistical features of vibrotactile data to develop a Recurrent-Network-based Vibrotactile Codec (RNVC) with high compression efficiency and low coding delay. The proposed encoder consists of vibrotactile estimation by a Gated Recurrent Unit (GRU), non-uniform quantization/compensation of residuals and an entropy encoder. In particular, the GRU-based recurrent network is utilized for its high efficiency in predicting signals and its low complexity to converge. The decoder consists of the counterparts of all encoder components. Experimental results show that the proposed RNVC significantly reduces the original bitrates with negligible encoding delay, achieving state-of-the-art coding performance for vibrotactile signals.

Abstract:
We tackle the challenging task of few-shot segmentation in this work. It is essential for few-shot semantic segmentation to fully utilize the support information. Previous methods typically adopt masked average pooling over the support feature to extract the support clues as a global vector, which is usually dominated by the salient part and loses certain essential clues. In this work, we argue that every support pixel's information should be transferred to all query pixels and propose a Correspondence Matching Network (CMNet) with an Optimal Transport Matching module to mine the correspondence between the query and support images. Besides, it is critical to fully utilize both local and global information from the annotated support images. To this end, we propose a Message Flow module to propagate the message along the inner-flow inside the same image and the cross-flow between support and query images, which greatly helps enhance the local feature representations. Experiments on PASCAL VOC 2012, MS COCO, and FSS-1000 datasets show that our network achieves new state-of-the-art few-shot segmentation performance.
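
For reference, the masked average pooling baseline mentioned above can be sketched in a few lines. This shows the conventional operation that the abstract argues against, not CMNet; the function name and the toy feature map are illustrative assumptions.

```python
def masked_average_pooling(features, mask):
    """Average each feature channel over the foreground pixels of a mask.

    features: C x H x W nested lists (support feature map)
    mask:     H x W nested lists with 1 for foreground, 0 for background
    Returns a single C-dimensional vector (the global support prototype).
    """
    area = sum(sum(row) for row in mask)
    if area == 0:
        raise ValueError("mask has no foreground pixels")
    return [sum(f * m
                for f_row, m_row in zip(channel, mask)
                for f, m in zip(f_row, m_row)) / area
            for channel in features]

features = [
    [[1.0, 2.0], [3.0, 4.0]],      # channel 0
    [[10.0, 20.0], [30.0, 40.0]],  # channel 1
]
mask = [[1, 0], [1, 0]]            # left column is foreground
prototype = masked_average_pooling(features, mask)
print(prototype)
```

The whole foreground collapses into one vector per image, so the spatial layout of the support object is discarded, which is the information loss the abstract refers to.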

Abstract:
Bullet-time videos have been widely used in movies, TV advertisements, and computer games, and can produce an immersive and smooth orbital free-viewpoint of frozen action. However, existing bullet-time video synthesis methods remain difficult to apply in practice, especially in complex situations with poor camera calibration and a variety of camera array structures. This paper proposes a novel bullet-time video synthesis method based on a virtual dynamic target axis. We adopt an image similarity transformation strategy to eliminate image distortion in the bullet-time video. We use a high-order polynomial curve fitting strategy to preserve more bullet-time video frame content. The proposed dynamic target axis strategy can support various camera array structures, including camera arrays with and without a common field of view. In addition, this strategy can also tolerate, to some extent, poor camera calibration with unevenly distributed reprojection errors and synthesize smooth bullet-time videos without high-precision camera calibration. Qualitative and quantitative experiments in real environments and on simulation platforms demonstrate the high performance of our bullet-time video synthesis method. Compared with the state-of-the-art methods, the proposed method shows superiority.

Abstract:
This work addresses the task of action recognition in video sequences. In real-world applications, this task is quite challenging due to the complex background of video content, the similarities between different types of actions, the dependence on a large amount of annotated data, and so on. Most of the existing methods fail to distinguish similar actions with the same static appearance and motion pattern. We attempt to address this issue from the perspective of a local-global view, considering videos as combinations of a set of action units (local semantic information) and their relations along the temporal dimension (global relation information). To this end, we propose a novel Local-global Network (LgNet) to enhance the recognition of similar actions. Besides, we propose an end-to-end training method to decrease the reliance on annotated data. It combines self-supervised learning and supervised learning, which not only enables the model to learn video representations from a large amount of unannotated data but also avoids subsequent finetuning. The proposed training method can be flexibly applied to a wide array of vision tasks. Experiments on several benchmark datasets show that our proposed model and training method achieve state-of-the-art performance.

Abstract:
Cross-Modal Zero-Shot Hashing (CMZSH) is an important image retrieval technique, e.g., Text Based Image Retrieval. Most existing CMZSH methods mainly use semantic attributes as guidance to generate hash codes for both the images and texts of seen and unseen categories. However, existing CMZSH methods only focus on learning global attribute vectors and hash codes for images, which mixes up information of complex semantics and background clutter, and thus impedes the retrieval performance. To solve this issue, we propose an Attribute-Guided Multiple Instance Hashing (AG-MIH) network for CMZSH, where each instance represents one image region. Instead of generating global image hash codes, AG-MIH can effectively learn instance-level hash codes based on instance attributes. To improve the attribute learning for instances, AG-MIH can exploit a novel 2-D Category-Attribute Relation (CAR) layer, which uses different matching templates to model the relationships between each instance and the attributes for different categories. Under the guidance of semantic attributes, AG-MIH can effectively learn hash codes for each visual instance and texts by a Multi-stream Instance Hashing Refinement (MIHR) procedure. In the MIHR, the pseudo supervisions for the instance-level attributes and hash codes in each stream are from its preceding stream. Empirical studies on benchmark datasets show that AG-MIH achieves state-of-the-art performance on both cross-modal and single-modal zero-shot image retrieval tasks.

Abstract:
Scene Graph Generation (SGG) aims to abstract the objects and their semantic relationships within a given image. Current SGG performance is mainly limited by the biased predicate prediction caused by the long-tailed data distribution. Though many unbiased SGG methods have emerged to enhance the prediction of the tail predicates, their improvements on the tail predicates are often accompanied by deterioration on the head ones, leaving the prediction overly debiased. To address this, we propose a Dual-Biased Predicate Predictor (DBiased-P) to boost the unbiased SGG, which comprises a re-weighted primary classifier and an unweighted auxiliary classifier. The former classifier is tail-biased and used for the final predicate prediction, while the latter one is head-biased and designed to boost the head predicate prediction of the primary classifier by a head-oriented soft regularization. Experiments conducted on Visual Genome and Open Image datasets indicate the superiority of our DBiased-P in unbiased SGG, which significantly improves the recall@50 of the state-of-the-art unbiased SGG method DT2-ACBS from 23.3% to 55.5% as well as the mean recall@50 from 35.9% to 37.7%.

Abstract:
Generalized Zero-Shot Learning (GZSL) aims to recognize images not only for seen classes but also for unseen ones by transferring semantic-visual relationships from the seen to the unseen classes. It is an intuitive solution to take advantage of generative models to hallucinate realistic unseen samples based on the knowledge learned from the seen classes. However, due to the generation shifts, the samples synthesized by most existing methods may drift from the real distribution of the unseen data. To address this issue, we propose a novel flow-based generative framework that consists of multiple conditional affine coupling layers for learning unseen data generation. Specifically, we investigate and address three essential problems that trigger the generation shifts, i.e., semantic inconsistency, variance collapse, and structure disorder. First, to improve the reflection of the semantic information in the generated samples, we explicitly embed the semantic information into the transformation in each conditional affine coupling layer. Second, to promote the intrinsic feature variance of the unseen classes, we introduce a boundary sample mining strategy with entropy maximization to discover ambiguous visual variants of semantic prototypes and thereby calibrate the decision boundary of the classifiers. Third, a relative positioning strategy is proposed to revise the attribute embeddings, guiding them to fully preserve the inter-class geometric structure and further avoid structure disorder in the semantic space. Extensive experimental results on four GZSL benchmark datasets demonstrate that GSMFlow achieves the state-of-the-art performance on GZSL.
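
A conditional affine coupling layer, the building block named above, can be sketched as follows. This is a generic textbook form, not the authors' implementation: `scale_and_shift` is a hypothetical toy stand-in for the learned networks, and feeding the condition vector `cond` into it is an assumption about how semantic information is embedded.

```python
import math

def scale_and_shift(x1, cond, out_dim):
    """Toy stand-in for learned networks s(x1, cond) and t(x1, cond).

    Any deterministic function works here: invertibility of the layer
    does not depend on this function being invertible.
    """
    ctx = sum(x1) + sum(cond)
    s = [math.tanh(0.5 * ctx + k) for k in range(out_dim)]
    t = [0.1 * ctx - k for k in range(out_dim)]
    return s, t

def coupling_forward(x, cond):
    """y1 = x1, y2 = x2 * exp(s) + t; also returns log|det Jacobian|."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = scale_and_shift(x1, cond, len(x2))
    y2 = [v * math.exp(si) + ti for v, si, ti in zip(x2, s, t)]
    return x1 + y2, sum(s)

def coupling_inverse(y, cond):
    """Recover x exactly: x2 = (y2 - t) * exp(-s), with s, t from y1 = x1."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = scale_and_shift(y1, cond, len(y2))
    x2 = [(v - ti) * math.exp(-si) for v, si, ti in zip(y2, s, t)]
    return y1 + x2

x = [0.3, -1.2, 0.8, 2.0]
cond = [1.0, 0.0]  # hypothetical class-attribute vector
y, log_det = coupling_forward(x, cond)
x_rec = coupling_inverse(y, cond)
```

The first half of the input passes through unchanged, which is what lets the inverse recompute the same scale and shift and makes the log-determinant a simple sum.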

Abstract:
Underexposed images inevitably suffer severe degradation due to light distortion and noise corruption. Motivated by the limited samples of paired datasets, several unsupervised enhancement methods have been developed. However, these techniques heavily rely on pre-defined fixed lightness and noise removal constraints. Correspondingly, they cannot match the image-specific lightness when performing enhancement and can only refine details in a non-perceptual way. In this paper, we propose an Unsupervised Underexposed Image Enhancement Network (U2IENet) with self-illuminated and perceptual guidance. Specifically, to adjust the illumination for matching the image-specific lightness adaptively, we utilize the bright area of the underexposed image as the self-illuminated guidance to constrain the training process and modulate the features. Meanwhile, we introduce the perceptual guidance as a constraint to remove the noise based on illumination distribution, thus refining the details perceptually. Experiments on both underexposed datasets and public low-light datasets demonstrate the superiority of the proposed approach with higher flexibility over state-of-the-art solutions. In addition, our U2IENet also provides a side function that enables users to adjust the lightness via interactive tuning of a single parameter.

Abstract:
Self-supervised learning has considerably improved video representation learning by discovering supervisory signals automatically from unlabeled videos. However, due to the scene-biased nature of existing video datasets, the current methods are biased to the dominant scene context during action inference. Hence, this paper proposes Background Patching (BP), a scene-debiasing augmentation strategy to alleviate the model's reliance on the video background in a self-supervised contrastive manner. BP reduces the negative influence of the video background by mixing a randomly patched frame into the video background. BP randomly crops four frames from four different videos and patches them to construct a new frame for each video separately. The patched frame is mixed with all frames of the target video to produce a spatially distorted video sample. Then, we use existing self-supervised contrastive frameworks to pull representations of the distorted and original videos closer together. Moreover, BP mixes the semantic labels of patches with the target video's label, which regularizes the contrastive model and softens the decision boundaries in the embedding space. Therefore, the model is explicitly constrained to suppress the background influence by placing more emphasis on motion changes. The extensive experimental results show that our BP significantly improves the performance of various video understanding downstream tasks including action recognition, action detection, and video retrieval.
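
The patch-and-mix steps described above can be sketched as follows. Several details are assumptions because the abstract does not specify them: the four crops are tiled in a 2 x 2 grid, the mix is a linear blend with a hypothetical weight `lam`, and frames are single-channel nested lists with even height and width.

```python
import random

def random_crop(frame, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randrange(len(frame) - crop_h + 1)
    left = rng.randrange(len(frame[0]) - crop_w + 1)
    return [row[left:left + crop_w] for row in frame[top:top + crop_h]]

def patch_frame(four_frames, height, width, rng):
    """Tile one random crop from each of four frames into a 2 x 2 grid."""
    ch, cw = height // 2, width // 2
    tl, tr, bl, br = [random_crop(f, ch, cw, rng) for f in four_frames]
    top = [a + b for a, b in zip(tl, tr)]
    bottom = [a + b for a, b in zip(bl, br)]
    return top + bottom

def mix_video(video, patched, lam):
    """Blend the same patched frame into every frame of the target video."""
    return [[[(1 - lam) * p + lam * q for p, q in zip(f_row, p_row)]
             for f_row, p_row in zip(frame, patched)]
            for frame in video]

rng = random.Random(0)
size = 4
# One frame from each of four other videos (constant values for clarity).
others = [[[float(v)] * size for _ in range(size)] for v in (1, 2, 3, 4)]
# Target video: two all-zero frames.
target = [[[0.0] * size for _ in range(size)] for _ in range(2)]

patched = patch_frame(others, size, size, rng)
distorted = mix_video(target, patched, lam=0.25)
```

Because the same patched frame is added to every frame of the target clip, the static background is perturbed while frame-to-frame motion differences are left intact.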

Abstract:
Existing multi-focus image fusion (MFIF) methods struggle to achieve satisfactory results in both fusion performance and rate simultaneously. Spatial domain methods have difficulty determining the focus/defocus boundary (FDB), and transform domain methods are likely to damage the content information of the source images. Moreover, deep learning-based MFIF methods usually suffer from a low rate due to complex models and enormous numbers of learnable parameters. To address these issues, we propose a multi-domain lightweight network (MLNet) for MFIF, which can achieve competitive results in both performance and rate. The proposed MLNet mainly includes three modules, namely focus extraction (FE), focus measure (FM) and image fusion (IF). In the interpretable FE module, the image features extracted by discrete cosine transform-based convolution (DCTConv) and local binary pattern-based convolution (LBPConv) are concatenated and fed into the FM module. DCTConv, based on the transform domain, takes DCT coefficients to construct a fixed convolution kernel without parameter learning, which can effectively capture the high/low frequency content of the image. LBPConv, based on the spatial domain, can obtain structure features and gradient information from source images. In the FM module, a 3-layer 1 × 1 convolution with a few learnable parameters is employed to generate the initial decision map, which supports flexible input. The fused image is obtained by the IF module according to the final decision map. In terms of quantitative and qualitative evaluations, extensive experiments validate that the proposed method outperforms existing state-of-the-art methods on three public datasets. In addition, the proposed MLNet contains only 0.01M parameters, which is 0.2% of the first CNN-based MFIF method [25].

Abstract:
Low-light image enhancement is an important task in the domain of computer vision. Images taken under insufficient lighting conditions exhibit low visibility and unknown noise, which disrupt image contents and pose considerable challenges for low-light image enhancement. Most Retinex-based methods attempt to design different priors on the gradients of both illumination and reflectance. However, noise is often left unhandled in Retinex-based models. To address this problem, we explore low-light image restoration through joint contrast enhancement and denoising. We propose a Retinex-based variational model for low-light image enhancement that effectively generates a noise-free image and generalizes well to diverse lighting conditions. First, we present a simple constraint on the fidelity term between the fractional derivative of an observed image and the fractional derivative of the recomposed one, which is the product of the reflectance and illumination. This strategy aims to model spatial consistency to preserve natural variation. Second, we introduce a weighted regularization term for the reflectance that can remove noise with an adaptive texture map. We evaluate our proposed approach using three challenging datasets: NPE, LOL and GladNet. Extensive experiments demonstrate that our proposed method outperforms other competing methods in terms of visual quality and quantitative comparisons.

Abstract:
Semantic context has attracted considerable attention in semantic segmentation. In most cases, it is applied to guide feature learning. Instead, this paper applies it to extract the semantic representation, which records the global feature information of each category with a memory tensor. Specifically, we propose a novel semantic representation (SR) module, which consists of semantic embedding (SE) and semantic attention (SA) blocks. The SE block adaptively embeds features into the semantic representation by calculating the memory similarity, and the SA block aggregates the embedded features with semantic attention. The main advantages of the SR module lie in three aspects: i) it enhances the representation ability of semantic context by employing global (cross-image) semantic information; ii) it improves the consistency of intraclass features by aggregating global features of the same categories; and iii) it can be extended to build a semantic representation refinement network (SRRNet) by iteratively applying the SR module across multiple scales, shrinking the semantic gap and enhancing the structural reasoning of the model. Extensive experiments demonstrate that our method significantly improves the segmentation results and achieves superior performance on the PASCAL VOC 2012, Cityscapes, and PASCAL Context datasets.

Abstract:
People are accustomed to capturing images with mobile phones and uploading them to the cloud because of advantages such as saving storage space on the device. In this context, privacy concerns arise since images may contain sensitive information. Encrypting images with traditional schemes can alleviate privacy leakage, but usability is often lost since the encrypted images are difficult to browse on the cloud. Recently, Tajik et al. proposed an ideal thumbnail-preserving encryption (TPE) scheme that balances usability and privacy by leveraging a two-pixel substitution encryption method, with a Markov chain used to prove its security. However, the connectivity of the Markov chain in this scheme is weak since the length of the pixel groups in the encryption process is only two, leading to a long mixing time before the Markov chain reaches its stationary distribution. To this end, we first propose a multi-pixel sum-preserving encryption (MP-SPE) method that realizes the encipherment of vectors of arbitrary length. Then, with the help of MP-SPE, a novel flexible ideal TPE scheme (F-TPE) is designed and the connectivity of the Markov chain is improved. Experiments demonstrate that the proposed scheme effectively attains the balance between usability and privacy. In addition, F-TPE takes much less time for encryption than the existing work.

Abstract:
Point cloud completion is an interesting and challenging task in 3D vision, which aims to recover complete shapes from sparse and incomplete point clouds. Existing completion networks often require a vast number of parameters and substantial computational costs to achieve a high performance level, which may limit their practical application. In this work, we propose a novel Adaptive efficient Recurrent Forward Network (ARFNet), which is composed of three parts: Recurrent Feature Extraction (RFE), Forward Dense Completion (FDC) and Raw Shape Protection (RSP). In an RFE, multiple short global features are extracted from incomplete point clouds, while a dense quantity of completed results are generated in a coarse-to-fine pipeline in the FDC. Finally, we propose the Adamerge module to preserve the details from the original models by merging the generated results with the original incomplete point clouds in the RSP. In addition, we introduce the Sampling Chamfer Distance to better capture the shapes of the models and the balanced expansion constraint to restrict the expansion distances from coarse to fine. According to the experiments on ShapeNet and KITTI, our network can achieve state-of-the-art completion performances on dense point clouds with fewer parameters, smaller model sizes, lower memory costs and a faster convergence.
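
For context, the standard symmetric Chamfer distance that completion networks commonly use to compare a predicted point cloud with a reference is sketched below. This is the ordinary squared-distance variant, not the Sampling Chamfer Distance proposed in the paper, whose definition is not given in the abstract; the function name is illustrative.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Uses squared Euclidean distances and averages over each set.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Nearest neighbour in b for every point of a, and vice versa.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

The brute-force (N, M) distance matrix is fine for a few thousand points; dense clouds usually call for a spatial index or a GPU implementation.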

Abstract:
Style transfer is a useful image synthesis technique that can re-render a given image in another artistic style while preserving its content information. The Generative Adversarial Network (GAN) is a widely adopted framework for this task because it represents local style patterns better than traditional Gram-matrix-based methods. However, most previous methods rely on a sufficient number of pre-collected style images to train the model. In this paper, a novel Patch Permutation GAN (P^2-GAN) network that can efficiently learn the stroke style from a single style image is proposed. We use patch permutation to generate multiple training samples from the given style image. A patch discriminator that can seamlessly process both patch-wise images and natural images is designed. We also propose a criterion based on a local texture descriptor to quantitatively evaluate the style transfer quality. Experimental results show that our method can produce finer-quality re-renderings from a single style image with improved computational efficiency compared with many state-of-the-art methods.
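
One plausible reading of patch permutation is sketched below: the style image is cut into a grid of equal patches, the patches are shuffled, and the result is reassembled into a new training sample. The non-overlapping grid and the function name are assumptions; the abstract does not state how patches are sampled or arranged.

```python
import numpy as np

def permute_patches(image, patch, rng=None):
    """Shuffle non-overlapping patch x patch tiles of an (H, W, C) image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    img = image[:gh * patch, :gw * patch]  # drop any ragged border
    # (gh, patch, gw, patch, C) -> (gh * gw, patch, patch, C)
    tiles = img.reshape(gh, patch, gw, patch, c).swapaxes(1, 2)
    tiles = tiles.reshape(gh * gw, patch, patch, c)
    tiles = tiles[rng.permutation(gh * gw)]
    # Reassemble the shuffled tiles into a single image.
    out = tiles.reshape(gh, gw, patch, patch, c).swapaxes(1, 2)
    return out.reshape(gh * patch, gw * patch, c)
```

Calling the function repeatedly with different random states yields many distinct samples that all share the local stroke statistics of the single style image.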

Abstract:
In this paper, we tackle the task of domain adaptation under noisy environments; this is a practical and challenging problem in which the source domain is corrupted with noise in its labels, its features, or both. Noise in the source domain leads to inaccurate visual representations and makes it harder to estimate and reduce the domain discrepancy between the source and target domains, resulting in severe performance degradation in the target domain. These challenges can be addressed with offline source sample selection following robust domain discrepancy reduction. To achieve reliable sample selection, we model the uncertainty in the predictions of a convolutional neural network (CNN) classifier and reweight the classification loss by this uncertainty. Such a reweighting mechanism reduces the contribution of noise, leading to improved noise robustness. We further propose UncertaintyRank, a novel regularizer, to encourage the uncertainty to be more sensitive to noisy labels, as label corruption brings more severe degradation. The uncertainty is also aggregated with the classification loss to eliminate the adverse effects of noisy representations while estimating the domain discrepancy. Extensive experiments validate the effectiveness of our method and verify that it performs favorably against existing state-of-the-art methods.

Abstract:
Manual annotation of large-scale point clouds is time-consuming and is usually unavailable in harsh real-world scenarios. Inspired by the great success of the pre-training and fine-tuning paradigm in both vision and language tasks, we argue that pre-training is one potential solution for obtaining a scalable model for 3D point cloud downstream tasks as well. In this paper, we therefore explore a new self-supervised learning method, called Mixing and Disentangling (MD), for 3D point cloud representation learning. As the name implies, we mix two input shapes and require the model to learn to separate the inputs from the mixed shape. We leverage this reconstruction task as the pretext optimization objective for self-supervised learning. There are two primary advantages: 1) Compared to prevailing image datasets, e.g., ImageNet, point cloud datasets are de facto small. The mixing process can provide a much larger online training sample pool. 2) The disentangling process motivates the model to mine geometric prior knowledge, e.g., key points. To verify the effectiveness of the proposed pretext task, we build one baseline network, which is composed of one encoder and one decoder. During pre-training, we mix two original shapes and obtain the geometry-aware embedding from the encoder; an instance-adaptive decoder is then applied to recover the original shapes from the embedding. Albeit simple, the pre-trained encoder can capture the key points of an unseen point cloud and surpasses the encoder trained from scratch on downstream tasks. The proposed method improves the empirical performance on both the ModelNet-40 and ShapeNet-Part datasets in terms of point cloud classification and segmentation tasks. We further conduct ablation studies to explore the effect of each component and verify the generalization of our proposed strategy by using different backbones.
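
The mixing half of the pretext task can be sketched as below. Sampling half of the points from each shape and shuffling them is an assumption made for illustration; the abstract does not state the exact mixing rule, and `mix_shapes` is a hypothetical name.

```python
import numpy as np

def mix_shapes(a, b, rng=None):
    """Build a mixed cloud with as many points as `a`, half drawn from each input.

    a, b: point clouds of shape (N, 3) and (M, 3), with M >= N - N // 2.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = a.shape[0]
    idx_a = rng.choice(n, n // 2, replace=False)
    idx_b = rng.choice(b.shape[0], n - n // 2, replace=False)
    mixed = np.concatenate([a[idx_a], b[idx_b]], axis=0)
    # Shuffle so the point order carries no hint about the source shape.
    return mixed[rng.permutation(n)]
```

Because each call draws a fresh subset from a fresh pair of shapes, the pool of mixed training samples is far larger than the original dataset, which matches the first advantage listed above.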

Abstract:
Deep convolutional neural networks have recently been applied to improve the quality of low-light images and have achieved promising results. However, most existing methods cannot effectively suppress noise during the enhancement process, resulting in unknown artifacts and color distortions. In addition, these methods do not fully utilize illumination information and perform poorly under extremely low-light conditions. To alleviate these problems, we propose the illumination guided attentive wavelet network (IGAWN) for low-light image enhancement (LLIE). Considering that the wavelet transform can effectively separate high-frequency noise from the desired low-frequency content, we enhance low-light images in the frequency domain. By integrating attention mechanisms with the wavelet transform, we develop the attentive wavelet transform to capture more important wavelet features, which enables the desired content to be enhanced and the redundant noise to be suppressed. To improve the image enhancement performance in extremely low-light environments, we extract illumination information from the input images and exploit it as guidance for image enhancement through the frequency feature transform (FFT) layer. The proposed FFT layer generates a frequency-aware affine transformation from the estimated illumination information, which can adaptively modulate the image features of different frequencies. Extensive experiments on synthetic and real-world datasets demonstrate that our IGAWN performs favorably against state-of-the-art LLIE methods.
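
The frequency separation that motivates working in the wavelet domain can be seen in a one-level 2D Haar transform, sketched below. The choice of the Haar wavelet and the function names are assumptions; the abstract does not say which wavelet IGAWN uses.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform of an (H, W) array, H and W even.

    Returns the low-frequency band followed by three detail bands.
    """
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2.0
    det1 = (a - b + c - d) / 2.0  # differences across columns
    det2 = (a + b - c - d) / 2.0  # differences across rows
    det3 = (a - b - c + d) / 2.0  # diagonal differences
    return low, det1, det2, det3

def haar_idwt2(low, det1, det2, det3):
    """Exact inverse of haar_dwt2."""
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (low + det1 + det2 + det3) / 2.0
    x[0::2, 1::2] = (low - det1 + det2 - det3) / 2.0
    x[1::2, 0::2] = (low + det1 - det2 - det3) / 2.0
    x[1::2, 1::2] = (low - det1 - det2 + det3) / 2.0
    return x
```

Smooth content concentrates in the low band while pixel-level noise spreads into the detail bands, and the transform is lossless, so features can be processed per band and mapped back to the image domain.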

Abstract:
Optical flow computation for video under dynamic illumination is a challenging issue in video multimedia applications. In this paper, we address this issue by introducing an illumination-invariant framework for variational optical flow estimation. It consists of an illumination-invariance model that handles complex illumination changes and a data enhancement model that guarantees highly accurate optical flow estimation. In this framework, we design a log-correlation descriptor for the data term, which handles complex illumination changes by eliminating the common parameters shared by neighboring pixels in the corresponding illumination change model, while improving the accuracy of optical flow estimation by enhancing the discriminability of data term matching. We also introduce a novel optical flow model with L_0 norm regularization, which reconstructs the optical flow field by a sparse flow gradient counting scheme. Unlike other edge-preserving regularizers, it does not depend on local motion features, but locates important flow edges globally. Because it avoids local filtering and averaging operations, it does not cause edge blurriness. It is particularly effective for enhancing major flow edges while eliminating a manageable degree of low-amplitude motion structures to control smoothing and reduce oversegmentation artifacts. Even small-scale motion structures with high contrast can be preserved remarkably well. The experimental results show that our method significantly outperforms previous illumination-robust optical flow methods in handling complex illumination changes, and achieves competitive evaluation results on the challenging MPI-Sintel and KITTI datasets.

Abstract:
Most existing person re-identification (Re-ID) methods rely on the visual appearance of the human body. However, face cues are rarely explored in the Re-ID community despite the fact that the face is an important biometric identifier for human beings. In this work, we propose a Similarity Ensemble Framework (SEF) that uses multi-cue similarity embedding and propagation to effectively fuse body and face information for person re-identification. Specifically, for each query, we first perform standard pedestrian retrieval using body and face cues, respectively, to obtain some candidate results with high confidence. Next, the body and face similarities are combined and embedded into a shared space as node features, and two graphs with the same nodes and different edges with respect to body and face affinities are constructed. Then, the similarity features are propagated in both body and face graphs using graph convolution to capture the relationship among the candidates using different cues. Lastly, the refined features are used to compute the final similarities with the query. The proposed method not only combines the similarities of body and face, but also takes into account the relationship among all the other candidate samples under different cues. Extensive experiments demonstrate that the use of face cues effectively improves the performance of person Re-ID even if the performance obtained by the face alone is much lower than that of the body, suggesting that our approach is able to capture valuable information beyond the body from weaker face cues in person Re-ID scenarios.

Abstract:
The interpretation of tactile stimuli empowers humans to identify substances, distinguish materials, and engage in tactile communication. For stimulus design in human-computer interaction, objective similarity measures improve efficiency and save costs. Inspired by the fact that biological systems are robust in recognizing multimedia stimuli, we propose a neuromorphic method for similarity measurement. The method is divided into two steps. First, tactile information is translated into biological representations by mimicking a low-threshold mechanoreceptor through a physiological neuronal model. Then, three measures are nominated to assess the similarity of neural spike trains from the following perspectives: interval spike counting, temporal matching, and vector space embedding. Regression analysis showed that the linearity of these measures was significant, indicating that the filtering ability of the physiological neuron model is robust. One of the measures is selected for comparison with the signal-to-noise ratio, structural similarity, and hybrid metric. The results suggest that the correlation between the predictions of our method and the subjective evaluation is stable, above 0.9 for each experimental stimulus. We achieve a mutual interpretation between quantitative measures of vibrotactile similarity and subjective cognitive outcomes. Furthermore, the feasibility of this method in material classification has been substantiated through an exploratory experiment.

Abstract:
Image-based salient object detection (ISOD) in 360° scenarios is significant for understanding and applying panoramic information. However, 360° ISOD has not been widely explored due to the lack of large, complex, high-resolution, and well-labeled datasets. To this end, we construct a large-scale 360° ISOD dataset with object-level pixel-wise annotation on equirectangular projection (ERP), which contains rich panoramic scenes with at least 2K resolution and is, to the best of our knowledge, the largest dataset for 360° ISOD to date. By observing the data, we find that current methods face three significant challenges in panoramic scenarios: diverse distortion degrees, discontinuous edge effects and changeable object scales. Inspired by humans' observing process, we propose a view-aware salient object detection method based on a Sample Adaptive View Transformer (SAVT) module with two sub-modules to mitigate these issues. Specifically, the View Transformer (VT) sub-module contains three transform branches based on different kinds of transformations to learn various features under different views and increase the tolerance of the model's features to distortion, edge effects and object scales. Moreover, the Sample Adaptive Fusion (SAF) sub-module adjusts the weights of the different transform branches based on the features of each sample so that the transformed, enhanced features are fused more appropriately. The benchmark results of 20 state-of-the-art ISOD methods reveal that the constructed dataset is very challenging. Moreover, exhaustive experiments verify that the proposed approach is practical and outperforms the state-of-the-art methods.

Abstract:
Expensive annotation costs significantly hinder the development of facial landmark tracking owing to the frame-by-frame labeling of dense landmarks. The most promising approach to address this problem is to develop a self-supervised tracker for large-scale unlabeled videos. However, existing self-supervised trackers trained using single-sourced knowledge are unstable under unconstrained environments. Herein, we propose multi-sourced knowledge integration (MSKI), a robust self-supervised tracking method. It integrates knowledge from multiple sources to provide supervisory signals, thereby improving the stability of the self-supervised tracker. Specifically, the proposed MSKI comprises two complementary modules: a temporal knowledge reasoning (TempRes) module and an interactive knowledge distillation (KnowDist) module. The TempRes module enforces the tracker to achieve cycle-consistent tracking, allowing the tracker to learn temporal correspondence based on the cycle-consistency of time. To exploit facial geometry knowledge against various occlusions, our tracker imposes a multi-level shape constraint over the structure of facial landmarks by leveraging adversarial shape learning, thereby enabling the tracking of occluded faces. Moreover, the tracker interacts with an initialization detector to further develop complementary knowledge via KnowDist. The KnowDist module distills the spatial and temporal knowledge provided by the detector and tracker to generate plausible labels automatically. Finally, these generated labels are utilized to fine-tune the detector, such that it provides high-quality initial landmarks for the cycle-consistent tracking of the tracker on unlabeled videos. The experimental results show that the proposed MSKI can stabilize the tracking trajectory and improve the robustness against various occlusions.

Abstract:
Multi-view clustering is a long-standing and important task; however, it remains challenging to exploit valuable information from complex multi-view data located in diverse high-dimensional spaces. The core issue is the effective collaboration of multiple views to holistically uncover the essential correlations between multi-view data through graph learning. Furthermore, most existing methods must introduce an additional clustering step to produce the final clusters, which weakens the unified relationship between graph learning and clustering. Based on the above considerations, in this paper, we present a novel method named multi-view clustering via graph collaboration (MCGC). Based on the low-dimensional representation space developed by MCGC, it first perceives the correlations between samples in each individual view under the supervision of the Hilbert-Schmidt independence criterion (HSIC). Then, MCGC learns a consensus graph by adaptively collaborating between all the views, which is able to uncover the essential structure of the multi-view data. Meanwhile, by imposing a rank constraint on the Laplacian matrix of the consensus graph to partition the multi-view data naturally into the required number of clusters, the optimal clustering results can be obtained directly without any postprocessing steps. Finally, the resulting optimization problem is solved by an alternating optimization scheme with guaranteed fast convergence. Extensive experiments on 5 benchmark multi-view datasets demonstrate that MCGC markedly outperforms the state-of-the-art baselines.
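
The rank constraint works because of a standard spectral graph theory fact: the multiplicity of the zero eigenvalue of a graph Laplacian equals the number of connected components, so a Laplacian of rank n - c splits n samples into exactly c clusters. The short check below illustrates this fact only; it is not the MCGC optimization, and the function name is illustrative.

```python
import numpy as np

def count_components(affinity, tol=1e-8):
    """Count connected components via the zero eigenvalues of the Laplacian."""
    w = (affinity + affinity.T) / 2.0        # symmetrize the affinity matrix
    laplacian = np.diag(w.sum(axis=1)) - w   # unnormalized Laplacian L = D - W
    eigvals = np.linalg.eigvalsh(laplacian)
    return int((np.abs(eigvals) < tol).sum())

# Toy consensus graph: samples {0, 1} and {2, 3, 4} form two disjoint groups.
graph = np.zeros((5, 5))
graph[:2, :2] = 1.0
graph[2:, 2:] = 1.0
np.fill_diagonal(graph, 0.0)
num_clusters = count_components(graph)
```

Once the learned graph has exactly c components, cluster labels can be read off the components directly, which is why no separate clustering step is needed.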

Abstract:
3D face reconstruction from a single image is a vital task in various multimedia applications. A key challenge for 3D face shape reconstruction is to build the correct dense face correspondence between the monocular input face and the deformable mesh. Most existing methods rely on shape labels fitted by traditional methods or strong priors such as multi-view geometry consistency. In contrast, we propose an innovative 3D Modulated Morphable Model (3D3M) to learn the dense shape correspondence from monocular images in a self-supervised manner. Specifically, given a batch of input faces, 3D3M encodes their 3DMM attributes (shape, texture, lighting, etc.) and then randomly shuffles the 3DMM attributes to generate attribute-changed faces. The attribute-changed faces can be encoded and rendered back in a cycle-consistent manner, which enables us to utilize the self-supervised consistencies in dense mesh vertices and reconstructed pixels. The dense shape and pixel correspondence enable us to adopt a series of self-supervised constraints to fit the 3D face model accurately and learn the per-vertex correctives end-to-end. 3D3M produces high-quality 3D face reconstruction results from monocular images. Both quantitative and qualitative experimental results verify the superiority of 3D3M over prior art on 3D face reconstruction and face alignment.

Abstract:
Point cloud segmentation is fundamental in understanding 3D environments. However, most existing methods perform poorly on identifying the boundaries of touching objects and the large surfaces of objects. Planes in a scene usually act as supporting surfaces that separate touching objects and provide geometry priors to group points on a large surface, as shown in Fig. 1. Besides, planes can roughly represent the structure of a scene, and encode holistic scene contexts more efficiently than large-scale point clouds. In light of the above advantages, we devise a plane-assisted module, coined 3D-PAM, to enhance semantic segmentation of touching objects and large-surface objects. 3D-PAM consists of a plane separation network (PS-Net) and a plane relation network (PR-Net). PS-Net focuses on learning features that can robustly separate touching objects, e.g., a chair on a floor, as well as capture plane-based geometry priors to group points on a large plane, e.g., points of a desk. PR-Net encodes mutual plane relations as a proxy of a scene structure to capture holistic contexts. 3D-PAM is designed as a plug-and-play module so that it can be easily plugged into any off-the-shelf semantic segmentation network. Extensive experiments demonstrate that the method achieves large segmentation improvements on several backbones, and accomplishes superior results on most categories when using a RandLA-Net backbone (11/13 categories on the S3DIS dataset and 15/20 categories on the ScanNetv2 dataset). The project is available on GitHub at https://github.com/windmillknight/Context-Aware-3D-Point-Cloud-Semantic-Segmentation-With-Plane-Guidance

Abstract:
In recent years, binary hashing methods have been widely used in large-scale multimedia retrieval because of their low computational complexity and memory cost. Generally, better retrieval accuracy can be achieved with a longer hash code, which, however, may suffer from redundancy. In this paper, we propose a novel hash bit selection method, called Hash Bit Selection with Reinforcement Learning (HBS-RL), which aims to adaptively select the most informative bits from the database binary codes. In our approach, the hash bit selection problem is first modeled as a Markov Decision Process (MDP), which is solved with reinforcement learning. HBS-RL learns a policy for bit selection, which effectively identifies the most informative bits by directly maximizing mean Average Precision (mAP) during training. Specifically, given a generated bit pool, HBS-RL can sequentially select bits for different code lengths with a very lightweight fully-connected policy network. The proposed method is evaluated on the MNIST, CIFAR-10, ImageNet and NUS-WIDE datasets, and the results show that it significantly improves the retrieval performance of existing unsupervised and deep supervised hashing methods. It also outperforms the state-of-the-art bit selection methods. To make our results easy to reproduce, we release our source code at: https://github.com/xyez/HBS-RL.

Abstract:
The goal of unpaired image captioning (UIC) is to describe images without using image-caption pairs in the training phase. Although challenging, we expect the task can be accomplished by leveraging images aligned with visual concepts. Most existing studies use off-the-shelf algorithms to obtain the visual concepts because the Bounding Box (BBox) labels or relationship-triplet labels used for training are expensive to acquire. To avoid exhaustive annotations, we propose a novel approach to achieve cost-effective UIC. Specifically, we adopt image-level labels to optimize the UIC model in a weakly-supervised manner. For each image, we assume that only the image-level labels are available without specific locations and numbers. The image-level labels are utilized to train a weakly-supervised object recognition model to extract object information (e.g., instance), and the extracted instances are adopted to infer the relationships among different objects using an enhanced graph neural network (GNN). The proposed approach achieves comparable or even better performance compared with previous methods without expensive annotations. Furthermore, we design an unrecognized object (UnO) loss to improve the alignment of the inferred object and relationship information with the images. It can effectively alleviate the issue encountered by existing UIC models when generating sentences with nonexistent objects. To the best of our knowledge, this is the first attempt to address the problem of Weakly-Supervised visual concept recognition for UIC (WS-UIC) based only on image-level labels. Extensive experiments demonstrate that the proposed method achieves inspiring results on the COCO dataset while significantly reducing the labeling cost.

Abstract:
The past few years have witnessed the great success of multi-frame quality enhancement for compressed video. Although the existing methods based on deformable alignment have achieved the state-of-the-art performance, they do not pay enough attention to the recovery of detail information. In this work, we propose a Spatio-Temporal Detail Retrieval (STDR) method to promote the recovery of detail information. To alleviate the problem of inaccurate deformable offsets caused by the fixed receptive field, motivated by multi-task learning, we design a plug-and-play Multi-path Deformable Alignment (MDA) module to generate more accurate offsets by integrating the alignment features of different receptive fields, so that the temporal detail information can be better recovered. For the spatial detail information restoration, several residual dense blocks with channel attention layer are utilized in the reconstruction module to explore valuable high-frequency spatial information from the fused multi-path alignment features. Meanwhile, a complementary loss function based on the Pearson correlation coefficient is developed to ameliorate the over-smoothing shortcoming caused by pixel-wise mean square or absolute value loss. Experimental results demonstrate that the proposed STDR network achieves superior performance compared with the state-of-the-art methods in both quantitative and qualitative evaluations.
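
A loss built on the Pearson correlation coefficient can be sketched as below. Taking the loss as 1 - r over flattened images is an assumption; the abstract only says that the complementary loss is based on the Pearson correlation coefficient, and the function name is illustrative.

```python
import numpy as np

def pearson_loss(pred, target, eps=1e-8):
    """Return 1 - r, where r is the Pearson correlation of the two arrays."""
    p = pred.ravel() - pred.mean()
    t = target.ravel() - target.mean()
    r = (p * t).sum() / (np.sqrt((p ** 2).sum() * (t ** 2).sum()) + eps)
    return 1.0 - r
```

Unlike a pixel-wise mean square or absolute error, this term is invariant to a global gain and offset and rewards matching the pattern of variation, so it would typically be added to a pixel-wise loss and not used on its own.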

Abstract:
Various studies have been conducted on instance segmentation, and great strides have been made over the past few years. Most recently, instance-specific mask generation via dynamic kernel predictions has shown significant performance improvements even without bounding boxes or anchors. However, this scheme still does not fully consider dynamic properties since the receptive field is not large enough to cover the spatially meaningful range due to memory limitations. Furthermore, the single fused feature often fails to grasp complicated boundaries for objects of different sizes. In this article, we propose a dynamic residual filtering method with the Laplacian pyramid, which separately restores the global layout and local boundaries of instance masks. Specifically, we first apply a Laplacian pyramid-based decomposition scheme to features encoded by the backbone and subsequently restore sub-band mask residuals from coarse to fine pyramid levels. To do this, we design spatially-aware convolution filters to progressively capture the residual form of mask features at each level of the Laplacian pyramid while maintaining deformable receptive fields with dynamic offset information. This is desirable since global and local properties of mask features can be accurately restored while keeping the spatial flexibility through the invertible process of the Laplacian reconstruction. Experimental results on the COCO dataset demonstrate that our proposed method achieves state-of-the-art performance, i.e., 42.7% AP. The code and model are publicly available at: https://github.com/tjqansthd/LapMask.
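
The invertibility of Laplacian pyramid reconstruction can be checked with the small sketch below. It uses 2 x 2 average pooling and nearest-neighbour upsampling in place of Gaussian filtering, which is a simplification chosen for brevity; the function names are illustrative and this is not the LapMask code.

```python
import numpy as np

def downsample(x):
    """Halve the resolution by 2 x 2 average pooling (H and W must be even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Double the resolution by nearest-neighbour repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def build_pyramid(x, levels):
    """Return [residual_0, ..., residual_{levels-1}, coarsest], fine to coarse."""
    pyramid, current = [], x
    for _ in range(levels):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller))  # sub-band residual
        current = smaller
    pyramid.append(current)
    return pyramid

def reconstruct(pyramid):
    """Restore the input by adding residuals back from coarse to fine."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = upsample(current) + residual
    return current
```

The coarsest level carries the global layout and each residual carries boundary detail at its own scale, so restoring residuals level by level mirrors the coarse-to-fine mask restoration described above.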

Abstract:
Existing few-shot segmentation approaches generally compare the semantic prototype vectors of the query image and the support images to obtain the segmentation result. However, recent studies have shown that a single feature vector in a feature map cannot accurately represent pixel-level categories, leading to poor segmentation of object boundaries and semantic ambiguity. To address this common problem, we propose a novel contour-aware network (CTANet) for few-shot segmentation in this paper. Unlike the usual practice of classifying each pixel separately, CTANet regards all pixels within the same contour as a whole, which takes advantage of the internal consistency of objects to obtain a more accurate representation of category information. To obtain an accurate object contour, our network consists of a contour generation module and a contour refinement module, where the former exploits multiple levels of features to generate a primary contour map and the latter learns to refine the primary contour map. Furthermore, a novel contour-aware mixed loss is proposed to fuse the common BCE loss and our contour-aware loss to supervise the training process on two levels, pixel-level and contour-level. Extensive experiments demonstrate that our CTANet achieves new state-of-the-art performance on PASCAL-5^i and COCO-20^i. We hope that our new perspective will provide more clues for future research on few-shot segmentation. Our code is freely available at: https://github.com/hardtogetA/CTANet.

Abstract:
Transformer architectures have recently been introduced into the field of visual question answering (VQA), due to their powerful capabilities of information extraction and fusion. However, existing Transformer-like models, including models using a single Transformer structure and large-scale pre-trained generic visual-linguistic models, do not fully utilize both the positional information of words in questions and the positional information of objects in images, which are shown in this paper to be crucial in VQA tasks. To address this challenge, we propose a novel positional attention guided Transformer-like architecture, which can adaptively extract positional information within and across the visual and language modalities, and use this information to guide high-level interactions in inter- and intra-modality information flows. In particular, we design and assemble three positional attention modules into a single Transformer-like model MCAN. We show that the positional information introduced in intra-modality interaction can adaptively modulate inter-modality interaction according to different inputs, which plays an important role in visual reasoning. Experimental results demonstrate that our model outperforms the state-of-the-art models and is particularly good at handling object counting questions. Overall, our model achieves accuracies of 70.10%, 71.27%, and 71.52% on the COCO-QA, VQA v1.0 test-std, and VQA v2.0 test-std datasets, respectively.

Abstract:
Text-to-image synthesis is an attractive but challenging task that aims to generate a photo-realistic and semantically consistent image from a specific text description. The images synthesized by off-the-shelf models usually contain limited components compared with the corresponding image and text description, which decreases the image quality and the textual-visual consistency. To address this issue, we propose a novel Vision-Language Matching strategy for text-to-image synthesis, named VLMGAN, which introduces a dual vision-language matching mechanism to strengthen the image quality and semantic consistency. The dual vision-language matching mechanism considers textual-visual matching between the generated image and the corresponding text description, and visual-visual consistency constraints between the synthesized image and the real image. Given a specific text description, VLMGAN first encodes it into textual features and then feeds them to a dual vision-language matching-based generative model to synthesize a photo-realistic and textually consistent image. Besides, the popular evaluation metrics for text-to-image synthesis are borrowed from simple image generation, and mainly evaluate the realism and diversity of the synthesized images. Therefore, we introduce a metric named Vision-Language Matching Score (VLMS) to evaluate the performance of text-to-image synthesis, which considers both the image quality and the semantic consistency between the synthesized image and the description. The proposed dual multi-level vision-language matching strategy can be applied to other text-to-image synthesis methods. We implement this strategy on two popular baselines, which are denoted VLMGAN_{+AttnGAN} and VLMGAN_{+DFGAN}. The experimental results on two widely-used datasets show that the model achieves significant improvements over other state-of-the-art methods.

Abstract:
Removing undesirable reflections in photographs benefits both human perceptions and downstream computer vision tasks, but it is a highly ill-posed problem based on a single RGB image. Different from RGB images, near-infrared (NIR) images captured by an active NIR camera are less likely to be affected by reflections when glass and camera planes form certain angles, while textures on objects could “vanish” in some situations. Based on this observation, we propose a cascaded reflection removal network with an image feature fusion strategy to utilize auxiliary information in active NIR images. To tackle the insufficiency of training data, we propose a data generation pipeline to approximate perceptual properties and the reflection-suppressing nature of active NIR images. We further build a dataset with synthetic and real images to facilitate the research. Experimental results show that the proposed method outperforms state-of-the-art reflection removal methods in both quantitative metrics and visual quality.

Abstract:
Seeking feature correspondences among two or more images is an important problem in computer vision and image processing. The putative matches constructed from the similarity of feature descriptors are often contaminated by many false matches. Typically, the local neighborhood points of a true match point have a rank order, which is maintained in the corresponding image; we call this rank consistency. In this paper, we design a number of sorting plans to obtain the neighborhood rank lists by taking full advantage of the local neighborhood geometry structure. In order to measure the differences between rank lists, we adopt the well-known Kendall rank correlation coefficient from statistics and generalize its definition for the matching problem. We design a neighborhood common element guidance strategy and a multi-neighborhood strategy to improve the universality and robustness of our method. Our method has linear complexity and outperforms state-of-the-art methods on several challenging data sets. It also performs well in image registration and loop-closure detection tasks. The source code of our method is publicly available at https://github.com/MnYangs/mGKRCC.
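For readers unfamiliar with the rank measure named above, the following is a minimal sketch of the standard Kendall rank correlation coefficient (tau-a). It is not the generalized definition proposed in the paper: it assumes both rank lists contain exactly the same items with no ties, and the function name is hypothetical.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Standard Kendall tau-a between two orderings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    concordant = 0
    discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant when both lists order x and y the same way.
        sign = (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(rank_a)
    return (concordant - discordant) / (n * (n - 1) / 2)

if __name__ == "__main__":
    # Neighbor indices sorted around a feature point in two images (made-up data).
    print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))
    print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))
```

A value near 1 means the neighborhood order is preserved between the two images, which is the behavior expected around a true match; a value near -1 means the order is reversed.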

Abstract:
Visual sentiment analysis aims to predict human emotional responses to visual stimuli. It has attracted considerable attention owing to the increasing popularity of online image sharing. Most researchers have focused on improving emotion recognition using holistic and local information derived from given images. Relatively less attention has been paid to the semantic information of objects in images, which influences human emotional responses to the images. Therefore, we propose a novel object semantic attention network (OSANet) that attempts to unravel the semantic information of objects in images that contribute to emotion detection. The OSANet combines both global representation and semantic information of objects to predict the emotion elicited by a given image. First, the holistic features that represent the entire image are extracted using convolutional blocks. Subsequently, the object-level semantic information is obtained from pre-trained word embedding and then weighted according to the relative importance of the object using the attention mechanism. Notably, a new loss function to address the subjectivity of sentiment analysis is introduced, which improves the performance of the emotion detection task. Extensive experiments on three image emotion datasets demonstrated the superiority and interpretability of the OSANet. The results show that the OSANet outperforms extant image emotion detection frameworks.

Abstract:
Recent research on deep convolutional neural networks (CNNs) has provided a significant performance boost on efficient super-resolution (SR) tasks by trading off performance and applicability. However, most existing methods focus on cutting feature processing consumption to reduce the parameters and calculations without refining the intermediate features, which leads to inadequate information in the restoration. In this paper, we propose a lightweight network termed DDistill-SR, which significantly improves the SR quality by capturing and reusing more helpful information in a static-dynamic feature distillation manner. Specifically, we propose a plug-in reparameterized dynamic unit (RDU) to improve the trade-off between performance and inference cost. During the training phase, the RDU learns to linearly combine multiple reparameterizable blocks by analyzing varied input statistics to enhance layer-level representation. In the inference phase, the RDU is equivalently converted to simple dynamic convolutions that explicitly capture robust dynamic and static feature maps. Then, the information distillation block is constructed from several RDUs to enforce hierarchical refinement and selective fusion of spatial context information. Furthermore, we propose a dynamic distillation fusion (DDF) module to enable dynamic signal aggregation and communication between hierarchical modules to further improve performance. Empirical results show that our DDistill-SR outperforms the baselines and achieves state-of-the-art results on most super-resolution domains with far fewer parameters and less computational overhead.
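The equivalent conversion at inference time rests on the linearity of convolution: a weighted sum of parallel convolution branches equals a single convolution with the weighted sum of their kernels. The sketch below only checks that property in 1D with NumPy. It is not the RDU itself; the function names, the equal kernel lengths, and the fixed weights (which the paper describes as input-dependent) are assumptions for illustration.

```python
import numpy as np

def multi_branch(x, kernels, weights):
    """Training-style form: run every branch, then mix the branch outputs."""
    return sum(w * np.convolve(x, k, mode="same") for w, k in zip(weights, kernels))

def single_branch(x, kernels, weights):
    """Inference-style form: mix the kernels once, then run one convolution."""
    merged = sum(w * k for w, k in zip(weights, kernels))
    return np.convolve(x, merged, mode="same")

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.random(16)
    kernels = [rng.random(3) for _ in range(3)]  # all branches share one kernel size
    weights = [0.2, 0.5, 0.3]                    # stand-in for predicted mixing weights
    same = np.allclose(multi_branch(signal, kernels, weights),
                       single_branch(signal, kernels, weights))
    print(same)
```

The merged form needs one convolution instead of one per branch, which is why this kind of reparameterization reduces inference cost without changing the output.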

Abstract:
Pose-guided person image generation, which aims to transfer the pose of a given person to a target pose, has recently received considerable research attention. Due to the spatial misalignment and occlusions of different local body parts caused by pose variations, this task is still challenging, especially in maintaining high-fidelity textures and body structures in generated images. Besides, most works also suffer from the limited number of texture styles in the given person datasets, restricting the diversity of generated persons' appearances. To solve these problems, we design a Kernel-based Texture-Fusion Joint Refinement Network (TFJR-Net) to jointly refine the structure and texture information of generated images. First, we leverage a bone-map representation to guide the generation of human parsing maps, which has more structure priors and richer context information than traditional key-point maps, thus reducing the uncertainty of generated body structures. Next, a Texture-Kernel Injection Normalization module (TKIN) is proposed to inject the per-region texture-kernel into the corresponding semantic region from the human parsing map, which decouples the texture and shape information, and also preserves fine-grained features for complex textures. Furthermore, we are the first to introduce external texture patterns from outside the dataset into human semantic regions such as the upper clothes. We fuse the two texture domains in a shared texture space through our designed texture-fusion TKIN modules. Extensive experiments are conducted on the Deepfashion dataset, with the DTD dataset as an external texture source. The experimental results demonstrate the superiority of our proposed method in generating persons with better textures and structures than state-of-the-art works, and also show the generalization ability of our proposed method to absorb diversified external textures for generating person images. The source code is available at https://github.com/pilgrim00/TKIN.

Abstract:
Current continuous sign language recognition systems generally target a single language. When it comes to the multilingual problem, existing solutions often build separate models based on the same network and then train them with their corresponding sign language corpora. Observing that different sign languages share some low-level visual patterns, we argue that it is beneficial to optimize the recognition model in a collaborative way. With this motivation, we propose the first unified framework for multilingual continuous sign language recognition. Our framework consists of a shared visual encoder for visual information encoding, multiple language-dependent sequential modules for long-range temporal dependency learning aimed at different languages, and a universal sequential module to learn the commonality of all languages. An additional language embedding is introduced to distinguish different languages within the shared temporal encoders. Further, we present a max-probability decoding method to obtain the alignment between sign videos and sign words for visual encoder refinement. We evaluate our approach on three continuous sign language recognition benchmarks, i.e., RWTH-PHOENIX-Weather, CSL and GSL-SD. The experimental results reveal that our method outperforms the individually trained recognition models. Our method also demonstrates better performance compared with state-of-the-art algorithms.

Abstract:
In this paper, we propose a novel approach to video object segmentation where dual streams consisting of a shared network and a special network are designed to constitute the feature memory of history frames. Cues of spatial position and time stamp are explicitly explored to learn the context for each frame in the video sequence. Self-attention and cross-attention are simultaneously exploited to extract more powerful features for segmentation. In contrast to STM and its variants, the proposed dual cross-attention operates in both appearance space and semantic space, such that the derived features are more distinctive and thus robust to similar overlapping objects. During decoding for segmentation, a local refinement technique is designed for the uncertain boundaries to obtain more precise and smooth object contours. Experimental results on the challenging benchmark datasets DAVIS-2016, DAVIS-2017, and YouTube-VOS demonstrate the effectiveness of our proposed approach to video object segmentation.

Abstract:
State-of-the-art deep learning based stereo matching algorithms usually rely on full-size cost volumes for highly accurate disparity estimation. The full-size cost volume processes all possible disparity candidates equally without considering their different matching uncertainties. Consequently, considerable redundant computation is spent on those candidates with very low matching uncertainties, making these methods difficult to deploy in real-time applications. To tackle this problem, we propose CVCNet featuring an adaptive disparity range prediction module (ADR) and a disparity refinement module (DRM). The ADR adaptively predicts the pixel-wise disparity range to discard the “unimportant” disparity candidates. It enables our network to obtain a compressed cost volume. Besides, the DRM improves disparity range prediction and refines the predicted disparity map. With the proposed modules, our CVCNet learns to build a compressed cost volume to achieve efficient disparity estimation. Experimental results on the KITTI and SceneFlow datasets show that our method achieves state-of-the-art performance and runs an order of magnitude faster than existing 3D CNN based methods. Particularly, our method ranks 1st on the KITTI 2012 and KITTI 2015 benchmarks among all published methods with running time shorter than 100 ms.

Abstract:
Video-language pre-training (VLP) has attracted increasing attention for cross-modality understanding tasks. To enhance visual representations, recent works attempt to adopt transformer-based architectures as video encoders. These works usually focus on the visual representations of the sampled frames. Compared with frame representations, frame patches incorporate more fine-grained spatio-temporal information, which could lead to a better understanding of video contents. However, how to exploit the spatio-temporal information within frame patches for VLP has been less investigated. In this work, we propose a method to learn tube tokens to model the key spatio-temporal information from frame patches. To this end, multiple semantic centers are introduced to focus on the underlying patterns of frame patches. Based on each semantic center, the spatio-temporal information within frame patches is integrated into a unique tube token. Complementary to frame representations, tube tokens provide detailed clues of video contents. Furthermore, to better align the generated tube tokens and the contents of descriptions, a local alignment mechanism is introduced. The experiments based on a variety of downstream tasks demonstrate the effectiveness of the proposed method.

Abstract:
Recent works have shown that the joint-detection-and-embedding (JDE) paradigm has significantly enhanced the performance of multiple object tracking by simultaneously learning detection and re-identification features. These methods typically utilize a weight-shared backbone network and two non-interactive branches for the different tasks. This non-interactive multi-task learning strategy cannot make full use of the geometric and semantic information shared between the detection and re-identification tasks. Moreover, in the JDE paradigm, there exists a feature misalignment between detection and re-identification due to their different optimization directions. In this article, BGTracker is proposed as a novel online tracking framework with a cross-task bidirectional guidance strategy between detection and re-identification. First, we propose a Channel-based Decoupling module and a Cross-direction Transformer to alleviate feature misalignment, which can obtain task-aligned embeddings and discriminative representations at the feature level. Then, we propose the bidirectional guidance strategy to link the two tasks through the statistical information of the prediction maps. In this strategy, two designed feature transformations are employed to utilize the advantages of each task to complement the other at the task level. Finally, extensive experiments demonstrate that the proposed BGTracker outperforms various existing methods on the MOTChallenge benchmarks.

Abstract:
Blind image quality assessment (BIQA) for in-the-wild images has achieved great progress by training advanced deep neural networks. However, current BIQA models suffer from a generalization challenge, meaning that a well-trained BIQA model is still very limited in evaluating images with different distributions. Deep BIQA models are data-intensive, but the annotation of image quality labels is extremely expensive. Designing a generalizable BIQA model with few training samples is therefore highly desirable. Motivated by the above facts, this paper presents a knowledge-guided BIQA (KG-IQA) framework by integrating domain knowledge from the human visual system (HVS) and natural scene statistics (NSS). Specifically, the quality-aware HVS and NSS features are first extracted as prior knowledge. Then, we embed the two types of knowledge into a conventional deep neural network by learning to predict the HVS and NSS features, producing the knowledge-enhanced quality features, based on which the final image quality score is obtained. We conduct extensive experiments and comparisons on five authentically distorted IQA datasets. The experimental results demonstrate that the introduction of knowledge greatly reduces the required amount of training images, and the proposed KG-IQA model achieves superior performance in terms of both prediction accuracy and generalization ability.

Abstract:
Recent advances in Graph Neural Networks (GNNs) have achieved superior results in many challenging tasks, such as few-shot learning. Despite its capacity to learn and generalize a model from only a few annotated samples, GNN is limited in scalability, as deep GNN models usually suffer from severe over-fitting and over-smoothing. In this work, we propose a novel GNN framework with a triple-attention mechanism, i.e., node self-attention, neighbor attention, and layer memory attention, to tackle these challenges. We provide both theoretical analysis and illustrations to explain why the proposed attentive modules can improve GNN scalability for few-shot learning tasks. Our experiments show that the proposed Attentive GNN model outperforms the state-of-the-art few-shot learning methods using both GNN and non-GNN approaches. The improvement is consistent over the mini-ImageNet, tiered-ImageNet, CUB-200-2011, and Flowers-102 benchmarks, using both ConvNet-4 and ResNet-12 backbones, and under both the inductive and transductive settings. Furthermore, we demonstrate the superiority of our method for few-shot fine-grained and semi-supervised classification tasks with extensive experiments.

Abstract:
Current vision language pretraining models are dominated by methods using region visual features extracted from object detectors. Despite their good performance, the extract-then-process pipeline significantly restricts the inference speed and therefore limits their real-world use cases. However, training vision language models from raw image pixels is difficult, as the raw image pixels give much less prior knowledge than region features. In this paper, we systematically study how to leverage auxiliary visual pretraining tasks to help train end-to-end vision language models. We introduce three types of visual losses that enable much faster convergence and better finetuning accuracy. Compared with region feature models, our end-to-end models achieve similar or better performance on downstream tasks and run more than 10 times faster during inference. Compared with other end-to-end models, our proposed method achieves similar or better performance when pretrained for only 10% of the pretraining GPU hours.

Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging ReID task, which aims to retrieve and match images of the same identity across the heterogeneous visible and infrared modalities. Thus, the core of this task is to bridge the huge gap between these two modalities. Existing methods mainly face the problem of insufficient perception of modality information and cannot learn good discriminative modality-invariant embeddings for identities, which limits their performance. To solve these problems, we propose a new cross-modality transformer-based method (CMTR) for the visible-infrared person re-identification task, which can explicitly mine the information of each modality and generate better discriminative features based on it. Specifically, to capture the inherent characteristics of the modalities, we design novel modality embeddings, which are fused with token embeddings to encode modality information directly. Moreover, to enhance the representation of the modality embeddings and adjust the distribution of embeddings, we further propose a modality-aware enhancement loss based on the learned modality information, which contains two components to reduce intra-class distance and enlarge inter-class distance simultaneously. To our knowledge, this is the first exploration of applying a pure transformer network to the cross-modality re-identification task. We conduct extensive experiments on the public SYSU-MM01 and RegDB datasets; compared with previous methods, our method achieves good performance with more compact and powerful embeddings for cross-modality retrieval.

Abstract:
Recently, realistic DeepFake videos have raised severe security concerns in society. Existing video-based detection methods observe local spatial regions with a coarse temporal view, so it is difficult for them to obtain subtle spatiotemporal information, resulting in limited generalization ability. In this paper, we propose a novel Augmented Multi-scale Spatiotemporal Inconsistency Magnifier (AMSIM) with a Global Inconsistency View (GIV) and a more meticulous Multi-timescale Local Inconsistency View (MLIV), focusing on mining comprehensive and more subtle spatiotemporal cues. First, the GIV, which includes the global spatial and long-term temporal views, is established to ensure that comprehensive spatiotemporal clues are captured. Then, the MLIV, with the critical local spatial and multi-timescale local temporal views, is designed to magnify otherwise undetectable spatiotemporal abnormalities. Subsequently, the GIV is utilized to guide the MLIV to dynamically find local spatiotemporal anomalies that are highly relevant to the overall video. Finally, to further obtain a generalized framework, an adversarial data augmentation is specially designed to expand source domains and simulate unseen forgery domains. Extensive experiments on six large-scale datasets show that our AMSIM outperforms state-of-the-art detection methods and remains effective when applied to unseen forgery techniques and datasets.

Abstract:
Few-shot semantic segmentation is the task of learning to locate each pixel of the novel class in the query image with only a few annotated support images. Current correlation-based methods construct pair-wise feature correlations to establish many-to-many matching, because typical prototype-based approaches cannot learn fine-grained correspondence relations. However, the existing methods still suffer from the noise contained in naive correlations and the lack of context semantic information in correlations. To alleviate these problems, we propose a Feature-Enhanced Context-Aware Network (FECANet). Specifically, a feature enhancement module is proposed to suppress the matching noise caused by inter-class local similarity and enhance the intra-class relevance in the naive correlation. In addition, we propose a novel correlation reconstruction module that encodes extra correspondence relations between foreground and background and multi-scale context semantic features, significantly boosting the encoder's ability to capture a reliable matching pattern. Experiments on the PASCAL-5^i and COCO-20^i datasets demonstrate that our proposed FECANet leads to remarkable improvement compared to previous state-of-the-art methods, demonstrating its effectiveness.

Abstract:
Change captioning aims to describe the differences between a pair of images with a natural language sentence. Compared with single image captioning, change captioning requires not only understanding the fine-grained information of each image, but also determining whether a change occurs and further representing the differences between the image pair. Although much progress has been made, precisely representing differences under the distraction of viewpoint change, especially tiny differences, remains a severe challenge. In this paper, we propose a novel Intra- and Inter-representation Interaction Network (I3N) to learn a fine difference representation that is immune to viewpoint change. In the Intra-representation Interaction stage, we design Geometry-Semantic Interaction Refining (GSIR) to explore the positional and semantic interactions within each image, which can serve as prior knowledge for enduring viewpoint change and reinforce the cognition of semantic change. In the Inter-representation Interaction stage, to endow the model with the capability of pinpointing the latent difference under viewpoint change, Hierarchical Representation Interaction (HRI) models the difference from coarse to fine representations through the Semantic Matcher and Change Amplifier modules. The proposed approach outperforms the state-of-the-art methods with encouraging performance on the existing change captioning benchmarks.

Abstract:
In observing images, the perception of the human visual system (HVS) is affected by both image contents and distortions. Obviously, the visual quality of the same image varies under different distortion types and intensities. Furthermore, the visual masking effects reveal that image content and distortion have a visual interaction, where the HVS presents different visibility of the identical distortion for different image contents. Based upon this, we propose a visual interaction perceptual network that can perceive both content and distortion of an image. The proposed model consists of three sub-modules: content perception module (CPM), distortion perception module (DPM), and visual interaction module (VIM). However, the subjective quality score cannot guide the model to explicitly learn the feature representations of image content and distortion. Thus, we perform a two-stage training procedure. In the first stage, we obtain CPM and DPM, where semantic features are extracted to recognize the image content in CPM, and distortion features are extracted to capture the image distortion type and intensity in DPM. In the second stage, the VIM is applied to model the interaction between semantic and distortion features, and the final predicted quality score is given by a fully connected layer. Experimental results demonstrate that the proposed method can achieve state-of-the-art performance on multiple benchmark databases, e.g., CSIQ, TID2013, KADID-10K, and KonIQ-10K.

Abstract:
Mobile edge computing is a promising framework for mobile virtual reality (VR) games. Although there are several existing studies on edge-assisted mobile VR game systems, they do not consider provisioning services with satisfactory QoE to a large number of users. In this paper, we consider the problem of providing QoE-oriented edge-assisted mobile VR gaming as a service to multiple users, with a comprehensive QoE concern covering both visual and delay aspects. Due to the unique features of mobile VR games, the problem is formulated as a Mixed Integer Quadratically Constrained Quadratic Programming (MIQCQP) problem. We show that the problem is NP-hard, with the object placement decision and the rendering level selection decision quadratically coupled. To solve this problem, we propose an Alternating Direction Method of Multipliers (ADMM) algorithm which can iteratively decouple the quadratic terms and reformulate the problem into the efficiently solvable MIQCQP-1 (i.e., MIQCQP with one constraint) problem. Trace-driven simulations show that our algorithm fits the edge-assisted mobile VR game scenario well, with fast computation (at least 4 orders of magnitude less computation time than the Gurobi solver) and good performance (at least 18% improvement in user visual QoE compared to other mobile VR schemes).

Abstract:
The task of video-query based video moment retrieval (VQ-VMR) aims to localize the segment in a reference video that semantically matches a short query video. This is a challenging task due to the rapid expansion and massive growth of online video services. To accurately retrieve the target moment, we propose a new metric to effectively assess the semantic relevance between the query video and segments in the reference video. We also develop a new VQ-VMR framework to discover the intrinsic semantic relevance between a pair of input videos. It comprises two key components: a Fine-grained Feature Interaction (FFI) module and a Semantic Relevance Measurement (SRM) module. Together they can effectively deal with both the spatial and temporal dimensions of videos. First, the FFI module computes the semantic similarity between videos at a local frame level, mainly considering the spatial information in the videos. Subsequently, the SRM module learns the similarity between videos from a global perspective, taking into account the temporal information. We have conducted extensive experiments on two key datasets, which demonstrate noticeable improvements of the proposed approach over the state-of-the-art methods.

Affiliations: Computational Science Research Center, San Diego State University, San Diego, CA, USA; Department of Computer Science, San Diego State University, San Diego, CA, USA; K. Lisa Yang Center for Conservation Bioacoustics, Cornell University, Ithaca, NY, USA; Department of Fish, Wildlife and Conservation Biology, Colorado State University, Corvallis, OR, USA; Sea Mammal Research Unit Scottish Oceans Institute, University of St Andrews, St Andrews, U.K.; Department of Ocean and Resources Engineering, University of Hawaii, Honolulu, HI, USA; Spotify, Inc., New York, NY, USA

Abstract:
Whistle contour extraction aims to derive animal whistles from time-frequency spectrograms as polylines. For toothed whales, whistle extraction results can serve as the basis for analyzing animal abundance, species identity, and social activities. During the last few decades, as long-term recording systems have become affordable, automated whistle extraction algorithms were proposed to process large volumes of recording data. Recently, a deep learning-based method demonstrated superior performance in extracting whistles under varying noise conditions. However, training such networks requires a large amount of labor-intensive annotation, which is not available for many species. To overcome this limitation, we present a framework of stage-wise generative adversarial networks (GANs), which compile new whistle data suitable for deep model training via three stages: generation of background noise in the spectrogram, generation of whistle contours, and generation of whistle signals. By separating the generation of different components in the samples, our framework composes visually promising whistle data and labels even when few expert annotated data are available. Regardless of the amount of human-annotated data, the proposed data augmentation framework leads to a consistent improvement in performance of the whistle extraction model, with a maximum increase of 1.69 in the whistle extraction mean F1-score. Our stage-wise GAN also surpasses one single GAN in improving whistle extraction models with augmented data.

Abstract:
Unsupervised Domain Adaptation (UDA) aims to leverage knowledge of a well-labeled source domain to learn an effective classifier for an unlabeled target domain. However, a common scenario in real-world applications is that the target domain contains unknown categories that are not observed in the source domain. This setting is termed as open set domain adaptation (OSDA). Most existing approaches of OSDA can only classify known classes well but fail to recognize unknown samples effectively. In this paper, we propose an effective method, named manifold regularized joint transfer (MRJT), for OSDA. MRJT learns new feature representations by simultaneously reducing distribution discrepancy between domains, increasing compactness of within-class, discriminating different known classes, and distinguishing the unknown from the known. The learned new features are projected onto reproducing kernel Hilbert space. In this space, a weighted structural risk minimization method is integrated with manifold regularization to utilize geometric information sufficiently to learn an effective classifier. Extensive experimental results on four real-world datasets verify the superiority of our method. It can not only classify known samples into the right known classes but also recognize unknown samples effectively.

Abstract:
LiDAR-assisted visual odometry (VO) is a widely used solution for pose estimation and mapping. However, most existing LiDAR-assisted VO systems can suffer from 1) a lack of distinctive and evenly distributed pixels for tracking, due to the sparsity of LiDAR points and the limited field-of-view (FOV) overlap between the camera and the LiDAR, and 2) nontrivial errors when processing LiDAR point clouds. To address these problems, we present CR-LDSO, a direct sparse LiDAR-assisted VO whose core parts are: 1) a novel cloud reusing method with point extraction/re-extraction to increase both the camera-LiDAR FOV overlap and the number of high-quality tracking pixels, and 2) an occlusion removal method that excludes mismatched pixels caused by occluded 3D objects from sliding-window optimization, together with a point extraction strategy without depth interpolation. Extensive experimental results on public datasets demonstrate the superiority of our method over existing state-of-the-art methods.

Abstract:
Research on image dehazing has made the need for a suitable dehazed image quality assessment (DIQA) method even more urgent. The performance of existing DIQA methods heavily relies on handcrafted haze-related features. Since hazy images with uneven haze density distributions will result in uneven quality distributions after dehazing, the manually extracted feature expression is neither accurate nor robust. In this paper, we design a deep CNN-based DIQA method without a handcrafted feature requirement. Specifically, we propose a blind dehazed image quality assessment model (BDQM), which consists of three components: image preprocessing, a haze-related feature extraction network (HFNet), and an improved regression network (IRNet). In HFNet, we design a perceptual information enhancement (PIE) module to learn powerful feature representations and enhance network capability according to channel attention, multiscale convolution and residual concatenation. IRNet aims to aggregate all patch information for the quality prediction of the whole image, where the effect of inhomogeneous distortion from the dehazing procedure is attenuated via a specifically designed patch attention (PA) mechanism. Experimental results on benchmark datasets demonstrate the effectiveness and superiority of the proposed network architecture over state-of-the-art methods.

Abstract:
Multimodal feature fusion aims to draw complementary information from different modalities to achieve better performance. Contrastive learning is effective at discriminating coexisting semantic features (positive) from irrelevant ones (negative) in multimodal signals. However, positive and negative pairs learn at different rates, which undermines the overall performance of multimodal contrastive learning (MCL). Moreover, the learned representation model is not robust, as MCL utilizes supervision signals from potentially noisy modalities. To address these issues, a novel multimodal contrastive learning objective, Pace-adaptive and Noise-resistant Noise-Contrastive Estimation (PN-NCE), is proposed for multimodal fusion by directly using unimodal features. PN-NCE encourages the positive and negative pairs to reach their optimal similarity scores adaptively and is less susceptible to noisy inputs during training. A theoretical analysis of its robustness is provided. Maximizing modality-invariant information in the fused representation is expected to benefit the overall performance; therefore, an estimator that measures the difference between the fused representation and its unimodal representations is integrated into MCL to obtain a more modality-invariant fusion output. The proposed method is model-agnostic and can be adapted to various multimodal tasks. It also suffers less performance degradation when the number of training samples is reduced at the linear probing stage. With different networks and modality inputs from three multimodal datasets, experimental results show that PN-NCE achieves consistent improvements over previous state-of-the-art approaches.

Abstract:
Open set action recognition (OSAR) is an emerging research area that aims to simultaneously classify videos from known classes and reject videos from unknown classes. Existing methods rarely consider the open set data distribution and the spatial-temporal relations of video subsequences. The recently proposed Capsule Network (CapsNet) has shown robust performance in many fields, especially image recognition. However, the current CapsNet has not been directly applied to the OSAR task since it cannot explicitly consider the data distribution of known and unknown classes along with the spatial-temporal relations of videos. This paper proposes the Spatial-Temporal Exclusive Capsule Network (STE-CapsNet) to solve these problems in the OSAR task. STE-CapsNet introduces a temporal-spatial routing mechanism to jointly capture the spatial-temporal information of the videos. Furthermore, the exclusive capsules are learned with a dot-product routing mechanism to limit the data distribution of the closed set and open set and reduce the open set risk for OSAR. Extensive experimental results demonstrate that our proposed approach performs favorably compared with state-of-the-art methods on three standard datasets, which verifies its effectiveness and generalization ability.

Abstract:
The amount of multimedia content shared every day, combined with the level of realism reached by recent fake-generating technologies, threatens to impair the trustworthiness of online information sources. The process of uploading and sharing data tends to hinder standard media forensic analyses, since multiple re-sharing steps progressively hide the traces of past manipulations. At the same time, though, new traces are introduced by the platforms themselves, enabling the reconstruction of the sharing history of digital objects, with possible applications in information flow monitoring and source identification. In this work, we propose a supervised framework for the reconstruction of image sharing chains on social media platforms. The system is structured as a cascade of backtracking blocks, each of which traces back one step of the sharing chain at a time. Blocks are designed as ensembles of classifiers trained to analyse the input image independently of one another by leveraging different feature representations that describe both the content and the container of the media object. Individual decisions are then combined by a late fusion strategy. Results highlight the advantages of employing multiple clues, which allow accurate tracing back of up to three steps along the sharing chain.
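
The abstract does not specify which late fusion rule is used. As a minimal sketch only, assuming each classifier in a backtracking block outputs a class-probability vector over candidate platforms, a weighted average followed by an argmax is one common way to combine independent decisions. The function name `late_fusion` and the optional `weights` parameter are hypothetical and are not taken from the paper.

```python
def late_fusion(prob_vectors, weights=None):
    """Fuse per-classifier probability vectors by weighted averaging.

    prob_vectors: list of lists, one probability vector per classifier.
    weights: optional per-classifier weights (assumed, defaults to uniform).
    Returns (index of the winning class, fused probability vector).
    """
    if weights is None:
        weights = [1.0] * len(prob_vectors)
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    # Weighted mean of the probabilities assigned to each class.
    fused = [
        sum(w * p[c] for w, p in zip(weights, prob_vectors)) / total
        for c in range(n_classes)
    ]
    # The fused decision is the class with the highest averaged score.
    best = max(range(n_classes), key=lambda c: fused[c])
    return best, fused
```

In a cascade, the fused decision of one block would select the platform assumed for the last sharing step before the image is passed to the next block; that chaining is likewise an illustrative reading of the abstract, not a description of the authors' implementation.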

Abstract:
Text-based visual question answering (VQA) requires reading and understanding text in an image to correctly answer a given question. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image into the VQA model without considering the contextual information of OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains text and scene objects. Then, it understands the question, OCRed text and objects in the context of the scene, and further mines the relationships among them. Finally, it answers the related text for the given question through text semantic matching and reasoning. We evaluate RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind RUArt’s effectiveness. Experimental results demonstrate that our method can effectively explore the contextual information of the text and mine the stable relationships between the text and objects.

Abstract:
Fusion and interaction of multimodal features are essential for video question answering. Structural information composed of the relationships between different objects in videos is very complex, which restricts understanding and reasoning. In this paper, we propose a quaternion hypergraph network (QHGN) for multimodal video question answering, to simultaneously involve multimodal features and structural information. Since quaternion operations are suitable for multimodal interactions, four components of the quaternion vectors are applied to represent the multimodal features. Furthermore, we construct a hypergraph based on the visual objects detected in the video. Most importantly, the quaternion hypergraph convolution operator is theoretically derived to realize multimodal and relational reasoning. Question and candidate answers are embedded in quaternion space, and a Q&A reasoning module is creatively designed for selecting the answer accurately. Moreover, the unified framework can be extended to other video-text tasks with different quaternion decoders. Experimental evaluations on the TVQA dataset and DramaQA dataset show that our method achieves state-of-the-art performance.

Abstract:
With the development of online real estate trading platforms, multi-modal housing trading data, including structural information, location, and interior image data, are being accumulated. The accurate appraisal of real estate makes sense for government officials, urban policymakers, real estate sellers, and personal purchasers. In this study, we propose an interpretable multi-modal stacking-based ensemble learning (IMSEL) method that deals with various modalities for real estate appraisals. We crawl the structural and image data of real estate in Chengdu city, China from the nation's largest real estate transaction platform with the location information, including public services, within 2 km of the real estate using Baidu map. We then compare the predictive results from IMSEL with those from previous state-of-art methods in the literature in terms of the root mean square error, mean absolute percentage error, mean absolute error, and coefficient of determination (R2). The comparison results show that IMSEL outperformed the other methods. We verified the improvement of introducing a data transformation strategy and deep visual features through a 10-fold cross-validation. We also discuss the managerial implications of our research findings.
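
The four evaluation criteria named above follow their standard textbook definitions. The sketch below computes them for a list of true and predicted prices; the function name `regression_metrics` and the sample values in the usage note are illustrative assumptions and are not data from the study.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAPE (in percent), MAE and R^2 for two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE assumes no true value is zero (reasonable for prices).
    mape = 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "mape": mape, "mae": mae, "r2": r2}
```

For example, with hypothetical true prices `[100, 200, 300, 400]` and predictions `[110, 190, 310, 390]`, every absolute error is 10, so both MAE and RMSE equal 10, while R2 is 1 - 400/50000 = 0.992.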

Abstract:
Video stabilization is the process of improving video quality by removing the annoying fluctuating motion caused by camera jitter. A key issue for a successful solution is temporal adaptability to motion and overall robustness with respect to different motion types. However, most previous methods produce stabilized videos that are not motion adaptive. In other words, for complex shaky videos, they under-smooth slow motion segments and over-smooth rapid motion segments. To overcome these drawbacks, we propose a novel video stabilization approach using a motion morphological component (MMC) decomposition. Specifically, the observed motion is decomposed into three MMCs: low-frequency smoothed (LFS) motion, high-frequency compensatory (HFC) motion, and shaky motion. LFS motion helps to largely stabilize videos, and HFC motion helps to recover missing motion to deal with over-smoothing. Subsequently, we present an MMC-based model to retrieve the desired smoothed motion, in which weighted nuclear norm and autoregression priors are used for LFS motion, while a sparsity prior is adopted for HFC motion. In addition, we design an adaptive weight setting scheme to detect rapid motions and to calculate the optimal weights. Finally, we develop a stabilization algorithm under the Alternating Direction Method of Multipliers (ADMM) framework. Experimental results demonstrate that our method achieves high-quality results compared with other state-of-the-art stabilization methods in terms of robustness and efficiency, both quantitatively and qualitatively.

Abstract:
Light field imaging can simultaneously capture the intensity and direction information of light rays in the real world. A light field image (LFI) with four-dimensional (4D) data suffers from quality degradation in the process of compression, reconstruction and processing. How to evaluate the visual quality of an LFI remains an open question. This paper proposes a no-reference LFI quality assessment metric based on a high-dimensional sparse transform. Firstly, the LFI's sub-aperture gradient image array (SAGIA), which is still a 4D signal, is generated by high-pass filtering between adjacent sub-aperture images (SAIs). Then, SAGIA is transformed with the 4D discrete cosine transform (4D-DCT). The 4D-DCT coefficients of SAGIA can characterize the angular and spatial information of the LFI. The logarithmic amplitudes of the coefficients at the same position of SAGIA's transformed 4D blocks are averaged as the coefficient energy. Subsequently, the 4D-DCT coefficients of SAGIA are divided into spatial-angular frequency bands and spatial-angular orientation bands, and the corresponding energy features are extracted by aggregating the coefficient energy of the same band. In addition, the coefficients' amplitudes at the same position of the blocks are fitted by the Weibull distribution. Then, the fitted parameters of each position are concatenated and reduced with principal component analysis to obtain compact features. Finally, the extracted features are pooled to predict the visual quality of the distorted LFIs. The experimental results on three LFI databases demonstrate that the proposed method is more consistent with subjective evaluation than state-of-the-art image quality assessment methods and LFI quality assessment methods.

Abstract:
We focus on unsupervised representation learning for skeleton-based action recognition. Existing unsupervised approaches usually learn action representations by motion prediction, but they lack the ability to fully learn inherent semantic similarity. In this paper, we propose a novel framework named Prototypical Contrast and Reverse Prediction (PCRP) to address this challenge. Different from plain motion prediction, PCRP performs reverse motion prediction based on an encoder-decoder structure to extract more discriminative temporal patterns, and derives action prototypes by clustering to explore the inherent action similarity within the action encoding. Specifically, we regard action prototypes as latent variables and formulate PCRP as an expectation-maximization (EM) task. PCRP iteratively runs (1) an E-step that determines the distribution of action prototypes by clustering the action encoding from the encoder while estimating the concentration around prototypes, and (2) an M-step that optimizes the model by minimizing the proposed ProtoMAE loss, which simultaneously pulls the action encoding closer to its assigned prototype by contrastive learning and performs the reverse motion prediction task. Besides, sorting can also serve as a temporal task similar to reverse prediction in the proposed framework. Extensive experiments on the N-UCLA, NTU 60, and NTU 120 datasets show that PCRP outperforms mainstream unsupervised methods and even achieves superior performance over many supervised methods. The code is available at: https://github.com/LZUSIAT/PCRP.

Abstract:
Personalized image aesthetic assessment (PIAA) has recently become a hot topic due to its wide applications, such as photography, film, television, e-commerce, and fashion design. This task is more seriously affected by subjective factors and by the samples provided by users. In order to acquire a precise personalized aesthetic distribution from a small number of samples, we propose a novel user-guided personalized image aesthetic assessment framework. This framework leverages user interactions to retouch and rank images for aesthetic assessment based on deep reinforcement learning (DRL), and generates a personalized aesthetic distribution that is more in line with the aesthetic preferences of different users. It mainly consists of two stages. In the first stage, a personalized aesthetic ranking is generated by interactive image enhancement and manual ranking, while two policy networks are trained. These two networks are trained iteratively and alternately to facilitate the final personalized aesthetic assessment. In the second stage, the modified images are labeled with aesthetic attributes by a style-specific classifier, and then the personalized aesthetic distribution is generated based on the multiple aesthetic attributes of these images, which better conforms to the aesthetic preferences of users. Compared with other existing methods, our approach achieves new state-of-the-art results in the task of personalized image aesthetic assessment on the public AVA and FLICKR-AES datasets.

Abstract:
Compared with uni-modal biometrics systems, multimodal biometrics systems, which use multiple sources of information to establish an individual’s identity, have received considerable attention recently. However, most traditional multimodal biometrics techniques extract features from each modality independently, ignoring the implicit associations between different modalities. In addition, most existing work uses hand-crafted descriptors that struggle to capture the latent semantic structure. This paper proposes to learn sparse and discriminative multimodal feature codes (SDMFCs) for multimodal finger recognition, which simultaneously take into account the specific and common information both across and within modalities. Specifically, given the multimodal finger images, we first establish the local difference matrix to capture informative texture features in local patches. Then, we jointly learn discriminative and compact binary codes by constraining the observations from multiple modalities. Finally, we develop a novel SDMFC-based multimodal finger recognition framework, which integrates the local histograms of each division block in the learned binary codes for classification. Experimental results on three commonly used finger databases demonstrate the effectiveness and robustness of the proposed framework in multimodal biometrics tasks.

Abstract:
Automatically generating the “impression” section of a radiology report given the “findings” section can summarize as much salient information of the “findings” section as possible, thus promoting more effective communication between radiologists and referring physicians. To significantly reduce the workload of radiologists, we develop and evaluate a novel framework of abstractive summarization methods to automatically generate the “impression” section of chest radiology reports. Despite recent advancements in the natural language processing (NLP) field, such as BERT and its variants, existing abstractive summarization models and methods cannot be directly applied to radiology reports, partly due to domain-specific radiology terminology. In response, we develop a pre-trained language model in the chest radiology domain, named ChestXRayBERT, to solve the problem of automatically summarizing chest radiology reports. Specifically, we first collect radiology-related scientific papers as a pre-training corpus and pre-train ChestXRayBERT on it. Then, an abstractive summarization model is proposed, which consists of the pre-trained ChestXRayBERT and a Transformer decoder. Finally, the model is fine-tuned on chest X-ray reports for the abstractive summarization task. When evaluated on the publicly available OPEN-I and MIMIC-CXR datasets, our proposed model achieves significant improvement compared with other neural network-based abstractive summarization models. In general, the proposed ChestXRayBERT demonstrates the feasibility and promise of tailoring and extending advanced NLP techniques to the domain of medical imaging and radiology, as well as to the broader biomedicine and healthcare fields in the future.

Abstract:
Deep learning methods have shown outstanding performance in many applications, including single-image super-resolution (SISR). With residual connection architecture, deeply stacked convolutional neural networks provide a substantial performance boost for SISR, but their huge parameters and computational loads are impractical for real-world applications. Thus, designing lightweight models with acceptable performance is one of the major tasks in current SISR research. The objective of lightweight network design is to balance a computational load and reconstruction performance. Most of the previous methods have manually designed complex and predefined fixed structures, which generally required a large number of experiments and lacked flexibility in the diversity of input image statistics. In this paper, we propose a dynamic residual self-attention network (DRSAN) for lightweight SISR, while focusing on the automated design of residual connections between building blocks. The proposed DRSAN has dynamic residual connections based on dynamic residual attention (DRA), which adaptively changes its structure according to input statistics. Specifically, we propose a dynamic residual module that explicitly models the DRA by finding the interrelation between residual paths and input image statistics, as well as assigning proper weights to each residual path. We also propose a residual self-attention (RSA) module to further boost the performance, which produces 3-dimensional attention maps without additional parameters by cooperating with residual structures. The proposed dynamic scheme, exploiting the combination of DRA and RSA, shows an efficient trade-off between computational complexity and network performance. Experimental results show that the DRSAN performs better than or comparable to existing state-of-the-art lightweight models for SISR.

Abstract:
Multi-view clustering, which appropriately integrates information from multiple sources to reveal the inherent structure of data, is gaining traction in clustering. Though existing procedures have yielded satisfactory results, we observe that they neglect the inherent local structure in the base kernels, which may adversely affect clustering. To solve this problem, we introduce LF-MKC-LKA, a simple yet effective late fusion multiple kernel clustering approach with local kernel alignment maximisation. In particular, we first determine the k nearest neighbours of each sample in the average kernel space and record the information in a nearest neighbour indicator matrix. Then, the nearest neighbour indicator matrix is used to generate the local structure matrix of each sample. The local kernels of each view are then generated using the local structure matrix, retaining only the highly confident local similarities for learning the intrinsic global manifold of the data. They can also be utilised to preserve the block diagonal structure and improve the robustness of the underlying kernels against noise. We input the local kernels of each view into the kernel k-means (KKM) algorithm and obtain the local base partitions. Finally, we use a three-step iterative optimization approach to maximize the alignment of the consensus partition using the base partitions and a regularisation term. Extensive experiments on 11 multi-kernel benchmark datasets show that the proposed LF-MKC-LKA is effective and efficient. A number of experiments are also designed to demonstrate the fast convergence, excellent performance, robustness and low parameter sensitivity of the algorithm. Our code can be found at https://github.com/TiejianZhang/TMM21-LF-MKC-LKA.
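
The first steps described above (average kernel, k nearest neighbours, indicator matrix, local kernel) can be sketched as follows. This is an illustration under stated assumptions rather than the released implementation: kernels are plain nested lists, neighbours are ranked by kernel similarity (each sample counts itself), and the local kernel is formed by simple element-wise masking, which is one plausible reading of "retaining only the highly confident local similarities". The function names are hypothetical.

```python
def average_kernel(kernels):
    """Element-wise mean of the base kernel matrices (one n x n matrix per view)."""
    m = len(kernels)
    n = len(kernels[0])
    return [[sum(K[i][j] for K in kernels) / m for j in range(n)] for i in range(n)]


def neighbor_indicator(avg, k):
    """Binary n x n matrix: entry (i, j) is 1 if j is among the k most similar
    samples to i under the average kernel (assumption: larger value = closer)."""
    n = len(avg)
    indicator = [[0] * n for _ in range(n)]
    for i in range(n):
        ranked = sorted(range(n), key=lambda j: avg[i][j], reverse=True)
        for j in ranked[:k]:
            indicator[i][j] = 1
    return indicator


def local_kernel(kernel, indicator):
    """Keep only the similarities flagged by the indicator (assumed masking rule)."""
    n = len(kernel)
    return [[kernel[i][j] * indicator[i][j] for j in range(n)] for i in range(n)]
```

Each view's local kernel would then be passed to kernel k-means to obtain a base partition; the late fusion and three-step optimization stages are not sketched here because the abstract does not give their formulas.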

Abstract:
Early activity prediction, which aims to recognize class labels before actions are fully performed, is a very challenging task since partially observed action sequences contain insufficient class-discrimination information, and thus, many partial action sequences belonging to different categories may look very similar. Therefore, in this paper, we propose a novel guidance aware network (GA-Net) to boost the ability to distinguish different activities in diversified partially observed action sequences via metric learning. To mitigate the similarity problem of action segments at very early stages, the proposed guided metric learning module (GMLM) is able to encourage the feature extractor to mine class-discriminative information given partially observed sequences. Specifically, the GMLM is able to minimize the intraclass distance with a full-length guided direction approach and maximize the difference between interclass categories with different observation ratios. To enhance the similarities between the partial- and full-length sequences in the same action categories, we further introduce a distribution alignment module (DAM) that employs full-length guidance to pull the partially observed features closer to the global features. We evaluate our proposed method on three public human activity datasets and achieve competitive results compared with the state-of-the-art approaches.

Abstract:
Multi-object tracking (MOT) is an essential task in the computer vision field. With the fast development of deep learning technology in recent years, MOT has achieved great improvement. However, some challenges still remain, such as sensitivity to occlusion, instability under different lighting conditions, and a lack of robustness to deformable objects, all of which cause incorrect temporal associations. To address these challenges, which are common to most existing trackers, we propose a tracklet booster (TBooster) algorithm to correct the association errors produced by existing trackers. TBooster corrects association errors in two steps: it splits tracklets at potential ID-change positions and then connects multiple tracklets into one if they are from the same object. To achieve this goal, TBooster consists of two components, i.e., Splitter and Connector. In Splitter, an architecture with stacked temporal dilated convolution blocks is employed for splitting position prediction via a label smoothing strategy with adaptive Gaussian kernels. In Connector, a multi-head self-attention-based encoder is exploited for tracklet embedding, which is further used to connect tracklets into full tracks. We conduct extensive experiments on the MOT17 and MOT20 benchmark datasets and achieve promising results. Combined with the proposed tracklet booster, existing trackers can achieve large improvements on the IDF1 score, which shows the effectiveness of the proposed TBooster.
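
To illustrate the idea of Gaussian label smoothing for split-position prediction, the sketch below turns a single ground-truth split index into a soft per-frame target that peaks at the split and decays around it. The abstract does not say how the kernel width adapts, so scaling the standard deviation with the tracklet length (`sigma_ratio`, `min_sigma`) is purely an assumption made for this example, as is the function name.

```python
import math


def smoothed_split_labels(length, split_index, sigma_ratio=0.1, min_sigma=1.0):
    """Soft targets for a tracklet of `length` frames with a split at `split_index`.

    Assumption: the Gaussian width grows with tracklet length, so longer
    tracklets tolerate a larger localisation error around the split.
    """
    sigma = max(min_sigma, sigma_ratio * length)
    return [
        math.exp(-((t - split_index) ** 2) / (2.0 * sigma ** 2))
        for t in range(length)
    ]
```

With these hypothetical defaults, a 10-frame tracklet split at frame 4 gets a target of 1.0 at frame 4 and exp(-0.5) at frames 3 and 5, which penalises near-miss predictions less than a one-hot label would.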

Abstract:
Person re-IDentification (re-ID) under various occlusions has been a long-standing challenge, as person images with different types of occlusions often suffer from misalignment in image matching and ranking. Most existing methods tackle this challenge by aligning spatial features of body parts according to external semantic cues or feature similarities, but this alignment approach is complicated and sensitive to noise. We design DRL-Net, a disentangled representation learning network that handles occluded re-ID without requiring strict person image alignment or any additional supervision. Leveraging transformer architectures, DRL-Net achieves alignment-free re-ID via global reasoning over local features of occluded person images. It measures image similarity by automatically disentangling the representation of undefined semantic components, e.g., human body parts or obstacles, under the guidance of semantic preference object queries in the transformer. In addition, we design a decorrelation constraint in the transformer decoder and impose it on the object queries so that they focus better on different semantic components. To better eliminate interference from occlusions, we design a contrast feature learning technique (CFL) for better separation of occlusion features and discriminative ID features. Extensive experiments on occluded and holistic re-ID benchmarks show that DRL-Net consistently achieves superior re-ID performance and outperforms the state of the art by large margins on occluded re-ID datasets.

Abstract:
Multi-drone multi-target tracking aims at collaboratively detecting and tracking targets across multiple drones and associating the identities of objects from different drones, which can overcome the shortcomings of single-drone object tracking. To address the critical challenges of identity association and target occlusion in multi-drone multi-target tracking tasks, we collect an occlusion-aware multi-drone multi-target tracking dataset named MDMT. It contains 88 video sequences with 39,678 frames, including 11,454 different IDs of persons, bicycles, and cars. The MDMT dataset comprises 2,204,620 bounding boxes, of which 543,444 bounding boxes contain target occlusions. We also design a multi-device target association score (MDA) as the evaluation criterion for the ability of cross-view target association in multi-device tracking. Furthermore, we propose a Multi-matching Identity Authentication network (MIA-Net) for the multi-drone multi-target tracking task. The local-global matching algorithm in MIA-Net discovers the topological relationship of targets across drones, efficiently solves the problem of cross-drone association, and effectively complements occluded targets with the advantage of multiple drone view mapping. Extensive experiments on the MDMT dataset validate the effectiveness of our proposed MIA-Net for the task of identity association and multi-object tracking with occlusions.

Abstract:
Occlusions are universal disruptions constantly present in the real world. Especially for sparse representations, such as human skeletons, a few occluded points might destroy the geometrical and temporal continuity, critically affecting the results. Yet, research on data-scarce recognition from skeleton sequences, such as one-shot action recognition, does not explicitly consider occlusions despite their everyday pervasiveness. In this work, we explicitly tackle body occlusions for Skeleton-based One-shot Action Recognition (SOAR). We mainly consider two occlusion variants: 1) random occlusions and 2) more realistic occlusions caused by diverse everyday objects, which we generate by projecting existing IKEA 3D furniture models into the camera coordinate system of the 3D skeletons with different geometric parameters (e.g., rotation and displacement). We leverage the proposed pipeline to blend out portions of skeleton sequences of three popular action recognition datasets (NTU-120, NTU-60 and Toyota Smart Home) and formalize the first benchmark for SOAR from partially occluded body poses. This is the first benchmark that considers occlusions for data-scarce action recognition. Another key property of our benchmark is the more realistic occlusions generated by everyday objects, as even in standard recognition from 3D skeletons, only randomly missing joints have been considered. We re-evaluate existing state-of-the-art frameworks for SOAR in the light of this new task and further introduce Trans4SOAR, a new transformer-based model that leverages three data streams and a mixed attention fusion mechanism to alleviate the adverse effects caused by occlusions. While our experiments demonstrate a clear decline in accuracy with missing skeleton portions, this effect is smaller with Trans4SOAR, which outperforms other architectures on all datasets. Although we specifically focus on occlusions, Trans4SOAR additionally yields state-of-the-art results in standard SOAR without occlusion, surpassing the best published approach by 2.85% on NTU-120.

Abstract:
Few-shot face recognition under occlusion (FSFRO) aims to recognize novel subjects given only a few, probably occluded face images, and it is challenging and common in real-world scenarios. Unknown occlusions may deteriorate the class prototypes, while an occluded image in the support set may be critical for recognition if the query image is occluded. This motivates us to propose a novel Two-stream Prototype Learning Network (TSPLN) for FSFR under occlusions by simultaneously considering the quality of support images and their relevance to the query image. Specifically, we design a two-stream architecture, which mainly consists of a support-centered stream and query-centered stream, to learn the optimal class prototypes. The former stream is to reduce the negative impact of occluded images on the prototype. This is achieved by exploring the similarities between different images in the support set. In the query-centered stream, we exploit the relevance between the query and support set based on feature alignment (FA). We conduct extensive experiments on two popular datasets: CASIA-WebFace and RMFRD. The experimental results show that our proposed method achieves the state-of-the-art performance for occluded face recognition in the few-shot setting.

Abstract:
A surround-view system (SVS) usually consists of four wide-angle fisheye cameras mounted around the vehicle to sense the surrounding environment. From the images synchronously captured by cameras, a top-down surround-view can be synthesized, on the premise that both intrinsics and extrinsics of the cameras have been calibrated. At present, the intrinsic calibration approach is relatively complete and can be pipelined, while the extrinsic calibration is still immature. To fill such a research gap, we propose a novel extrinsic self-calibration scheme which follows a weakly supervised framework, namely WESNet (Weakly-supervised Extrinsic Self-calibration Network). The training of WESNet consists of two stages. First, we utilize the corners in a few calibration site images as the weak supervision to roughly optimize the network by minimizing the geometric loss. Then, after the convergence in the first stage, we additionally introduce a self-supervised photometric loss term that can be constructed by the photometric information from natural images for further fine-tuning. Besides, to support training, we collected a total of 19,078 groups of synchronously captured fisheye images under various environmental conditions. To our knowledge, thus far this is the largest surround-view dataset containing original fisheye images. By means of learning prior knowledge from the training data, WESNet takes the original fisheye images synchronously collected as the input, and directly yields extrinsics end-to-end with little labor cost. Its efficiency and efficacy have been corroborated by extensive experiments conducted on our collected dataset. To make our results reproducible, source code and the collected dataset have been released.

Abstract:
Although many impressive works on learning-based camera ego-motion estimation methods have been proposed recently, most of them promote the accuracy of camera pose estimation by various sequential learning with loop closure optimization, while neglecting the improvement of PoseNet itself. In this paper, we focus on the coupling of rotation and translation in ego-motion estimation, and design a cascade decoupling structure to separately learn the rotation and translation of camera relative motion between adjacent frames. Meanwhile, a rigid-aware unsupervised learning framework with an iterative pose refinement scheme is proposed for camera ego-motion estimation. It can disambiguate rigid motion and deformations in dynamic scenarios by jointly learning optical flow, stereo disparity and camera pose. In evaluation experiments on publicly available datasets, our method is superior to state-of-the-art unsupervised methods and achieves results comparable with supervised ones.

Abstract:
In recent years, various face-landmark datasets have been published. Intuitively, it is significant to integrate multiple labeled datasets to achieve higher performance. Due to the different annotation schemes of datasets, it is hard to directly train models using them together. Although numerous efforts have been made in the joint use of datasets, there remain three shortages in previous methods, i.e., additional computation, limitation of the markups scheme, and limited support for the regression method. To solve the above issues, we propose a novel Alternating Training Framework (ATF), which leverages the similarity and diversity across multiple datasets for a more robust detector. ATF mainly contains two sub-modules: Alternating Training with Decreasing Proportions (ATDP) and Mixed Branch Loss ($\mathcal{L}_{MB}$). In particular, ATDP trains multiple datasets simultaneously via a weakly supervised way to take advantage of the diversity among them, and $\mathcal{L}_{MB}$ utilizes similar landmark pairs to constrain different branches of the corresponding datasets. Besides, we extend the framework to easily handle three situations: single target detector, joint detector, and novel detector. Extensive experiments demonstrate the effectiveness of our framework for both heatmap-based and direct coordinate regression. Moreover, we have achieved a joint detector that outperforms state-of-the-art methods on each benchmark.

Abstract:
In recent years, graph convolutional networks (GCNs) play an increasingly critical role in skeleton-based human action recognition. However, most GCN-based methods still have two main limitations: 1) They only consider the motion information of the joints or process the joints and bones separately, which are unable to fully explore the latent functional correlation between joints and bones for action recognition. 2) Most of these works are performed in the supervised learning way, which heavily relies on massive labeled training data. To address these issues, we propose a semi-supervised skeleton-based action recognition method which has been rarely exploited before. We design a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder to achieve semi-supervised learning. Specifically, the correlation-driven joint-bone fusion graph convolution (CD-JBF-GC) can explore the motion transmission between the joint stream and the bone stream, so as to promote both streams to learn more discriminative feature representations. The pose prediction based auto-encoder in the self-supervised training fashion allows the network to learn motion representation from the unlabeled data, which is essential for action recognition. Extensive experiments on two popular datasets, i.e. NTU-RGB+D and Kinetics-Skeleton, demonstrate that our model achieves the state-of-the-art performance for semi-supervised skeleton-based action recognition and is also useful for fully-supervised methods.

Abstract:
State-of-the-art approaches for crowd counting resort to deep neural networks to predict density maps. However, counting people in congested scenes remains a challenging task because the presence of drastic scale variation, density inconsistency, and complex background can seriously degrade their counting accuracy. To battle the ingrained issue of accuracy degradation, in this paper, we propose a novel and powerful network called Scale Tree Network (STNet) for accurate crowd counting. STNet consists of two key components: a Scale-Tree Diversity Enhancer and a Multi-level Auxiliator. Specifically, the Diversity Enhancer is designed to enrich scale diversity, which alleviates limitations of existing methods caused by insufficient level of scales. A novel tree structure is adopted to hierarchically parse coarse-to-fine crowd regions. Furthermore, a simple yet effective Multi-level Auxiliator is presented to aid in exploiting generalisable shared characteristics at multiple levels, allowing more accurate pixel-wise background cognition. The overall STNet is trained in an end-to-end manner, without the need for manually tuning loss weights between the main and the auxiliary tasks. Extensive experiments on five challenging crowd datasets demonstrate the superiority of the proposed method.
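To make the density-map formulation in the abstract above concrete: in density-map-based counting, the predicted count for an image is obtained by summing the predicted map. The sketch below is only a minimal illustration of that readout step, not STNet itself. The function names are hypothetical, and the mean absolute error is used here as an assumed evaluation measure that the abstract does not name.

```python
import numpy as np


def count_from_density(density_map: np.ndarray) -> float:
    """Read the crowd count out of a density map by summing all pixels."""
    return float(density_map.sum())


def mean_absolute_error(pred_maps, gt_counts) -> float:
    """Average absolute gap between summed density maps and true counts.

    Assumption: MAE is chosen only for illustration; the abstract does
    not state which counting metric is reported.
    """
    pred_counts = np.array([count_from_density(m) for m in pred_maps])
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.abs(pred_counts - gt).mean())
```

A density map that places unit mass at each person's location therefore integrates to the number of people, which is why errors in background regions directly inflate the count.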

Abstract:
We introduce a new photographing guidance (PhotoHelper) for amateur photographers to enhance their portrait photo quality using deep feature retrieval and fusion. In our model, we comprehensively integrate empirical aesthetic rules, traditional machine learning algorithms and deep neural networks to extract different kinds of features in both color and space aspects. With these features, we build a modified random forest with a structured photograph collection to identify types of photos. We also define the composition matching score to measure the similarity between the given photo and the reference photo. By combining all of the above processes, a one-stop deep portrait photographing guidance is constructed to provide users with professional reference photographs that are similar to the current scene and automatically generate spatial composition guidance according to the user-selected reference photo. Experiments and evaluations show that the aesthetic quality of portrait photos can be significantly improved via the composition guidance of our photographing guidance approach.

Abstract:
Online and offline handwritten Chinese text recognition (HCTR) has been studied for decades. Early methods adopted oversegmentation-based strategies but suffered from low speed, insufficient accuracy, and high cost of character segmentation annotations. Recently, segmentation-free methods based on connectionist temporal classification (CTC) and attention mechanism have dominated the field of HCTR. However, people actually read text character by character, especially for ideograms such as Chinese. This raises the question: are segmentation-free strategies really the best solution to HCTR? To explore this issue, we propose a new segmentation-based method for recognizing handwritten Chinese text that is implemented using a simple yet efficient fully convolutional network. A novel weakly supervised learning method is proposed to enable the network to be trained using only transcript annotations; thus, the expensive character segmentation annotations required by previous segmentation-based methods can be avoided. Owing to the lack of context modeling in fully convolutional networks, we propose a contextual regularization method to integrate contextual information into the network during the training stage, which can further improve the recognition performance. Extensive experiments conducted on four widely used benchmarks, namely CASIA-HWDB, CASIA-OLHWDB, ICDAR2013, and SCUT-HCCDoc, show that our method significantly surpasses existing methods on both online and offline HCTR, and exhibits a considerably higher inference speed than CTC/attention-based approaches.

Abstract:
Live video traffic has been widely observed to vary significantly within short timescales. In order to manage such traffic dynamics of overlay live streaming, the Content Provider (CP) may deploy a set of geo-dispersed auto-scaling servers where the pay-as-you-go deployment cost is charged by the amount of resources used due to server uploading and data transmission between servers. To support geo-distributed user demands, we study a novel multi-origin multi-channel auto-scaling live streaming cloud that pushes each channel stream in the core network overlay as a tree covering the end servers who have local demand for the channel. The Origin-to-End (O2E) delay from an origin to an end server is due to the Server-to-Server (S2S) delays of the overlay links along the path. By optimizing the overlay of the core network, we seek to minimize the deployment cost and O2E delays of the channels (i.e., a bi-criteria problem), which can be equivalently phrased as minimizing the deployment cost while meeting certain given maximum O2E delay constraints. We formulate a realistic problem capturing the major cost and delay components, and show its NP-hardness. We propose Cost-optimized Multi-Origin Multi-Channel Overlay Streaming (COCOS), a novel, efficient and near-optimal bi-criteria approximation algorithm with proven approximation ratio. Trace-driven extensive experimental results based on real-world live streaming service data validate that COCOS outperforms other state-of-the-art schemes by a wide margin (cutting the cost in general by more than 50%).
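The delay model in the abstract above, where an O2E delay accumulates the S2S delays of the overlay links on the path from the origin, can be sketched as follows. This is a minimal illustration under stated assumptions, not the COCOS algorithm: the channel tree is given as a hypothetical `parent` mapping, link delays are simply added, and the helper names are invented for this example.

```python
from typing import Dict, Hashable, Iterable, Tuple

Server = Hashable


def o2e_delays(
    parent: Dict[Server, Server],
    s2s_delay: Dict[Tuple[Server, Server], float],
    origin: Server,
) -> Dict[Server, float]:
    """Return the origin-to-server delay for every server in one channel tree.

    parent[v] is the upstream server of v in the overlay tree, and
    s2s_delay[(u, v)] is the delay of overlay link u -> v.
    Assumption: the O2E delay is the plain sum of link delays on the path.
    """
    delays: Dict[Server, float] = {origin: 0.0}

    def resolve(node: Server) -> float:
        # Walk up towards the origin, caching results on the way back.
        if node not in delays:
            up = parent[node]
            delays[node] = resolve(up) + s2s_delay[(up, node)]
        return delays[node]

    for node in parent:
        resolve(node)
    return delays


def meets_delay_bound(
    delays: Dict[Server, float], end_servers: Iterable[Server], max_o2e: float
) -> bool:
    """Check the maximum-O2E-delay constraint for the given end servers."""
    return all(delays[s] <= max_o2e for s in end_servers)
```

With a feasibility check of this kind, the bi-criteria problem reduces to searching for the cheapest tree among those that pass the check.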

Abstract:
Colored glass, which is commonly seen in modern city life, often degrades images taken through it with co-occurring reflection and color bias due to its optical property of simultaneous transmission, reflection, and wavelength-selective absorption. Recovering the clean background behind colored glass is inherently challenging due to the mutual interference of two degradations within a single mixture observation, and has barely been specifically considered by existing image restoration methods. In this paper, we aim at realizing faithful background scene recovery for an image taken in front of colored glass. We first analyze the formation model of mixed degradations caused by colored glass, and propose a cooperative framework to address the mutual interference problem, featuring a novel glass color invariant loss and progressive refinement. Besides, we propose a data synthesis strategy for network training. Experimental results on our newly collected real-world dataset show that our proposed method achieves state-of-the-art performance.

Abstract:
A good distortion representation is crucial for the success of deep blind image quality assessment (BIQA). However, most previous methods do not effectively model the relationship between distortions or the distribution of samples with the same distortion type but different distortion levels. In this work, we start from the analysis of the relationship between perceptual image quality and distortion-related factors, such as distortion types and levels. Then, we propose a Distortion Graph Representation (DGR) learning framework for IQA, named GraphIQA, in which each distortion is represented as a graph, i.e., DGR. One can distinguish distortion types by learning the contrast relationship between these different DGRs, and can infer the ranking distribution of samples from different levels in a DGR. Specifically, we develop two sub-networks to learn the DGRs: a) Type Discrimination Network (TDN) that aims to embed DGR into a compact code for better discriminating distortion types and learning the relationship between types; b) Fuzzy Prediction Network (FPN) that aims to extract the distributional characteristics of the samples in a DGR and predicts fuzzy degrees based on a Gaussian prior. Experiments show that our GraphIQA achieves state-of-the-art performance on many benchmark datasets of both synthetic and authentic distortions.

Abstract:
Light field (LF) data are widely used in the immersive representations of the 3D world. To record the light rays along with different directions, an LF requires much larger storage space and transmission bandwidth than a conventional 2D image with similar spatial dimension. In this paper, we propose a novel framework for light field image compression that leverages graph learning and dictionary learning to remove structural redundancies between different views. Specifically, to significantly reduce the bit-rates, only a few key views are sampled and encoded, whereas the remaining non-key views are reconstructed via the graph adjacency matrix learned from the angular patch. Furthermore, dictionary-guided sparse coding is developed to compress the graph adjacency matrices and reduce the coding overheads. To our best knowledge, this paper is the first to achieve compact representation of cross-view structural information via adaptive learning on graphs. Experimental results demonstrate that the proposed framework achieves better performance than the standardized HEVC-based codec.

Abstract:
With the growth of Extended Reality (XR) and capturing devices, point cloud representation has become attractive to academics and industry. Point Cloud Compression (PCC) algorithms further promote numerous XR applications that may change our daily life. However, in the literature, PCC algorithms are often evaluated with heterogeneous datasets, metrics, and parameters, making the results hard to interpret. In this article, we propose an open-source benchmark platform called PCC Arena. Our platform is modularized in three aspects: PCC algorithms, point cloud datasets, and performance metrics. Users can easily extend PCC Arena in each aspect to fulfill the requirements of their experiments. To show the effectiveness of PCC Arena, we integrate seven PCC algorithms into PCC Arena along with six point cloud datasets. We then compare the algorithms on ten carefully selected metrics to evaluate the quality of the output point clouds. We further conduct a user study to quantify the user-perceived quality of rendered images that are produced by different PCC algorithms. Several novel insights are revealed in our comparison: (i) Signal Processing (SP)-based PCC algorithms are stable for different usage scenarios, but the trade-offs between coding efficiency and quality should be carefully addressed, (ii) Neural Network (NN)-based PCC algorithms have the potential to consume lower bitrates yet provide similar results to SP-based algorithms, (iii) NN-based PCC algorithms may generate artifacts and suffer from long running time, and (iv) NN-based PCC algorithms are worth more in-depth studies as the recently proposed NN-based PCC algorithms improve the quality and running time. We believe that PCC Arena can play an essential role in allowing engineers and researchers to better interpret and compare the performance of future PCC algorithms.

Abstract:
Few-shot video classification (video FSL), which learns classifiers for novel concepts from only a few samples, has gained increasing attention in the last few years. Existing methods rarely consider the local-global relation for video feature learning, which ultimately results in low discriminative ability. Recently, the capsule network (CapsNet) has shown considerable potential in local-global relation learning in the image analysis field. However, CapsNet cannot be directly applied in video FSL since it ignores the interaction between videos and has high computational complexity. In this paper, a dual-routing capsule graph neural network (DR-CapsGNN) is proposed to solve the above issues. The DR-CapsGNN leverages CapsNet and a graph neural network (GNN) to explore local-global relations and to preserve the detailed properties. Specifically, the CapsGNN is used to learn video relations and structural information to generate high-quality hierarchical capsules. Furthermore, a novel dual-routing mechanism, consisting of inter-video and intra-video routing, is designed to filter low-discriminative capsules from a holistic perspective and achieve high efficiency. Extensive experimental results demonstrate that our proposed approach performs favorably compared to state-of-the-art methods on two popular benchmarks.

Abstract:
The goal of multi-view learning is to learn latent patterns from various data sources. Most previous research has focused on fitting feature embeddings to target tasks. There is very limited research on the connection between feature representations and the hidden layers of neural networks. In this paper, a multi-view deep matrix factorization model is proposed to learn a shared feature representation. The proposed model automatically explores the most discriminative features of multi-view data and makes these features meet the requirements of specific applications. Here we explore the connection between deep learning and feature representations. First, the model constructs a scalable neural network with shared hidden layers for exploring a low-dimensional representation of all views. Second, the quality of the representation matrix is evaluated via relaxed graph regularization and evaluators to improve the feature representation capability of matrix factorization. Finally, the effectiveness of the proposed method is verified through comparative experiments with eight state-of-the-art multi-view clustering algorithms on eight real-world datasets.

Affiliations: Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing, China; Center for Mathematical Artificial Intelligence, Department of Mathematics, The Chinese University of Hong Kong, Hong Kong; Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan; Department of Mechanical and Control Engineering, Kyushu Institute of Technology, Kitakyushu, Japan; School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China

Abstract:
Real-time semantic segmentation, which can be visually understood as the pixel-level classification task on the input image, currently has broad application prospects, especially in the fast-developing fields of autonomous driving and drone navigation. However, the huge burden of calculation together with redundant parameters are still the obstacles to its technological development. In this article, we propose a Fast Bilateral Symmetrical Network (FBSNet) to alleviate the above challenges. Specifically, FBSNet employs a symmetrical encoder-decoder structure with two branches: a semantic information branch and a spatial detail branch. The Semantic Information Branch (SIB) is the main branch, with a semantic architecture that acquires the contextual information of the input image while obtaining a sufficient receptive field. The Spatial Detail Branch (SDB) is a shallow and simple network used to establish local dependencies of each pixel for preserving details, which is essential for restoring the original resolution during the decoding phase. Meanwhile, a Feature Aggregation Module (FAM) is designed to effectively combine the output of these two branches. Experimental results on Cityscapes and CamVid show that the proposed FBSNet can strike a good balance between accuracy and efficiency. Specifically, it obtains 70.9% and 68.9% mIoU along with the inference speed of 90 fps and 120 fps on these two test datasets, respectively, with only 0.62 million parameters on a single RTX 2080Ti GPU. The code is available at https://github.com/IVIPLab/FBSNet.

Abstract:
Learning models that can generalize to previously unseen domains to which we have no access is a fundamental yet challenging problem in machine learning. In this paper, we propose meta variational inference (MetaVI), a variational Bayesian framework of meta-learning for cross domain image classification. Within the meta learning setting, MetaVI is derived to learn a probabilistic latent variable model by maximizing a meta evidence lower bound (Meta ELBO) for knowledge transfer across domains. To enhance the discriminative ability of the model, we further introduce a Wasserstein distance based constraint to the variational objective, leading to the Wasserstein MetaVI, which largely improves classification performance. By casting into a probabilistic inference problem, MetaVI offers the first, principled variational meta-learning framework for cross domain learning. In addition, we collect a new visual recognition dataset to contribute a more challenging benchmark for cross domain learning, which will be released to the public. Extensive experimental evaluation and ablation studies on four benchmarks show that our Wasserstein MetaVI achieves new state-of-the-art performance and surpasses previous methods, demonstrating its great effectiveness.

Abstract:
Re-ranking utilizes contextual information to optimize the initial ranking list of person or vehicle re-identification (re-ID), which boosts the retrieval performance at post-processing steps. This paper proposes a re-ranking network to predict the correlations between the probe and top-ranked neighbor samples. Specifically, all the feature embeddings of query and gallery images are expanded and enhanced by a linear combination of their neighbors, with the correlation prediction serving as discriminative combination weights. The combination process is equivalent to moving independent embeddings toward the identity centers, improving cluster compactness. For correlation prediction, we first aggregate the contextual information for probe's k-nearest neighbors via the Transformer encoder. Then, we distill and refine the probe-related features into the Contextual Memory cell via attention mechanism. Like humans that retrieve images by not only considering probe images but also memorizing the retrieved ones, the Contextual Memory produces multi-view descriptions for each instance. Finally, the neighbors are reconstructed with features fetched from the Contextual Memory, and a binary classifier predicts their correlations with the probe. Experiments on six widely-used person and vehicle re-ID benchmarks demonstrate the effectiveness of the proposed method. Especially, our method surpasses the state-of-the-art re-ranking approaches on large-scale datasets by a significant margin, i.e., with an average 4.83% CMC@1 and 14.83% mAP improvements on VERI-Wild, MSMT17, and VehicleID datasets.
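The expansion step described in the abstract above, where each embedding is replaced by a weighted linear combination of its neighbors, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' network: the correlation weights are taken as a given array (in the paper they would come from the learned predictor), and both the weight normalisation and the final L2 normalisation are choices made for this example.

```python
import numpy as np


def expand_with_neighbors(
    features: np.ndarray, neighbor_idx: np.ndarray, weights: np.ndarray
) -> np.ndarray:
    """Expand each embedding as a weighted combination of its neighbors.

    features:     (N, D) embeddings of query and gallery images.
    neighbor_idx: (N, K) integer indices of each sample's top-K neighbors.
    weights:      (N, K) non-negative correlation scores (assumed given).
    Returns (N, D) L2-normalised expanded embeddings.
    """
    neighbors = features[neighbor_idx]  # (N, K, D) via integer indexing
    # Assumption: weights are rescaled to sum to one per sample.
    w = weights / (weights.sum(axis=1, keepdims=True) + 1e-12)
    expanded = (w[..., None] * neighbors).sum(axis=1)  # (N, D)
    norms = np.linalg.norm(expanded, axis=1, keepdims=True)
    return expanded / np.maximum(norms, 1e-12)
```

When the weights are high for neighbors that share the probe's identity and low otherwise, the combination pulls each embedding toward its identity center, which is the compactness effect the abstract describes.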

Abstract:
Seeking reliable correspondences and then recovering camera poses from a set of putative correspondences extracted from two images of the same scene is a fundamental problem in computer vision. Recent advances have demonstrated that this problem can be effectively solved by using a deep architecture based on the multi-layer perceptron, where the context normalization is designed to make the network permutation-equivariant and embed global information in the sparse point data. However, the context normalization simply normalizes the feature maps according to their distribution and treats each correspondence equally, leading to difficulties in adequately capturing scene geometry encoded by the inliers, especially in case of severe outliers. To address this issue, this paper designs a context-sensitive network based on the self-attention mechanism, termed correspondence attention transformer (CAT), to enhance the consistent geometry information of inliers and simultaneously suppress outliers during embedding global information. In particular, we design an attention-style structure to aggregate features from all correspondences, i.e., a spatial attention named CAT-S, which provides each correspondence with information exchange from others in the putative set. To capture the contextual information in a more comprehensive and robust way, we also introduce a multi-head mechanism in our structure to exploit the geometrical context from different aspects. Moreover, considering the high memory demand of spatial attention, we propose a covariance-normalized channel attention, CAT-C, in our framework, which can largely reduce the memory consumption and parameter scale, but requires eigenvalue decomposition in each attention block and thus results in more runtime. Both attention mechanisms realize information exchange, from the spatial or the channel aspect, which contributes to constructing the geometrical context between inliers and encourages the network to pay more attention to the feature subset about potential inliers. Extensive experiments have been conducted over both indoor and outdoor datasets on the tasks of camera pose estimation, outlier removal, and image registration, which demonstrate the superiority of our method that realizes a large performance improvement compared with the current state-of-the-art approaches.
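For readers unfamiliar with the context normalization that the abstract above contrasts CAT with, the sketch below shows its commonly used form: each feature channel is normalised by its mean and standard deviation taken across all correspondences of one image pair. This is a minimal illustration of that baseline operation, not of CAT; the function name and the `eps` value are assumptions made for this example.

```python
import numpy as np


def context_norm(features: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalise each channel across the set of correspondences.

    features: (N, C) array, one row per putative correspondence.
    Statistics are computed over the N correspondences, so every
    correspondence contributes equally, inlier or outlier.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)
```

Because the statistics are taken over the whole set, reordering the correspondences only reorders the output rows, which is the permutation-equivariance property mentioned in the abstract. The same equal weighting is what lets a large outlier fraction dominate the statistics.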

Abstract:
Existing research on handwritten Chinese characters is mainly based on recognition networks designed to handle the complex structures and the large number of Chinese characters. In this paper, we investigate Chinese characters from the perspective of error correction, which is to diagnose whether a handwritten character is right or wrong and to provide feedback on error analysis. For this handwritten Chinese character error correction task, we define a benchmark by unifying both the evaluation metrics and data splits for the first time. Then we design a diagnosis system that includes decomposition, judgement and correction stages. Specifically, a novel tree-structure analysis network (TAN) is proposed to model a Chinese character as a tree layout, which mainly consists of a CNN-based encoder and a tree-structure based decoder. Using the predicted tree layout for judgement, a correction operation is performed on the wrongly written characters for error analysis. The correction stage is composed of three steps: fetch the ideal character, correct the errors and locate the errors. Additionally, we propose a novel bucketing mining strategy to apply triplet loss at radical level to alleviate feature dispersion. Experiments on a handwritten character dataset demonstrate that our proposed TAN shows great superiority on all three metrics compared with other state-of-the-art recognition models. Through quantitative analysis, TAN is shown to capture more accurate spatial position information than regular encoder-decoder models, showing better generalization ability.

Affiliations: School of Information and Communication Engineering, Xi’an Jiaotong University, Xi’an, China; Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China; Space Precision Measurement Laboratory, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an, China; School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China; Ministry of Education Key Laboratory for Intelligent Networks and Network Security, School of Information and Communications Engineering, and SMILES LAB, Xi’an Jiaotong University, Xi’an, China

Abstract:
Depth estimation aims to predict depth maps from RGB images without high-cost equipment. Deep learning based depth estimation methods have shown their effectiveness. However, in existing methods, depth information is represented by a per-pixel depth map. Such a depth map representation is fragile when facing different kinds of depth changes. This paper proposes a Compressive Sensing based Depth Representation (CSDR) scheme, which formulates the problem of depth estimation in pixel space into the task of fixed-length vector regression in representation space. In this way, deep model training errors will not directly interfere with depth estimation, and distortions in estimated depth maps can be restrained to the greatest extent. In addition, we improve depth estimation from two other aspects: model structure and loss function. To capture the features in different scales, we propose a Multiscale Encoder & Multiscale Decoder (MEMD) structure as the vector regression model. To further deal with depth change, we also modify the loss function, where the curvature difference between ground truth and estimation is directly incorporated. With the support of CSDR, MEMD and the curvature loss, the proposed approach achieves superior performance on a challenging depth estimation dataset: NYU-Depth-v2. A range of experiments support our claim that regression in CSDR space performs better than traditional direct depth map estimation in pixel space.

Abstract:
Multi-Label Image Classification (MLIC) is a fundamental yet challenging task which aims to recognize multiple labels from given images. The key to solving MLIC lies in how to accurately model the correlation between labels. Recent studies often adopt Graph Convolutional Network (GCN) to model label dependencies with word embeddings as prior knowledge. However, classical word embeddings typically contain redundant information due to the imperfect distributional hypothesis they rely on, which may degrade model generalizability. To tackle this problem, we propose a novel deep learning framework termed Visual-Semantic based Graph Convolutional Network (VSGCN), which alleviates the negative impact of redundant information by utilizing heterogeneous sources of prior knowledge. Specifically, we construct both a visual prototype and a semantic prototype for each label as heterogeneous prior label representations, which are further mapped to multi-label classifiers via two Multi-Head GCNs separately. The Multi-Head GCN mechanism proposed in this paper aims to guide the information propagation between prototypes for each label, which constructs multiple correlation graphs to simultaneously model the label correlation in different subspaces. Notably, we alleviate the negative influence of needless information by decreasing the inconsistency of predictions that come from visual space and semantic space. Extensive experiments conducted on various multi-label image datasets demonstrate the superiority of our proposed method.

Abstract:
One trend in the latest bottom-up approaches for arbitrary-shape scene text detection is to determine the links between text segments using Graph Convolutional Networks (GCNs). However, the performance of these bottom-up methods is still inferior to that of state-of-the-art top-down methods even with the help of GCNs. We argue that a cause of this is that bottom-up methods fail to make proper use of visual-relational features, which results in accumulated false detection, as well as the error-prone route-finding used for grouping text segments. In this paper, we improve classic bottom-up text detection frameworks by fusing the visual-relational features of text with two effective false positive/negative suppression (FPNS) mechanisms and developing a new shape-approximation strategy. First, dense overlapping text segments depicting the “characterness” and “streamline” properties of text are constructed and used in weakly supervised node classification to filter the falsely detected text segments. Then, relational features and visual features of text segments are fused with a novel Location-Aware Transfer (LAT) module and Fuse Decoding (FD) module to jointly rectify the detected text segments. Finally, a novel multiple-text-map-aware contour-approximation strategy is developed based on the rectified text segments, instead of the error-prone route-finding process, to generate the final contour of the detected text. Experiments conducted on five benchmark datasets demonstrate that our method outperforms the state-of-the-art performance when embedded in a classic text detection framework, which revitalizes the strengths of bottom-up methods.

Abstract:
Learning effective joint embedding for cross-modal data has always been a focus in the field of multimodal machine learning. We argue that during multimodal fusion, the generated multimodal embedding may be redundant, and the discriminative unimodal information may be ignored, which often interferes with accurate prediction and leads to a higher risk of overfitting. Moreover, unimodal representations also contain noisy information that negatively influences the learning of cross-modal dynamics. To this end, we introduce the multimodal information bottleneck (MIB), aiming to learn a powerful and sufficient multimodal representation that is free of redundancy and to filter out noisy information in unimodal representations. Specifically, inheriting from the general information bottleneck (IB), MIB aims to learn the minimal sufficient representation for a given task by maximizing the mutual information between the representation and the target and simultaneously constraining the mutual information between the representation and the input data. Different from general IB, our MIB regularizes both the multimodal and unimodal representations, which is a comprehensive and flexible framework that is compatible with any fusion methods. We develop three MIB variants, namely, early-fusion MIB, late-fusion MIB, and complete MIB, to focus on different perspectives of information constraints. Experimental results suggest that the proposed method reaches state-of-the-art performance on the tasks of multimodal sentiment analysis and multimodal emotion recognition across three widely used datasets.
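
The general information bottleneck that MIB inherits from can be written as the Lagrangian below. This is the standard IB form only, given as a sketch: the symbols $x$ (input), $y$ (target), $z$ (learned representation), $\theta$ (encoder parameters) and $\beta$ (trade-off weight) are notation assumed here, and the abstract does not state how the multimodal and unimodal terms are combined in MIB.

```latex
\max_{\theta}\; I(z; y) \;-\; \beta\, I(z; x),
\qquad z \sim p_{\theta}(z \mid x), \quad \beta > 0
```

The first term rewards representations that are informative about the target, while the second penalizes information retained about the input, which is what pushes $z$ toward a minimal sufficient representation.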

Abstract:
For decades, gait has been gathering extensive interest due to the advantage that it can be measured from a distance without physical contact. However, the performance of image/video-based gait recognition can be remarkably influenced by exterior factors, such as viewing angles and clothing changes. Thus, in this paper, a group-supervised disentangled representation learning network is proposed for gait recognition to extract features invariant to these factors. First, sequences are explicitly disentangled into pose, gait, appearance, and view features through a generic encoder-decoder framework. To ensure feature adaptability and independence, a disentanglement swap module is specifically adopted during our encoder-decoder process through a series of swap operations based on the feature attributes. Following the feature disentanglement, a disentanglement aggregation module is also proposed for pose, gait, and appearance features to enhance their effectiveness. Finally, the three enhanced features are concatenated for gait recognition. Experiments show that, compared with other disentangled representation learning-based gait recognition methods, our proposed method achieves better recognition results despite using fewer gait frames.

Abstract:
Viewport-adaptive streaming approaches are emerging as the most promising way to deliver high-quality 360° videos. The viewport prediction techniques are developed to reduce bandwidth waste and improve users’ Quality of Experience (QoE). However, the viewport prediction result is only reliable with a short prediction window, i.e., a short playback buffer, which conflicts with maintaining a long buffer to minimize the stall ratio. To deal with this problem, we present RAM360, a Robust Adaptive Multi-layer 360° video streaming system, to ensure high viewport quality and low stall ratio concurrently. We make three technical contributions. First, we design a QoE-driven robust multi-layer streaming framework, where each chunk is encoded into multiple independent layers with different quality levels. The client can dynamically decide which chunk and which layer to download according to their QoE contributions. Thus, the client can enhance the low-quality chunks (including the mistakenly predicted ones) in time to improve the viewport quality. Meanwhile, the client can adaptively download new chunks to the buffer to decrease the risk of stall. Second, we establish a novel model as users’ QoE metric throughout the playback progress, aiming to guide the client’s download theoretically. Third, we utilize the Lyapunov optimization theory to solve the QoE optimization problem online while assuring our algorithm’s near-optimality. We demonstrate that RAM360 can significantly outperform the existing schemes regarding the QoE (related to viewport quality, viewport quality oscillation, and stall ratio) through extensive experiments on public datasets.

Abstract:
Deep neural networks are vulnerable to adversarial examples, which are crafted by adding small perturbations to benign examples. However, most existing attack methods exhibit poor transferability when attacking black-box models, especially models protected by defense methods. In addition, perturbations added to semantically irrelevant regions of benign examples are usually inefficient for attacks. To address these issues, we propose a transferable adversarial belief attack with salient region perturbation restriction method, which improves the transferability of adversarial examples and decreases the amount of perturbation significantly. Specifically, we first design a salient-region-based perturbation restriction strategy to restrict the range of perturbations to a salient region. After that, we present a transferable belief attack method to generate the adversarial examples iteratively. Besides, our method can be easily integrated with other gradient-based transfer attack methods to further enhance the transferability of adversarial examples. Extensive experiments on the ImageNet dataset show that our method achieves higher transferability with lower perturbations than the state-of-the-art attack methods.
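
The idea of restricting perturbations to a salient region can be illustrated with a single masked sign-gradient step. This is a minimal sketch and not the paper's belief attack: the function name `masked_sign_step`, the binary `mask`, the step size `alpha`, the budget `eps`, and the flat-list image layout with pixel values in [0, 1] are all assumptions made for illustration.

```python
from typing import List


def masked_sign_step(
    x: List[float],
    grad: List[float],
    mask: List[int],
    x_orig: List[float],
    alpha: float,
    eps: float,
) -> List[float]:
    """One iterative sign-gradient step, applied only where mask == 1.

    x      : current adversarial image (flattened, values in [0, 1])
    grad   : loss gradient with respect to x (hypothetical input)
    mask   : 1 inside the salient region, 0 elsewhere (hypothetical input)
    x_orig : benign image, used to keep the perturbation within +/- eps
    """
    out = []
    for xi, gi, mi, oi in zip(x, grad, mask, x_orig):
        sign = (gi > 0) - (gi < 0)           # sign of the gradient: -1, 0 or 1
        stepped = xi + alpha * sign * mi      # no change outside the salient mask
        stepped = min(max(stepped, oi - eps), oi + eps)  # stay in the eps-ball
        out.append(min(max(stepped, 0.0), 1.0))          # stay a valid pixel
    return out


if __name__ == "__main__":
    benign = [0.5, 0.5, 0.5]
    print(masked_sign_step(benign, [1.0, -2.0, 3.0], [1, 1, 0], benign, 0.1, 0.05))
```

Pixels outside the mask are left untouched, so the total perturbation is concentrated where the mask marks the image as salient.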

Abstract:
Weakly-supervised temporal action localization (WTAL) is a challenging task in understanding untrimmed videos, in which no frame-wise annotation is provided during training and only the video-level category label is available. Current methods mainly adopt temporal attention branches to conduct foreground-background separation with RGB and optical flow features simply concatenated, disregarding the discriminative spatial features and the complementarity between different modalities. In this work, we propose a Multi-Dimensional Attention (MDA) method to explore attention mechanism across three dimensions in weakly supervised action localization, i.e., 1) temporal attention that focuses on segments containing action instances, 2) channel attention that discovers the most relevant cues for action description, and 3) modal attention that fuses RGB and flow information adaptively based on feature magnitudes during background modeling. In addition, we introduce a similarity constraint loss to refine the action segment representation in feature space, which helps the network to detect less discriminative frames of an action to capture the full action boundaries. The proposed MDA with similarity constraints can be easily applied to existing action detection frameworks with few parameters. Extensive experiments on THUMOS’14 and ActivityNet v1.2 datasets show that the proposed method outperforms the current state-of-the-art WTAL approaches, and achieves comparable results with some advanced fully-supervised methods.

Affiliations: Anqing Normal University and Jiangxi University of Finance and Economics, Anqing, China; Center for Research on Intelligent Perception and Computing (CRIPAC), National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing, China; School of Information Management, Jiangxi University of Finance and Economics, Nanchang, China; Shanghai Institute for Advanced Communication and Data Science, School of Communication and Information Engineering, Shanghai University, Shanghai, China

Abstract:
Recently, many view synthesis-based methods have been proposed for high-efficiency light field (LF) image compression. However, most existing methods fail to recover texture details in occlusion regions, which reduces the compression efficiency. In this paper, we propose a multi-stream dense view reconstruction network to further improve LF image compression performance. In our method, only sparsely-sampled LF views are transmitted and the rest of the views are reconstructed at the decoder side. During the reconstruction process, we first construct a multi-disparity geometry (MDG) structure based on the decoded sparse LF views, which can reflect abundant disparity characteristics. Subsequently, a multi-stream view reconstruction network (MSVRNet) is put forward to reconstruct a high-quality dense LF image, which consists of a multi-scale feature fusion sub-network, a fusion reconstruction sub-network, and a detail refinement sub-network. The multi-scale feature fusion sub-network can implicitly learn abundant multiscale geometric structure features from the constructed MDG structure. The fusion reconstruction sub-network and the detail refinement sub-network are respectively utilized to fuse the learned multiscale geometric features and restore more texture details, especially for occlusion regions. Moreover, 3D convolutional operations are adopted in the whole reconstruction process, which allow information propagation among the learned multiscale geometric features. Comprehensive experimental results demonstrate the effectiveness of the proposed method. The perceptual quality of reconstructed views and application on depth estimation also demonstrate that the proposed method can keep structural consistency of the reconstructed LF image and recover more texture details.

Abstract:
Skeleton data carries valuable motion information and is widely explored in human action recognition. However, not only the motion information but also the interaction with the environment provides discriminative cues to recognize the action of persons. In this paper, we propose a joint learning framework for mutually assisted “interacted object localization” and “human action recognition” based on skeleton data. The two tasks are serialized together and collaborate to promote each other, where preliminary action type derived from skeleton alone helps improve interacted object localization, which in turn provides valuable cues for the final human action recognition. Besides, we explore the temporal consistency of interacted object as constraint to better localize the interacted object with the absence of ground-truth labels. Extensive experiments on the datasets of SYSU-3D, NTU60 RGB+D, Northwestern-UCLA and UAV-Human show that our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition. Visualization results show that our method can also provide reasonable interacted object localization results.

Abstract:
Video captioning aims to generate natural language descriptions for a given video, which is a more challenging task than static image captioning since it requires a more diverse and exhaustive result. Meanwhile, it is also important that the generated captions should be consistent with the language habits of people at a fine granularity. In this work, unlike most recent works enhancing performance with additional data modalities or complex model designs, we focus on optimizing the training process of video captioning models. Firstly, to generate a more diverse video caption, we propose the bidirectional maximum entropy (BME) training, which directly optimizes the probability distribution of neighboring words under a reinforcement learning (RL) framework. Secondly, to search for more human-like captions in the larger search space created by BME, we introduce the word co-occurrence (WCO) weighting. It adaptively guides RL algorithms with co-occurrence statistics in the training corpus. Our method can be deployed on existing captioning models in a plug-and-play manner without introducing any extra parameters. Experimental results show that our method yields up to 5.8% and 7.0% improvements considering the CIDEr score on MSVD and MSR-VTT, respectively.

Abstract:
Inspired by the powerful representation capability of deep neural networks, deep cross-modal hashing methods have recently drawn much attention and various deep cross-modal hashing methods have been developed. However, two key problems have not been solved well yet: 1) With advanced neural network models, how to seek the multi-modal alignment space which can effectively model the intrinsic multi-modal correlations and reduce the heterogeneous modality gaps. 2) How to effectively and efficiently preserve the modelled multi-modal semantic correlations into the binary hash codes under the deep learning paradigm. In this paper, we propose a Hierarchical Message Aggregation Hashing (HMAH) method within an efficient teacher-student learning framework. Specifically, on the teacher end, we develop hierarchical message aggregation networks to construct a multi-modal complementary space by aggregating the semantic messages hierarchically across different modalities, which can better align the heterogeneous modalities and model the fine-grained multi-modal correlations. On the student end, we train a couple of student modules that learn hash functions to support cross-modal retrieval. We design a cross-modal correlation knowledge distillation strategy which seamlessly transfers the modelled fine-grained multi-modal semantic correlations from the teacher to the lightweight student modules. With the fine-grained knowledge supervision from teacher module, the semantic representation capability of hash functions can be enhanced. In addition, the whole learning framework avoids the time-consuming finetuning on the pre-trained deep models as existing methods and it is computationally efficient. Experimental results demonstrate the significant performance improvement of the proposed method on both retrieval accuracy and efficiency, compared with the state-of-the-art deep cross-modal hashing methods.

Abstract:
Sound event detection is an important facet of audio tagging that aims to identify sounds of interest and define both the sound category and time boundaries for each sound event in a continuous recording. With advances in deep neural networks, there has been tremendous improvement in the performance of sound event detection systems, although at the expense of costly data collection and labeling efforts. In fact, current state-of-the-art methods employ supervised training methods that leverage large amounts of data samples and corresponding labels in order to facilitate identification of sound category and time stamps of events. As an alternative, the current study proposes a semi-supervised method for generating pseudo-labels from unsupervised data using a student-teacher scheme that balances self-training and cross-training. Additionally, this paper explores post-processing which extracts sound intervals from network prediction, for further improvement in sound event detection performance. The proposed approach is evaluated on the sound event detection task of the DCASE2020 challenge. The results of these methods on both “validation” and “public evaluation” sets of the DESED database show significant improvement compared to the state-of-the-art systems in semi-supervised learning.

Abstract:
The lack of sufficient training data has been one obstacle to fine-grained visual classification research because labeling subcategories generally requires specialist knowledge. As one optional approach to alleviating the data-hunger problem, leveraging web images as training data is drawing increasing attention. Nevertheless, web images potentially have false labels, which can misguide the training process. Although several works have been proposed to deal with label noise, it can still be difficult for the network to tackle complex real-world noisy labels without any prior knowledge. In this paper, we propose to leverage a small and clean meta-set to provide reliable prior knowledge for tackling noisy web images. Specifically, our method trains a network with two peer predicting heads, which learn from noisy web images (web head) and meta ones (meta head), respectively. The meta head produces pseudo soft labels for web images to revise their training loss, which can overcome the high noise ratio problem. Furthermore, a selection net is trained in a meta-learning strategy to identify in- and out-of-distribution noisy images. Then in-distribution ones are reused for training with pseudo soft labels produced by the meta head as supervision, while out-of-distribution ones are discarded. In this manner, the misguidance caused by label noise is remarkably alleviated and in-distribution noisy samples are properly exploited to boost model performance. The superiority of our proposed approach is demonstrated by mathematical theory with great interpretability as well as extensive experimental results on the real-world dataset WebFG-496.

Abstract:
Video Question Answering (VideoQA), aiming to correctly answer a given question based on understanding multimodal video content, is challenging due to the richness of the video content. From the perspective of video understanding, a complete VideoQA framework needs to understand the video content at different semantic levels and flexibly integrate diverse video content to distill question-related content. To this end, we propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR. Specifically, LiVLR first utilizes graph-based visual and linguistic encoders to obtain multi-grained visual and linguistic representations, respectively. Subsequently, the obtained representations are integrated with the devised Diversity-aware Visual-Linguistic Reasoning module (DaVL). DaVL distinguishes different types of representations with the learnable index embedding in graph embedding. Therefore, DaVL can flexibly adjust the importance of different representations when generating the question-related joint representation. The proposed LiVLR is lightweight and shows its performance advantage on three VideoQA benchmarks, MSRVTT-QA, KnowIT VQA, and TVQA. Extensive ablation studies demonstrate the effectiveness of the key components of LiVLR.

Abstract:
Recent advances in 3D modeling software and 3D capture devices contribute to the availability of large-scale 3D objects. Together with the prevalence of deep neural networks (DNNs), DNN-based 3D object retrieval systems are widely applied, especially by inputting 2D images to retrieve 3D objects. Although DNNs have been shown to be vulnerable to adversarial attacks in classification, the vulnerability of DNN-based 3D object retrieval systems remains under-explored. In this paper, we formulate the problem of attacking DNN-based feature extractors in the 2D image-based 3D object retrieval system. Specifically, we consider that the attack happens under a reasonable scenario in which the candidate 3D object database is unknown to the adversary, which challenges adversarial example generation. To tackle this difficulty, we set up a reasonable hypothesis on the information accessible to the adversary, and then propose two effective perturbation generation methods: one is to corrupt domain-level alignment (CDA) and the other is to corrupt class-level alignment (CCA). Conversely, we propose a novel progressive adversarial training (PAT) method to improve the feature extractor robustness, which can effectively and stably mitigate both CDA and CCA attacks. Experimental results demonstrate that a typical feature extractor can be effectively compromised by attacks. Moreover, the transferability of the adversarial query illustrates the possibility of realistic black-box attacks. The successful defense against both CDA and CCA attacks by PAT validates the superiority of the proposed defense method.

Abstract:
The ability to deal with intra and inter-modality features has been critical to the development of RGB-D salient object detection. While many works have advanced in leaps and bounds in this field, most existing methods have not taken their way down into the inherent differences between the RGB and depth data due to widely adopted conventional convolution in which fixed parameter kernels are applied during inference. To promote intra and inter-modality interaction conditioned on various scenarios, as RGB and depth data are processed independently and later fused interactively, we develop a new insight and a better model. In this paper, we introduce a criss-cross dynamic filter network by decoupling dynamic convolution. First, we propose a Model-specific Dynamic Enhanced Module (MDEM) that dynamically enhances the intra-modality features with global context guidance. Second, we propose a Scene-aware Dynamic Fusion Module (SDFM) to realize dynamic feature selection between two modalities. As a result, our model achieves accurate predictions of salient objects. Extensive experiments demonstrate that our method achieves competitive performance over 28 state-of-the-art RGB-D methods on 7 public datasets.

Abstract:
Previous 3D object reconstruction methods from 2D images involve two issues: the lack of in-depth exploration of the prior knowledge of 3D shapes, and the difficulty of dealing with severely occluded parts. Inspired by human perception of real-world objects, which is composed of an overall impression (known as shape impression) and an enhanced cognition, we propose a deep network (denoted by DASI) to learn the Domain Adaptive Shape Impression for 3D reconstruction from arbitrary view images. DASI consists of two modules: a shape reconstruction module and a shape refinement module. The former module reconstructs a coarse volume by learning a domain adaptive shape impression as embedding in image-based reconstruction. We first leverage 3D objects to learn a shape impression associated with prior knowledge of 3D objects. To attain consensus on shape impression from 2D images, we regard the 3D shape and the 2D image as two different domains. By adapting the two domains, the shape impression learned from 3D objects is transferred to 2D images and guides the image-based reconstruction. The latter module refines the objects by modeling the whole 3D volume as local 3D patches and exploring their intrinsic geometry relationships. Quantitative and qualitative experimental results on two benchmark datasets demonstrate that DASI outperforms several state-of-the-art methods for 3D reconstruction from single and multi-view 2D images.

Abstract:
Infrared and visible image fusion aims to generate a composite image that can simultaneously describe the salient target in the infrared image and texture details in the visible image of the same scene. Since deep learning (DL) exhibits great feature extraction ability in computer vision tasks, it has also been widely employed in handling the infrared and visible image fusion problem. However, the existing DL-based methods generally extract complementary information from source images through convolutional operations, which results in limited preservation of global features. To this end, we propose a novel infrared and visible image fusion method, i.e., the Y-shape dynamic Transformer (YDTR). Specifically, a dynamic Transformer module (DTRM) is designed to acquire not only the local features but also the significant context information. Furthermore, the proposed network is devised in a Y-shape to comprehensively maintain the thermal radiation information from the infrared image and scene details from the visible image. Considering the specific information provided by the source images, we design a loss function that consists of two terms to improve fusion quality: a structural similarity (SSIM) term and a spatial frequency (SF) term. Extensive experiments on mainstream datasets illustrate that the proposed method outperforms both classical and state-of-the-art approaches in both qualitative and quantitative assessments. We further extend the YDTR to address other infrared and RGB-visible images and multi-focus images without fine-tuning, and the satisfactory fusion results demonstrate that the proposed method has good generalization capability.
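
For readers unfamiliar with the spatial frequency (SF) quantity named in the loss, the sketch below computes the commonly used SF measure of a grayscale image from its row and column differences. It illustrates the metric only: the abstract does not say how the SF term enters the YDTR loss, and the function name `spatial_frequency` and the nested-list image layout are assumptions made here.

```python
import math
from typing import Sequence


def spatial_frequency(img: Sequence[Sequence[float]]) -> float:
    """Spatial frequency SF = sqrt(RF^2 + CF^2) of a 2D grayscale image.

    RF^2 is the mean squared difference between horizontally adjacent
    pixels and CF^2 the mean squared difference between vertically
    adjacent pixels, both normalized by the total pixel count.
    """
    rows, cols = len(img), len(img[0])
    row_sq = sum(
        (img[i][j] - img[i][j - 1]) ** 2
        for i in range(rows)
        for j in range(1, cols)
    )
    col_sq = sum(
        (img[i][j] - img[i - 1][j]) ** 2
        for i in range(1, rows)
        for j in range(cols)
    )
    return math.sqrt((row_sq + col_sq) / (rows * cols))


if __name__ == "__main__":
    flat = [[0.3, 0.3], [0.3, 0.3]]      # no detail at all
    stripes = [[0.0, 1.0], [0.0, 1.0]]   # strong vertical edges
    print(spatial_frequency(flat), spatial_frequency(stripes))
```

A higher SF value indicates richer gradients and texture, which is why the measure is often used to encourage detail preservation in fused images.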

Abstract:
Research on micro-expression recognition has been drawing great attention in recent years because of its great potential in lie detection, clinical diagnosis, and national security. Amongst many challenges, data shortage stands out as it directly prevents accurate training of micro-expression recognition algorithms. In this work, we present our approach within a dataset alignment and active learning (DAAL) framework. DAAL effectively queries a minimal number of examples to label, and also transfers features from the micro-expression dataset to the macro-expression dataset. Specifically, the features from the micro-expression dataset are mapped to the macro-expression dataset with a translator, so that the classifier trained on the macro-expression dataset can be adjusted and adapted to boost the classification performance on the micro-expression dataset. Besides, the most informative examples in the micro-expression dataset are selected through active learning in an iterative way, which effectively improves the classification ability of the model. Comprehensive experiments on the CASME, CASME II, SAMM and SMIC databases demonstrate that the proposed DAAL outperforms previous works by a large margin on the micro-expression recognition task.

Abstract:
Action recognition in video understanding is a challenging task, largely because of the complexity and difficulty in temporal modeling, making it suffer from motion information loss and misalignment of temporal attention in spatial dimensions. To overcome these difficulties, we propose a novel temporal modeling method called Adjoint Enhancement Network (AE-Net), which can fully explore clues of motion and time in the long-range structure. The AE-Net mainly consists of two new modules: the Initial Adjoint Enhancement Module (IAE-Module), which deals with shallow features; and the Global Adjoint Enhancement Module (GAE-Module), which deals with global features. With a novel mechanism of parallel spatio-temporal convolution and difference fusion, the IAE-Module enhances the degree of motion transformation in shallow network features, exciting the potential of motion flow and avoiding motion information loss. The GAE-Module is proposed to improve the local temporal representation in long-range structures by feeding the enhanced feature differences into a spatial cascade module with residuals to resolve the misalignment of temporal attention in the spatial dimension. The experimental results show that our AE-Net can achieve state-of-the-art results on the Something-Something V1, UCF-101 and HMDB-51 datasets.

Abstract:
Dietary assessment has proven to be effective in evaluating the dietary intake of patients with diabetes and obesity. The traditional approach to assessing dietary intake is to conduct a 24-hour dietary recall, a structured interview designed to obtain information on food categories and volume consumed by the participants. Due to unconscious biases in this kind of self-reporting approach, many research studies have explored the use of vision-based approaches to provide accurate and objective assessments. Despite the promising results of food recognition by deep neural networks, there still exist several hurdles in deep learning-based food volume estimation, including domain shift between synthetic and raw 3D models, shape completion ambiguity, and the lack of large-scale paired training datasets. Therefore, this paper proposes an intelligent nutritional assessment approach via weakly-supervised point cloud completion, which aims to close the reality gap in 3D point cloud completion tasks and address the targeted challenges. The volume can then be easily estimated from the completed representation of the food. Another major merit of our system is that it can be used to estimate the volume of handheld food items without constraints such as placing the food items on a table or next to fiducial markers, which facilitates the implementation on both wearable and handheld cameras. Comprehensive experiments have been carried out on major benchmark datasets and a self-constructed volume-annotated dataset, in which the proposed method demonstrates comparable results with several strong fully-supervised baseline methods and shows superior completion ability in handling food volume estimation.

Abstract:
In this paper, we propose a novel low-complexity in-loop filtering approach named textural and directional information based offset (TDIO) for the video coding standard AVS3. Different from conventional offset-based filtering methods which partially use contextual samples, the key contribution of TDIO is that it fully utilizes the textural and edge directional features of each sample to comprehensively determine which type of texture characteristics each sample belongs to. The corresponding offsets are generated and signaled to decoder such that sample-level distortion is reduced. Specifically, the multi-directionality and sample-intensity pattern based classifiers are first proposed to extract the directional and textural features, respectively. The classification results are obtained by incorporating these features, and the optimal offset values for each class are derived based on rate-distortion optimization. Since sample-level offset signalling may cause heavy burden to the overhead of TDIO, we subsequently propose a filtering offset sharing mechanism based on historical information between available temporal-adjacent compressed frames. In addition, an iteration-based filter adaptation method is designed to improve the local adaptivity of TDIO for better compression efficiency. Experimental results show that the proposed TDIO achieves 0.64%, 1.29%, 1.86%, and 2.20% bit rate savings for all intra, random access, low delay B, and low delay P configurations, respectively. Moreover, TDIO is helpful to improve subjective quality by leveraging the fine-grained local texture characteristics. It can be observed that the blurring and ringing artifacts could be significantly suppressed by using the proposed method, yielding higher subjective quality.

Abstract:
Topic modelling (TM) has shown significant progress in boosting the effectiveness of image captioning in the last few years. Although previous topic-guided image captioning models have shown important improvements, some challenges remain unsolved, such as the independence of the topic predictors and the sentence generators, which results in ineffective exploitation of semantic information. Also, all the predicted topics or the top-one topic are used throughout the whole captioning task without considering the linguistic context of the current time step, which misleads the captioning network into focusing on inaccurate image objects. To tackle these challenges, we propose a novel image captioning method consisting of four modules: an enhanced topic predictor (ETP), a retrieval-based topics re-weighting module (RTR), a subsequent topic predictor (STP), and a caption generation module. The prediction and generation modules are trained in an end-to-end manner to promote the efficient use of topics by predicting suitable topics at each time step. ETP predicts the topics using the image features, and is enhanced with topic embedding (TE). The RTR is only applied in the testing stage for re-weighting the topics predicted by ETP. In each time step, the STP automatically predicts concise topic subsets to alleviate the diversity of the image topics. Compared with the existing topic-based models, our model can automatically generate more accurate and diverse captions, boosting the explainability of how the topics influence the generated word in each time step. Extensive experiments on the MS-COCO and Flickr30K benchmark datasets show that our method improves both overall image captioning performance and the topic prediction task, and outperforms many recent image captioning approaches in terms of the evaluation metrics.

Abstract:
Perceptual quality assessment of 3D synthesized views is an open research problem in computer vision. Researchers across the globe have developed several algorithms to identify distortions. However, the existing algorithms cannot quantify the context in which these distortions affect the overall perceptual quality. According to a recently proposed 3D view synthesis algorithm, the choice of context region for the disocclusion plays a vital role in predicting the quality of 3D views. A context region taken from the background of a view produces perceptually better 3D synthesized views than one taken from the foreground. With this in mind, the proposed algorithm aims to identify the context region and incorporate this information into the perceptual quality assessment of 3D synthesized views. We observed that the depth energy maps of the 3D synthesized views vary significantly with the change in the context region and can subsequently be used to identify the context region. Hence, in this work, we propose a new and efficient quality assessment algorithm based upon the variation in the depth of 3D synthesized and reference views, giving two-fold advantages: 1. It can predict the quality based on whether the context region is foreground or not. 2. It is also able to suggest the possible location of distortions. We propose two new algorithms, one for each situation: when the context region is foreground and when it is not. The overall predicted score is the product of the quality scores estimated for these two situations. When applied to the established benchmark dataset, the proposed technique performs satisfactorily, with a PLCC of 0.7707 and an SRCC of 0.7572. Also, the proposed algorithm can work as a plug-in to improve the performance of the existing algorithms.
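
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how a Pearson linear correlation coefficient (PLCC) and a Spearman rank correlation coefficient (SRCC) between predicted scores and subjective scores can be computed, together with the multiplicative combination of two partial scores described above. The function names and the sample numbers are illustrative assumptions; the abstract does not specify any fitting or normalization step, so none is applied here.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def combined_score(q_foreground, q_not_foreground):
    """Overall score as the product of the two partial quality scores."""
    return q_foreground * q_not_foreground

# Illustrative values only (not results from the paper).
predicted = [0.1, 0.2, 0.3, 0.4]
subjective = [1.0, 2.0, 4.0, 8.0]
print(pearson(predicted, subjective), spearman(predicted, subjective))
```

SRCC depends only on the ordering of the scores, so a monotone but non-linear relation gives a perfect SRCC while the PLCC stays below one.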

Abstract:
Audio-driven talking face video generation has attracted much attention recently. However, few existing works pay attention to machine learning of talking head movement, especially based on phonetic study. Observing that real-world talking faces are often accompanied by natural head movement, in this paper, we model the relation between the speech signal and talking head movement, which is a typical one-to-many mapping problem. To solve this problem, we propose a novel two-step mapping strategy: (1) in the first step, we train an encoder that predicts a head motion behavior pattern (modeled as a feature vector) from the head motion sequence of a short video of 10–15 seconds, and (2) in the second step, we train a decoder that predicts a unique head motion sequence from both the motion behavior pattern and the auditory features of an arbitrary speech signal. Based on the proposed mapping strategy, we build a deep neural network model that takes a speech signal of a source person and a short video of a target person as input, and outputs a synthesized high-fidelity talking face video with personalized head pose. Extensive experiments and a user study show that our method can generate high-quality personalized head movement in synthesized talking face videos while achieving facial animation quality (e.g., lip synchronization and expression) comparable to the state-of-the-art methods.

Abstract:
In recent years, low-rank representation (LRR) has received increasing attention in subspace clustering. However, due to the inevitable matrix inversion and singular value decomposition in each iteration, most existing LRR algorithms suffer from high computational complexity and hence cannot cope well with large-scale sample data. To overcome this problem, in this paper, we propose a bilateral fast low-rank representation (BFLRR), which has linear time complexity with respect to the number of samples. Specifically, we introduce the equivalent transformation method to remove the null spaces of both the columns and rows of the coefficient matrix so that a hypercompact coefficient matrix can be learned. Furthermore, the proposed BFLRR is embedded into a distributed framework as DFC-BFLRR to make it more efficient, which utilizes a combination of the global and local projection matrices. Extensive experiments are carried out on real datasets, and the results show that the proposed methods not only run faster but also obtain favorable clustering accuracy in comparison with the competing methods on large-scale sample data.

Abstract:
In online clothing sales, static model images show consumers only specific states of the clothing. Displaying clothing dynamically, without increasing shooting costs, by synthesizing a continuous image sequence between static images is therefore a subject of interest. This paper proposes a novel human image sequence synthesis method by pose-shape-content inference. Given two reference poses, the pose is interpolated in the pose manifold controlled by a linear parameter. The interpolated pose is transferred into the end shape by AdaIN and the attention mechanism to infer the target shape. Then the content in the reference image is transferred into this target shape. In the content transfer, the visual features of the human body cluster and the clothing cluster are extracted separately, and the Sobel gradient is adopted to extract clothing texture variation. During feature inference, the multiscale feature-level optical flow warps source features, and style code infusion infers new region content without source features. Extensive experiments demonstrate that our method is superior in inferring clear layouts and transferring reasonable content compared to the pose transfer baselines. Moreover, our method has been verified to apply to parsing-guided image inference and dynamic display based on the pose sequence.

Abstract:
Important people detection aims to identify the most important people (i.e., the people who play the main roles in scenes) in images, which is challenging since people's importance in images depends not only on their appearance but also on their interactions with others (i.e., relations among people) and their roles in the scene (i.e., relations between people and underlying events). In this work, we propose the People Relation Network (PRN) to solve this problem. PRN consists of three modules (i.e., the feature representation, relation and classification modules) to extract visual features, model relations and estimate people's importance, respectively. The relation module contains two submodules to model two types of relations, namely, the person-person relation submodule and the person-event relation submodule. The person-person relation submodule infers the relations among people from the interaction graph and the person-event relation submodule models the relations between people and events by considering the spatial correspondence between features. With the help of them, PRN can effectively distinguish important people from other individuals. Extensive experiments on the Multi-Scene Important People (MS) and NCAA Basketball Image (NCAA) datasets show that PRN achieves state-of-the-art performance and generalizes well when available data is limited.

Abstract:
Automatically generating videos in which synthesized speech is synchronized with lip movements in a talking head has great potential in many human-computer interaction scenarios. In this paper, we present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input. In contrast to previous text-driven talking head generation methods, which can only synthesize the voice of a specific person, the proposed method is capable of synthesizing speech for any person. Specifically, the proposed method decomposes the generation of synchronized speech and talking head videos into two stages, i.e., a text-to-speech (TTS) stage and a speech-driven talking head generation stage. The proposed TTS module is a face-conditioned multi-speaker TTS model that gets the speaker identity information from face images instead of speech, which allows us to synthesize a personalized voice on the basis of the input face image. To generate the talking head videos from the face images, a facial landmark-based method that can predict both lip movements and head rotations is proposed. Extensive experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons, in which the timbre of the synthesized voice is in harmony with the input face, and the proposed landmark-based talking head method outperforms the state-of-the-art landmark-based method on generating natural talking head videos.

Abstract:
Document intelligence, as a relatively new research topic, supports many business applications. Its main task is to automatically read, understand, and analyze documents. However, due to the diversity of formats (invoices, reports, forms, etc.) and layouts in documents, it is difficult to make machines understand documents. In this paper, we present GraphDoc, a multimodal graph attention-based model for various document understanding tasks. GraphDoc is pre-trained in a multimodal framework by utilizing text, layout, and image information simultaneously. In a document, a text block relies heavily on its surrounding context; accordingly, we inject the graph structure into the attention mechanism to form a graph attention layer so that each input node can only attend to its neighbors. The input nodes of each graph attention layer are composed of textual, visual, and positional features from semantically meaningful regions in a document image. We perform the multimodal feature fusion of each node with a gate fusion layer. The contextualization between nodes is modeled by the graph attention layer. GraphDoc learns a generic representation from only 320k unlabeled documents via the Masked Sentence Modeling task. Extensive experimental results on the publicly available datasets show that GraphDoc achieves state-of-the-art performance, which demonstrates the effectiveness of our proposed method.
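
The restriction that each node attends only to its neighbors can be sketched as attention computed over a neighbor list. This is a simplified illustration, not the GraphDoc layer: the raw node features serve as queries, keys, and values (no learned projections or multiple heads), and the `neighbors` mapping, which here includes each node itself, is a hypothetical input.

```python
import math

def graph_attention(features, neighbors):
    """Scaled dot-product attention in which node i only sees the nodes in neighbors[i].

    features:  list of equally long float vectors, one per node
    neighbors: dict mapping a node index to the list of indices it may attend to
    """
    d = len(features[0])
    out = []
    for i, q in enumerate(features):
        allowed = neighbors[i]
        # Attention scores are computed for permitted neighbors only.
        scores = [sum(a * b for a, b in zip(q, features[j])) / math.sqrt(d) for j in allowed]
        m = max(scores)  # subtract the maximum for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * features[j][k] for w, j in zip(weights, allowed)) for k in range(d)])
    return out

# Three text blocks: 0 and 1 are neighbors of each other, 2 is isolated.
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
neighbors = {0: [0, 1], 1: [0, 1], 2: [2]}
print(graph_attention(features, neighbors))
```

In this sketch, changing the features of node 2 leaves the outputs of nodes 0 and 1 untouched, which is the locality property the graph structure is meant to enforce.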

Abstract:
In machine learning, the relatedness across multiple tasks is usually complex and entangled. Due to dataset bias, the relatedness among tasks might be distorted and mislead the training of the models with solid learning ability, such as the multi-task neural networks. In this paper, we propose the idea of Relatedness Refinement Multi-Task Learning (RRMTDL) by introducing adversarial learning in the multi-task deep neural network to tackle the problem. The RRMTDL deep learning model restrains the misleading relatedness task by adversarial training and extracts information sharing across tasks with valuable relatedness. With RRMTDL, multi-task deep learning can enhance the task-specific representation for the major tasks by excluding the misleading relatedness. We design tests with various combinations of task-relatedness to validate the proposed model. Experimental results show that the RRMTDL model can effectively refine the task relatedness and prominently outperform other multi-task deep learning models in datasets with entangled task labels.

Abstract:
In recent years, RGB-T salient object detection (SOD) has attracted continuous attention, as introducing thermal images makes it possible to identify salient objects in environments such as low light. However, most of the existing RGB-T SOD models focus on how to perform cross-modality feature fusion, ignoring whether the thermal image really always matters in the SOD task. Starting from the definition and nature of this task, this paper rethinks the connotation of the thermal modality and proposes a network named TNet to solve the RGB-T SOD task. We introduce a global illumination estimation module to predict the global illuminance score of the image, so as to regulate the role played by the two modalities. In addition, considering the role of the thermal modality, we set up different cross-modality interaction mechanisms in the encoding phase and the decoding phase. On the one hand, we introduce a semantic constraint provider to enrich the semantics of thermal images in the encoding phase, which makes the thermal modality more suitable for the SOD task. On the other hand, we introduce a two-stage localization and complementation module in the decoding phase to transfer the object localization cue and internal integrity cue in thermal features to the RGB modality. Extensive experiments on three datasets show that the proposed TNet achieves competitive performance compared with 20 state-of-the-art methods.

Abstract:
To localize text regions and separate close instances, the shrunk polygon is widely used in recent scene text detection methods. However, there exist two problems: 1) Existing methods fail to consider the aspect ratio sensitivity problem when reconstructing the text instance from the shrunk polygon. 2) Texts with extreme aspect ratios lead to the fracture of shrunk polygons. To handle these two problems, in this paper, we propose a novel Adaptive Dilation Network (ADNet) that focuses on the reconstruction process from the shrunk polygon and aims to provide a tight and complete text representation. Firstly, instead of using a fixed dilation factor, ADNet uses an aspect ratio-wise dilation factor to reconstruct the text region from the shrunk polygon for each text instance. Such an instance-wise dilation factor considers the scale correlation between the original and shrunk polygons, and thus can guide an adaptive text region reconstruction for texts with large aspect ratio variance. Secondly, to deal with the fracture of detection results, a new Efficient Spatial Relationship Module (ESRM) is devised to capture long-range dependencies with low computation cost. ESRM uses a novel Weighted Pooling to reduce the resolution of feature maps without much information loss. Compared with the existing methods, ADNet further explores the potential of shrunk polygon-based approaches and obtains excellent detection results at an impressive speed. Extensive experiments on several datasets (Total-Text, CTW1500, MSRA-TD500 and ICDAR2015) verify the superiority of our method.

Abstract:
Domain Adaptive Object Detection (DAOD) transfers an object detector from the labeled source domain to a novel unlabelled target domain. Recent advances bridge the domain gap by aligning category-agnostic feature distribution and minimizing the domain discrepancy for adapting semantic distribution. Despite their great success, these methods model domain discrepancy with prototypes within a batch, yielding a biased estimation of domain-level statistics. Moreover, the category-agnostic alignment leads to the disagreement of the cross-domain semantic distribution with inevitable classification errors. To address these two issues, we propose an enhanced Semantic Conditioned AdaptatioN (SCAN++) framework, which leverages unbiased semantics for DAOD. Specifically, in the source domain, we design the conditional kernel to sample Pixel of Interests (PoIs), and aggregate PoIs with a cross-image graph to estimate an unbiased semantic sequence. Conditioned on the semantic sequence, we further update the parameter of the conditional kernel in a semantic conditioned manifestation module, and establish a novel conditional graph in the target domain to model unlabeled semantics. After modeling the semantic distribution in both domains, we integrate the conditional kernel into adversarial alignment to achieve semantic-aware adaptation in a Conditional Kernel guided Alignment (CKA) module. Meanwhile, the Semantic Sequence guided Transport (SST) module is proposed to transfer reliable semantic knowledge to the target domain by solving the cross-domain Optimal Transport (OT) assignment, achieving unbiased adaptation at the semantic level. Comprehensive experiments on four adaptation scenarios demonstrate that SCAN++ achieves state-of-the-art results. The code is available at https://github.com/CityU-AIM-Group/SCAN/tree/SCAN++.

Abstract:
Manipulating visual attributes of an image through a natural language description, known as text-to-image attributes manipulation (T2AM), is a challenging task. Existing approaches tend to search the whole image to manipulate the target instance indicated by a description; thus they often fail to locate and manipulate the accurate text-relevant regions, and even disturb the text-irrelevant contents, e.g., texture and background. Meanwhile, the model efficiency needs to be improved. To tackle the above issues, we introduce a novel yet simple GAN-based approach, namely Structuring Image for Manipulating (SIMGAN), to narrow down the optimization areas from external to internal. It consists of two major components: 1) External Structuring (ExST), a pretrained segmentation network, for recognizing and separating the target instances and background from an image; and 2) Internal Structuring (InST) for seeking out and editing the text-relevant attributes of the target instances based on the given description and masked hierarchical image representations from ExST. Specifically, the InST structures target instances from outline to detail by first drawing the sketch and color underpainting of instances with an Outline-Oriented Structuring (OuST), and then enhancing the text-relevant attributes and elaborating on details with a Detail-Oriented Structuring (DeST). Extensive experiments on benchmark datasets demonstrate that our framework significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Compared with the state-of-the-art method ManiGAN, our approach reduces the training time by 88%, while inference is three times faster. In addition, our approach is easily extended to solve the instance-level image-to-image translation problem, and the results exhibit the versatility and effectiveness of our approach. The code is released at https://github.com/qikizh/SIMGAN.

Abstract:
Multi-label zero-shot learning extends conventional single-label zero-shot learning to a more realistic scenario that aims at recognizing multiple unseen labels of classes for each input sample. Existing works usually exploit attention mechanisms to generate the correlation among different labels. However, most of them are biased towards several major classes while neglecting most of the minor classes, which have the same importance in input samples, and may thus produce overly diffused attention maps that cannot sufficiently cover minor classes. We argue that disregarding the connection between major and minor classes, which correspond to the global and local information, respectively, is the cause of the problem. In this paper, we propose a novel framework of unbiased multi-label zero-shot learning that considers various class-specific regions to calibrate the training process of the classifier. Specifically, Pyramid Feature Attention (PFA) is proposed to build the correlation between global and local information of samples to balance the presence of each class. Meanwhile, for the generated semantic representations of input samples, we propose Semantic Attention (SA) to strengthen the element-wise correlation among these vectors, which encourages a coordinated representation of them. Extensive experiments on the large-scale multi-label benchmarks MS-COCO, NUS-WIDE and Open-Images demonstrate that the proposed method surpasses other representative methods by significant margins.

Abstract:
Current point-based trackers are usually implemented with two branches: a classification branch for predicting the target candidate locations and a regression branch for regressing the tracking box, which may lead to a spatial misalignment between the two tasks. Meanwhile, they neglect to explore how to define positive and negative samples during training and how to use explicit border information for accurate box prediction. In this research, we investigate the key issues of point-based trackers and address their key limitations. First, we design a novel task-aligned component and a new loss function, named task-aligned loss, to learn the alignment of the classification and regression tasks. Second, we introduce a border alignment (BorderAlign) component in both the classification and regression branches to effectively exploit the border features of a tracking target. Third, we develop an adaptive training sample assignment (ATSA) to adaptively divide the positive and negative samples based on the statistical characteristics of the tracking object. Finally, a deformable transformer is developed to enhance the representations of search features and explore rich temporal contexts among video frames. Extensive experimental results demonstrate that the proposed tracker achieves state-of-the-art performance on six tracking benchmark datasets.

Abstract:
Deep neural networks have made significant progress in various tasks under the assumption of the same distribution between training and testing data. However, the obtained domain-specific knowledge often suffers from performance degradation when facing out-of-distribution data. Towards addressing the degradation, a critical requirement of such networks is the generalization capability to unseen domains, which is the goal of domain generalization (DG). This paper attempts to learn generalized knowledge from a single synthetic domain and then apply it to real and unknown scenarios. Specifically, we propose a contour-aware instance normalization module to effectively learn domain-invariant features via a novel weight-updating strategy, which can largely exploit the generalized information from the observed data. In addition, a category-level contrastive learning mechanism is proposed through understanding the semantic discrepancy and relevance among samples to mitigate the interference of domain-specific features on classification. Extensive experiments together with ablation studies on widely-adopted datasets are conducted to demonstrate the effectiveness of our design and show the superiority of our method over other state-of-the-art schemes on the task of urban-scene segmentation.

Abstract:
Deep metric learning has been widely used in many visual tasks. Its key idea is to increase the similarity of positive samples and decrease the similarity of negative samples through network training. To achieve this purpose, many studies excessively extend the distance between the query sample and hard negative samples. This may compress the distance between similar samples of other classes, causing these samples to cluster together. We call this phenomenon Negative Sample Aggregation. To address this problem, first, we propose a weighting method based on the Ranking Similarity of sample pairs, short for RS. The proposed weighting method can not only enlarge the distance between the query sample and hard negative samples, but also maintain the embedding distribution of proximal negative samples. Second, we propose a Top-nk sampling method, which can dynamically adjust the sampling strategy according to the distribution of a dataset. It solves the problem that the descent direction of the network gradient is inconsistent with the optimization target. The effectiveness of our methods is evaluated by extensive experiments on four public datasets and compared with that of other state-of-the-art methods. The results show that the proposed method obtains excellent performance, reaching 67.8% on CUB-200-2011 and 85.2% on Cars-196 at Recall@1.
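
Recall@1, the metric quoted above, measures the fraction of query samples whose nearest neighbor (excluding the query itself) belongs to the same class. The sketch below assumes squared Euclidean distance between embeddings and uses every sample as a query against all others; the function name and the toy data are illustrative, and the paper's exact evaluation protocol may differ.

```python
def recall_at_1(embeddings, labels):
    """Fraction of samples whose nearest other sample shares their label."""
    hits = 0
    for i, q in enumerate(embeddings):
        best_j, best_d = None, float("inf")
        for j, g in enumerate(embeddings):
            if j == i:
                continue  # a query never retrieves itself
            d = sum((a - b) ** 2 for a, b in zip(q, g))
            if d < best_d:
                best_d, best_j = d, j
        hits += labels[best_j] == labels[i]
    return hits / len(embeddings)

# Toy embeddings: two tight clusters plus one sample that sits nearer to the wrong cluster.
embeddings = [[0, 0], [0, 1], [10, 10], [10, 11], [5, 5]]
labels = [0, 0, 1, 1, 1]
print(recall_at_1(embeddings, labels))
```

A sample counts as a miss when a sample of a different class lies closer than any sample of its own class, as happens for the last sample here.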

Abstract:
Deep metric learning aims to learn an embedding space where semantically similar samples are close together and dissimilar ones are pushed apart. To explore harder and more informative training signals for augmentation and generalization, recent methods focus on generating synthetic samples to boost metric learning losses. However, these methods use only deterministic and class-independent generation (e.g., simple linear interpolation), which can cover only a limited part of the distribution space around the original samples. They overlook the wide characteristic variations of different classes and cannot model abundant intra-class variations for generation. Therefore, generated samples not only lack rich semantics within a given class, but may also be noisy signals that disturb training. In this paper, we propose a novel intra-class adaptive augmentation (IAA) framework for deep metric learning. We reasonably estimate intra-class variations for every class and generate adaptive synthetic samples to support hard sample mining and boost metric learning losses. Further, for most datasets, which have only a few samples within each class, we propose neighbor correction to revise inaccurate estimations, based on our finding that similar classes generally have similar variation distributions. Extensive experiments on five benchmarks show that our method significantly improves retrieval performance and outperforms the state-of-the-art methods by 3%-6%.

Abstract:
An objective understanding of media depictions, such as inclusive portrayals of how much someone is heard and seen on screen in film and television, requires machines to automatically discern who is talking, when, how, and where, as well as who is not. Speaker activity can be automatically discerned from the rich multimodal information present in the media content. This is, however, a challenging problem due to the vast variety and contextual variability in media content, and the lack of labeled data. In this work, we present a cross-modal neural network for learning visual representations that carry implicit information pertaining to the spatial location of a speaker in the visual frames. To avoid the need for manual annotations of active speakers in visual frames, which are very expensive to acquire, we present a weakly supervised system for the task of localizing active speakers in movie content. We use the learned cross-modal visual representations and provide weak supervision from movie subtitles acting as a proxy for voice activity, thus requiring no manual annotations. Furthermore, we propose an audio-assisted post-processing formulation for the task of active speaker detection. We evaluate the performance of the proposed system on three benchmark datasets: i) AVA active speaker dataset, ii) Visual person clustering dataset, and iii) Columbia dataset, and demonstrate the effectiveness of the cross-modal embeddings for localizing active speakers in comparison to fully supervised systems.

Abstract:
Compared with natural videos, screen content videos (SCVs) have particular features, such as abundant sharp edges, many computer-generated graphics and texts, and large flat areas. New tools have been adopted in the HEVC extensions on Screen Content Coding (HEVC-SCC), and the traditional video rate control methods for natural videos are not effective for SCVs. Therefore, a 3D-gradient guided rate control model for SCV coding, named 3DG-RC, is proposed to allocate bitrate more efficiently for SCVs. By considering the particular spatial-temporal characteristics of SCVs, a spatial and temporal feature extraction scheme is developed using a 3D-gradient filter and applied to the SCV to extract the spatial and temporal features simultaneously for guiding the bit allocation. The spatial-temporal feature similarity between three original reference SCV frames and their reconstructed ones is used to estimate the encoding parameters of the current block and frame. Experimental results demonstrate that, compared with the classical and state-of-the-art rate control methods for HEVC-SCC, the proposed 3DG-RC algorithm achieves significant bitrate mismatch reduction and coding efficiency improvement. Specifically, the proposed 3DG-RC model outperforms the rate control model in SCM-8.8 with over 41.33% and 37.95% BD-BR savings on average for the low delay B (LDB) and random access (RA) coding structures, respectively.

Abstract:
Recently, supervised deep-learning methods have shown their effectiveness on raw video denoising in low light. However, existing training datasets have specific drawbacks, e.g., inaccurate noise modeling in synthetic datasets, simple motion created by hand or fixed motion, and limited-quality ground truth caused by the beam splitter in real captured datasets. These defects significantly degrade network performance on real low-light video sequences, where noise distribution and motion patterns are extremely complex. In this paper, we collect a raw video denoising dataset in low light with complex motion and high-quality ground truth, overcoming the drawbacks of previous datasets. Specifically, we capture 210 paired videos, each containing short/long exposure pairs of real video frames with dynamic objects and diverse scenes displayed on a high-end monitor. Besides, since spatial self-similarity has been extensively utilized in image tasks, harnessing this property in network design is even more crucial for video denoising, given the additional temporal redundancy. To effectively exploit the intrinsic temporal-spatial self-similarity of complex motion in real videos, we propose a new Transformer-based network, which can effectively combine the locality of convolution with the long-range modeling ability of 3D temporal-spatial self-attention. Extensive experiments verify the value of our dataset and the effectiveness of our method on various metrics.

Abstract:
The recent development of deep learning has brought breakthroughs in image denoising. However, the recovery of image detail, especially weak high-frequency information, still needs to be improved. Firstly, noise is concentrated mainly in the high-frequency signal, which is easily disturbed and therefore difficult to recover. Secondly, in deep learning-based image denoising, the model's feature extraction smooths the noise to restore the image, resulting in poor recovery of the high-frequency signal. To solve the above problems and improve the overall image denoising performance, we propose a denoising network for complex frequency band signal processing (CFPNet), which embodies three insights: 1) the image input node uses a cosine transform to segment the image noise frequency and divides different image features into signals in different frequency bands for targeted noise reduction; 2) targeted noise reduction is carried out for different frequency band signals via a fine-grained scheme; 3) different frequency band signals are fused and high-frequency signals are enhanced to improve the recovery of detailed signals. The experimental results show that the proposed CFPNet can achieve state-of-the-art performance on both real-world datasets and Gaussian noise fitting datasets.
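
The idea of using a cosine transform to separate a signal into frequency bands can be sketched in one dimension. This is only an illustration of band splitting with an orthonormal DCT-II and its inverse, not the CFPNet input node: the paper operates on images, and the `cutoff` index and function names below are hypothetical.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of the orthonormal DCT-II above (an orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = 0.0
        for k in range(n):
            scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
            s += scale * c[k] * math.cos(math.pi * (i + 0.5) * k / n)
        out.append(s)
    return out

def split_bands(x, cutoff):
    """Return (low, high): reconstructions from coefficients below / at-or-above cutoff."""
    c = dct(x)
    low = [v if k < cutoff else 0.0 for k, v in enumerate(c)]
    high = [v if k >= cutoff else 0.0 for k, v in enumerate(c)]
    return idct(low), idct(high)

# Illustrative signal only.
signal = [1.0, 2.0, 4.0, 3.0, 0.0, 5.0, 2.0, 1.0]
low, high = split_bands(signal, 3)
print(low)
print(high)
```

Because the transform is linear and invertible, the two bands sum back to the original signal, so each band can be processed separately and the results fused afterwards.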

Abstract:
With the rapid growth in the three-dimensional (3D) printing content market, various unprecedented criminal cases and copyright protection issues have emerged. In response to this imminent and emergent difficulty, we propose a forensic technique for identifying the source of 3D printed products based only on surface inspection features. The surface texture of 3D printed objects exhibits, inevitably, extremely fine periodic features during the additive manufacturing process. We propose a two-stream texture encoder, referred to as CFTNet, combined with fast Fourier transform and positional encoding of the transformer encoder to leverage inherent periodic features occurring during the additive manufacturing. As benchmarks, we define detailed scenarios for six source identification problems and present detailed verification procedures with a large-scale benchmark dataset SI3DP++ for forensic real-world scenarios. A certain level of performance was achieved using six benchmarks, including printer and device-level identification. Moreover, we extended the baseline study based on the benchmark set to forensic test scenarios from multiple perspectives in preparation for real situations. We reveal both the dataset and detailed experimental design to provide an opportunity to facilitate future in-depth studies related to forensics and protection of intellectual property.

Abstract:
Adaptive image steganography is the process of embedding secret messages into undetectable regions of a cover image through the design of a distortion function by a steganographer. Since the state-of-the-art steganalyzers are mainly based on image residual analysis, it is reasonable to modify stego image for withstanding steganalysis by reducing or eliminating the image residual distance between cover and stego image. However, simply modifying stego images may lead to message extraction failure and the introduction of additional detectable artifacts. In this paper, we propose a novel secure steganography strategy by constructing immunized stego-image via an artificial immune system, called ISteg, which ensures the accurate extraction of hidden data while enhancing the security against steganalyzers. Inspired by the biological immune system, we use an artificial immune system (AIS) to build ISteg. Specifically, ISteg generates the immunized stego-image by automatically modifying the stego to maximize the affinity of the antibody. The affinity is developed to evaluate antibody quality according to the Euclidean distance between the residual co-occurrence matrix features of the cover image and the modified stego image. In this manner, the so-called immunized stego-image is generated. Extensive experimental results demonstrate that the proposed ISteg strategy can effectively improve the security performance of existing steganography.
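
The affinity described above depends on the Euclidean distance between feature vectors of the cover image and a modified stego image. A minimal sketch of that scoring step follows, assuming the residual co-occurrence features are already available as plain lists of numbers. The mapping `1 / (1 + d)` and the helper names are hypothetical choices for illustration, not ISteg's actual affinity function.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def affinity(cover_feat, stego_feat):
    # Hypothetical mapping: a smaller distance gives a larger affinity in (0, 1].
    return 1.0 / (1.0 + euclidean(cover_feat, stego_feat))

def best_antibody(cover_feat, candidate_feats):
    """Pick the candidate (modified stego features) with the highest affinity."""
    return max(candidate_feats, key=lambda f: affinity(cover_feat, f))
```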

Abstract:
In this paper, we study composed query image retrieval, which aims at retrieving the target image similar to the composed query, i.e., a reference image and the desired modification text. Compared with conventional image retrieval, this task is more challenging as it not only requires precisely aligning the composed query and target image in a common embedding space, but also simultaneously extracting related information from the reference image and modification text. In order to properly extract related information from the composed query, existing methods usually embed vision-language inputs using different feature encoders, e.g., CNN for images and LSTM/BERT for text, and then employ a complicated manually-designed composition module for learning the joint image-text representation. However, the architecture discrepancy between the feature encoders restricts sufficient vision-language interaction. Meanwhile, certain complicated composition designs might significantly hamper the generalization ability of the model. To tackle these problems, we propose a new framework termed ComqueryFormer, which effectively processes the composed query with the Transformer for this task. Specifically, to eliminate the architecture discrepancy, we leverage a unified transformer-based architecture to homogeneously encode the vision-language inputs. Meanwhile, instead of the complicated composition module, a neat yet effective cross-modal transformer is adopted to hierarchically fuse the composed query at various vision scales. In addition, we introduce an efficient global-local alignment module to narrow the distance between the composed query and the target image. It not only considers the divergence in the global joint embedding space but also forces the model to focus on the local detail differences. Extensive experiments on three real-world datasets demonstrate the superiority of our ComqueryFormer.

Abstract:
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query. Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information. Such frame-level feature extraction leads to the obstacles of these methods in distinguishing ambiguous video frames with complicated contents and subtle appearance differences, thus limiting their performance. In order to differentiate fine-grained appearance similarities among consecutive frames, some state-of-the-art methods additionally employ a detection model like Faster R-CNN to obtain detailed object-level features in each frame for filtering out the redundant background contents. However, these methods suffer from missing motion analysis since the object detection module in Faster R-CNN lacks temporal modeling. To alleviate the above limitations, in this paper, we propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features to better reason the spatial-temporal object relations for accurately modelling the activity among consecutive frames. Specifically, we first develop three individual branches for motion, appearance, and 3D encoding separately to learn fine-grained motion-guided, appearance-guided, and 3D-aware object features, respectively. Then, both motion and appearance information from corresponding branches are associated to enhance the 3D-aware features for the final precise grounding. Extensive experiments on three challenging datasets (ActivityNet Caption, Charades-STA and TACoS) demonstrate that the proposed MA3SRN model achieves a new state-of-the-art.

Abstract:
Motion prediction from raw LiDAR sensor data has drawn increasing attention and led to a surge of studies following two main paradigms. One is the global motion paradigm, which simultaneously detects objects from point clouds and predicts the future trajectory of each object. The other is the local motion paradigm, which directly performs dense point-wise motion prediction. We observe that global motion prediction can benefit from local motion representation, since it contains rich local displacement contexts that are not explicitly exploited in global motion prediction. Correspondingly, local motion prediction can benefit from global motion representation, since it provides object contexts to improve prediction consistency inside an object. However, the complementarity of these two motion representations has not been fully explored in the literature. To this end, we propose Hybrid Motion Representation Learning (HyMo), a unified framework to address the problem of motion prediction by making the best of both global and local motion cues. We have conducted extensive experiments on the nuScenes dataset. The experimental results demonstrate that the learned hybrid motion representation achieves state-of-the-art performance on both global and local motion prediction tasks.

Abstract:
Existing defocus blur detection (DBD) methods generally perform well on a single type of unfocused blur scene (e.g., foreground focus) and therefore suffer performance degradation on other types of unfocused blur scenes. In this paper, we present the first exploration of full-scene DBD, and propose a separate-and-combine framework to achieve excellent performance for diverse defocus blur scenes. We first construct a full-scene DBD dataset (named DeFBD+) by collecting more types of unfocused blur scenes (e.g., background focus, full focus and full out of focus) with pixel-level annotations. Then, to avoid performance degradation caused by mutual interference between local feature representation and global content perception, we implement a pixel-level DBD network and an image-level DBD classification network to learn these two abilities separately. After that, we propose an isomeric distillation mechanism to combine these two abilities. Extensive experiments show that the proposed approach achieves superior performance compared with state-of-the-art methods.

Abstract:
Most person re-identification (Re-ID) approaches rely excessively on a great quantity of annotated training data. However, due to sampling errors or annotation errors, label noise is unavoidable, which usually causes a dramatic decrease in the performance of existing Re-ID methods. To address this problem, we propose label reliability perception (LRP) for person Re-ID by refining noisy labels. Specifically, a feature-fusion block (FFB) is proposed to enhance the discriminability of pedestrians’ features by expanding the network's attention span through the fused feature, which is generated by overlapping the coarse-grained feature obtained by global average pooling and fine-grained features obtained by evenly dividing the feature map in the height dimension and performing global max pooling. In addition, label dual perception (LDP) is proposed to refine noisy labels instead of filtering samples by evaluating the reliability of each training sample's label. Specifically, we meticulously design five evaluation modes for each sample to perceive the reliability of the labels of the k-nearest neighbor images. Finally, we utilize the most reliable label to replace the noisy label and optimize the network. Extensive experiments prove the superiority of the proposed model over the competing methods; for instance, on Market1501, our method achieves 88.8% rank-1 accuracy and 70.5% mAP (4.7% and 4.3% improvements over the state of the art) under noise ratio 20%, and similarly on DukeMTMC-ReID, our method achieves 77.7% and 60.3%.
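
The two pooling operations behind the fused feature can be sketched on a toy single-channel feature map: global average pooling yields one coarse value, while splitting the map evenly along the height and max-pooling each stripe yields fine-grained values. The function names and the simple concatenation used for fusion are assumptions for illustration; the abstract does not detail how the FFB combines the two.

```python
def global_avg_pool(fmap):
    """Average over every position of an H x W feature map (list of rows)."""
    vals = [v for row in fmap for v in row]
    return sum(vals) / len(vals)

def striped_max_pool(fmap, parts):
    """Split the map evenly along the height and max-pool each stripe.

    Assumes the height is divisible by `parts`.
    """
    step = len(fmap) // parts
    return [
        max(v for row in fmap[i * step:(i + 1) * step] for v in row)
        for i in range(parts)
    ]

def fuse(fmap, parts):
    # Hypothetical fusion: concatenate the coarse and fine-grained features.
    return [global_avg_pool(fmap)] + striped_max_pool(fmap, parts)
```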

Abstract:
Recent Transformer architectures (Vaswani et al., 2017) have brought remarkable improvements to visual question answering (VQA). Nevertheless, Transformer-based VQA models are usually deep and wide to guarantee good performance, so they can only run on powerful GPU servers and cannot run on capacity-restricted platforms such as mobile phones. Therefore, it is desirable to learn an elastic VQA model that supports adaptive pruning at runtime to meet the efficiency constraints of different platforms. To this end, we present the bilaterally slimmable Transformer (BST), a general framework that can be seamlessly integrated into arbitrary Transformer-based VQA models to train a single model once and obtain various slimmed submodels of different widths and depths. To verify the effectiveness and generality of this method, we integrate the proposed BST framework with three typical Transformer-based VQA approaches, namely MCAN (Yu et al., 2019), UNITER (Chen et al., 2020), and CLIP-ViL (Shen et al., 2021), and conduct extensive experiments on two commonly-used benchmark datasets. In particular, one slimmed MCAN_BST submodel achieves comparable accuracy on VQA-v2, while being 0.38× smaller in model size and having 0.27× fewer FLOPs than the reference MCAN model. The smallest MCAN_BST submodel only has 9 M parameters and 0.16 G FLOPs during inference, making it possible to deploy it on a mobile device with less than 60 ms latency.

Abstract:
Domain generalization aims to generalize a network trained on multiple domains to unknown yet related domains. Operating under the assumption that invariant information generalizes well to unknown domains, previous work has aimed to minimize the discrepancies amongst distributions across given domains. However, without prior regularization of feature distributions, the network in practice overfits the invariant information in the given domains. Moreover, if there are insufficient samples in given domains, then domain generalizability is limited, as diverse domain variations are not captured. To address these two drawbacks, we propose to explicitly map features in known and unknown domains onto latent space in a fixed Gaussian mixture distribution by variational coding. As a result, features in different classes follow Gaussian distributions with different mean values. The predefined latent space narrows discrepancies between known and unknown domains and effectively separates samples into different classes. Moreover, we propose to perturb sample features with gradients from the distribution regularized loss. This perturbation generates samples beyond but near the latent space of prior distributions, which has a profound impact on domain variations. Experiments and visualizations demonstrate the effectiveness of our proposed method.

Abstract:
Driver drowsiness is an important cause of traffic accidents. Many studies using computer vision techniques to detect driver drowsiness states, such as slow blinking, yawning, and nodding, have demonstrated excellent potential. Although existing studies have made significant progress, the number of samples in the training corpora is small, which makes it difficult for a model to learn effective drowsiness representations from images or videos. To address this issue, we develop an isotropic self-supervised learning (IsoSSL) approach to learn powerful representations of images without relying on human-provided annotations and propose an IsoSSL-MoCo model by combining IsoSSL with momentum contrast (MoCo). To exploit the complementarity of multimodal data, an attention-based multimodal fusion model is also proposed to fuse features from the eye, mouth, and optical flow of the head. Specifically, we first use the IsoSSL-MoCo model to pretrain the image encoders for the three modalities in other datasets. Then, these encoders are fine-tuned and integrated into the proposed fusion model. The feature vectors generated by the image encoders of the three modalities are fed into the recursive layer to extract temporal information. To capture the importance degrees of the effects of temporal features from the three modalities on drowsiness detection, an attention mechanism is introduced to automatically weigh the feature vectors from the recursive layer to improve detection accuracy. Finally, a vector representation is generated by the attention layer and is used to detect driver drowsiness states. Experimental results based on two challenging datasets show that our method outperforms the baseline methods and the latest existing methods.

Abstract:
3D anthropometric measurement extraction is of paramount importance for several applications such as clothing design, online garment shopping, and medical diagnosis, to name a few. State-of-the-art 3D anthropometric measurement extraction methods estimate the measurements either through some landmarks found on the input scan or by fitting a template to the input scan using optimization-based techniques. Finding landmarks is very sensitive to noise and missing data. Template-based methods address this problem, but the employed optimization-based template fitting algorithms are computationally very complex and time-consuming. To address the limitations of existing methods, we propose a deep neural network architecture which fits a template to the input scan and outputs the reconstructed body as well as the corresponding measurements. Unlike existing template-based anthropometric measurement extraction methods, the proposed approach does not need to transfer and refine the measurements from the template to the deformed template, thereby being faster and more accurate. A novel loss function, especially developed for 3D anthropometric measurement extraction, is introduced. Additionally, two large datasets of complete and partial front-facing scans are proposed and used in training. This results in two models, dubbed Anet-complete and Anet-partial, which extract the body measurements from complete and partial front-facing scans, respectively. Experimental results on synthesized data as well as on real 3D scans captured by a photogrammetry-based scanner, an Azure Kinect sensor, and the very recent TrueDepth camera system demonstrate that the proposed approach systematically outperforms the state-of-the-art methods in terms of accuracy and robustness.

Abstract:
Image-based vehicle re-identification (ReID) has witnessed much progress in recent years. However, most existing works struggle to extract robust but discriminative features from a single image to represent one vehicle instance. We argue that images taken from distinct viewpoints, e.g., front and back, have significantly different appearances and patterns for recognition. In order to identify each vehicle, these models have to capture consistent “ID codes” from totally different views, causing learning difficulties. Additionally, we claim that part-level correspondences among views, i.e., various vehicle parts observed from the identical image and the same part visible from different viewpoints, contribute to instance-level feature learning as well. Motivated by these observations, we propose to extract comprehensive vehicle instance representations from multiple views through modelling part-wise correlations. To this end, we present our efficient transformer-based framework to exploit both inner- and inter-view correlations for vehicle ReID. Specifically, we first adopt a convnet encoder to condense a series of patch embeddings from each view. Then our efficient transformer, consisting of a distillation token and a noise token in addition to a regular classification token, is constructed for enforcing these patch embeddings to interact with each other regardless of whether they are taken from identical or different views. We conduct extensive experiments on widely used vehicle ReID benchmarks, and our approach achieves state-of-the-art performance, showing the effectiveness of our method.

Abstract:
Compression technology for representing images is in demand for efficiently processing images in the Big Data era. Image hashing is an effective compression technology for computing a short representation based on the visual content of an input image. Currently, most reported image hashing algorithms are weak at striking a desirable balance between discrimination and robustness and thus cannot reach good performance in copy detection. To address these issues, this paper proposes a new robust image hashing with Isometric Mapping (Isomap) and saliency map for copy detection. A key contribution is hash generation with a saliency map determined by the Frequency Tuned (FT) method, which can guarantee robustness of the proposed image hashing. Another contribution is the use of Isomap in deriving the hash from the FT-based saliency map. Since Isomap can discover the internal geometry features of an image, the use of Isomap can learn discriminative image features and thus discrimination of the proposed image hashing is ensured. Experiments on open image databases are carried out. Comparison results illustrate that the proposed image hashing is better than some state-of-the-art algorithms in classification and copy detection performance.

Abstract:
Weakly supervised image segmentation trained with image-level labels usually suffers from inaccurate coverage of object areas during the generation of the pseudo groundtruth. This is because the object activation maps are trained with the classification objective and lack the ability to generalize. To improve the generality of the object activation maps, we propose a region prototypical network (RPNet) to explore the cross-image object diversity of the training set. Similar object parts across images are identified via region feature comparison. Object confidence is propagated between regions to discover new object areas while background regions are suppressed. Experiments show that the proposed method generates more complete and accurate pseudo object masks while achieving state-of-the-art performance on PASCAL VOC 2012 and MS COCO. In addition, we investigate the robustness of the proposed method on reduced training sets. The code is available at https://github.com/liuweide01/RPNet-Weakly-Supervised-Segmentation.

Abstract:
Currently, an increasing number of applications and services have encouraged users to openly express their emotions via images. Unlike visual sentiment classification, visual sentiment distribution learning exploits the overall distribution to represent the relative importance of sentiment labels. Considering that most relevant studies have failed to completely model correlation structures or explicitly apply them to unknown instances, in this paper, we propose a low-rank latent Gaussian graphical model estimation (LGGME) method for visual sentiment distribution learning tasks. There are three main characteristics of LGGME: 1) an integrated inverse covariance matrix whose parameters characterize the latent correlation structures between and within features and sentiments is estimated based on the sparse Gaussian graphical model; 2) a multivariate normal assumption is assigned on the concatenated latent feature representations and the estimated sentiment distributions instead of the original observations for a reasonable surrogate; and 3) the latent feature representations are projected from a low-rank subspace, which is also available for unseen instances, and the estimated sentiment distributions are evaluated by KL divergence to ensure a suitable setting for distribution learning. We further develop an effective optimization algorithm based on the alternating direction method of multipliers (ADMM) for our objective function. The experimental results obtained on three publicly available datasets demonstrate the superiority of our proposed method.
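
For readers unfamiliar with the evaluation criterion mentioned in point 3), the following is a minimal sketch of the KL divergence between a ground-truth sentiment distribution `p` and an estimated distribution `q`, both given as lists of probabilities over the same labels. The function name and the `eps` guard against division by zero are illustrative assumptions, not part of LGGME.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same labels.

    Terms with p_i == 0 contribute nothing; q_i is clamped to `eps`
    (an assumption of this sketch) to avoid division by zero.
    """
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )
```

The divergence is zero when the two distributions coincide and grows as the estimate moves away from the ground truth, which is why it suits distribution learning better than a single-label accuracy measure.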

Abstract:
Despite the great progress in 3D human pose estimation from videos, it is still an open problem to take full advantage of a redundant 2D pose sequence to learn representative representations for generating one 3D pose. To this end, we propose an improved Transformer-based architecture, called Strided Transformer, which simply and effectively lifts a long sequence of 2D joint locations to a single 3D pose. Specifically, a Vanilla Transformer Encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence, fully-connected layers in the feed-forward network of VTE are replaced with strided convolutions to progressively shrink the sequence length and aggregate information from local contexts. The modified VTE is termed the Strided Transformer Encoder (STE), which is built upon the outputs of VTE. STE not only effectively aggregates long-range information to a single-vector representation in a hierarchical global and local fashion, but also significantly reduces the computation cost. Furthermore, a full-to-single supervision scheme is designed at both full sequence and single target frame scales applied to the outputs of VTE and STE, respectively. This scheme imposes extra temporal smoothness constraints in conjunction with the single target frame supervision and hence helps produce smoother and more accurate 3D poses. The proposed Strided Transformer is evaluated on two challenging benchmark datasets, Human3.6M and HumanEva-I, and achieves state-of-the-art results with fewer parameters. Code and models are available at https://github.com/Vegetebird/StridedTransformer-Pose3D.
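
To make the progressive shrinking concrete, the sketch below computes how a sequence length changes through a stack of strided 1D convolutions using the standard output-length formula. The kernel size, strides, and input lengths in the usage note are assumed values for illustration and are not taken from the paper's configuration.

```python
def conv_out_len(length, kernel, stride, padding=0):
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1

def shrink_schedule(length, kernel, strides):
    """Sequence length after each strided layer, starting with the input length."""
    lengths = [length]
    for s in strides:
        length = conv_out_len(length, kernel, s)
        lengths.append(length)
    return lengths
```

With an assumed kernel size of 3 and stride 3 at every layer, `shrink_schedule(27, 3, [3, 3, 3])` reduces a 27-frame sequence to a single vector in three steps, which matches the idea of lifting a full sequence to one target-frame representation.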

Abstract:
Face representation in the wild is extremely hard due to large-scale face variations. Some deep convolutional neural networks (CNNs) have been developed to learn discriminative features by properly designing margin-based losses, which perform well on easy samples but fail on hard samples. Although some methods mainly adjust the weights of hard samples in the training stage to improve feature discrimination, they overlook the distribution property of the features. It is worth noting that the misclassified hard samples may be corrected from the feature distribution view. To overcome this problem, this paper proposes the hard samples guided optimal transport (OT) loss for deep face representation, OTFace for short. OTFace aims to enhance the performance on hard samples by introducing the feature distribution discrepancy while maintaining the performance on easy samples. Specifically, we embrace a triplet scheme to indicate hard sample groups in one mini-batch during training. OT is then used to characterize the distribution differences of features from the high-level convolutional layer. Finally, we integrate the margin-based softmax (e.g., ArcFace or AM-Softmax) and OT together to guide deep CNN learning. Extensive experiments were conducted on several benchmark databases. The quantitative results demonstrate the advantages of the proposed OTFace over state-of-the-art methods.

Abstract:
With the growing importance of preventing the COVID-19 virus in cyber-manufacturing security, face images obtained in most video surveillance scenarios are usually low resolution together with mask occlusion. However, most of the previous face super-resolution solutions can not efficiently handle both tasks in one model. In this work, we consider both tasks simultaneously and construct an efficient joint learning network, called JDSR-GAN, for masked face super-resolution tasks. Given a low-quality face image with mask as input, the role of the generator composed of a denoising module and super-resolution module is to acquire a high-quality high-resolution face image. The discriminator utilizes some carefully designed loss functions to ensure the quality of the recovered face images. Moreover, we incorporate the identity information and attention mechanism into our network for feasible correlated feature expression and informative feature learning. By jointly performing denoising and face super-resolution, the two tasks can complement each other and attain promising performance. Extensive qualitative and quantitative results show the superiority of our proposed JDSR-GAN over some competitive methods.

Abstract:
Wearing masks can effectively inhibit the spread and damage of COVID-19. A device-edge-cloud collaborative recognition architecture is designed in this paper, and our proposed device-edge-cloud collaborative recognition acceleration method can make full use of the geographically widespread computing resources of devices, edge servers, and cloud clusters. First, we establish a hierarchical collaborative occluded face recognition model, including a lightweight occluded face detection module and a feature-enhanced elastic margin face recognition module, to achieve the accurate localization and precise recognition of occluded faces. Second, considering the responsiveness of occluded face detection services, a context-aware acceleration method is devised for collaborative occluded face recognition to minimize the service delay. Experimental results show that compared with state-of-the-art recognition models, the proposed acceleration method leveraging device-edge-cloud collaborations can effectively reduce the recognition delay by 16% while retaining the equivalent recognition accuracy.

Abstract:
Deep dictionary learning (DDL) aims to learn dictionaries at different levels and the deepest level representations. However, existing DDL algorithms impose an l_1-norm constraint on the deepest level representations, ignoring the constraints on different level representations. Meanwhile, they fail to effectively discover the essential discrimination information. Therefore, the obtained representations are less discriminative, which degrades model performance. To tackle these issues, we propose an intra- and inter-class induced discriminative deep dictionary learning (DDDL). Specifically, both intra-class compactness and inter-class separability of layer-wise data representations are newly devised as two discriminative constraints on deep dictionary learning. In a hierarchical structure, we obtain a more informative dictionary and the class-specific representations are thus more discriminative at each layer. Due to the l_2-norm intra- and inter-class constraints of layer-wise data representation, we devise a layer-wise optimization strategy to efficiently learn the closed-form solution of the deepest representation for classification. Comprehensive experiments and analyses on several visual recognition tasks show that our DDDL model surpasses recent shallow and deep representation learning approaches.

Abstract:
Automatic facial action unit (AU) recognition is a challenging task due to the scarcity of manual annotations. To alleviate this problem, considerable effort has been dedicated to exploiting various weakly supervised methods which leverage abundant unlabeled data. However, many aspects with regard to some unique properties of AUs, such as the regional and relational characteristics, are not sufficiently explored in previous works. Motivated by this, we take the AU properties into consideration and propose two auxiliary AU related tasks to bridge the gap between limited annotations and the model performance in a self-supervised manner via the unlabeled data. Specifically, to enhance the discrimination of regional features with AU relation embedding, we design a task of RoI inpainting to recover the randomly cropped AU patches. Meanwhile, a single image based optical flow estimation task is proposed to leverage the dynamic change of facial muscles and encode the motion information into the global feature representation. Based on these two self-supervised auxiliary tasks, local features, mutual relation and motion cues of AUs are better captured in the backbone network. Furthermore, by incorporating semi-supervised learning, we propose an end-to-end trainable framework named weakly supervised regional and temporal learning (WSRTL) for AU recognition. Extensive experiments on BP4D and DISFA demonstrate the superiority of our method and new state-of-the-art performances are achieved.

Abstract:
Although monocular 3D human pose estimation methods have made significant progress, the problem is far from solved due to the inherent depth ambiguity. Instead, exploiting multi-view information is a practical way to achieve absolute 3D human pose estimation. In this paper, we propose a simple yet effective pipeline for weakly-supervised cross-view 3D human pose estimation. By only using two camera views, our method can achieve state-of-the-art performance in a weakly-supervised manner, requiring no 3D ground truth but only 2D annotations. Specifically, our method contains two steps: triangulation and refinement. First, given the 2D keypoints that can be obtained through any classic 2D detection method, triangulation is performed across two views to lift the 2D keypoints into coarse 3D poses. Then, a novel cross-view U-shaped graph convolutional network (CV-UGCN), which can explore the spatial configurations and cross-view correlations, is designed to refine the coarse 3D poses. In particular, the refinement process is achieved through weakly-supervised learning, in which geometric and structure-aware consistency checks are performed. We evaluate our method on the standard benchmark dataset, Human3.6M. The Mean Per Joint Position Error on the benchmark dataset is 27.4 mm, which outperforms existing state-of-the-art methods remarkably (27.4 mm vs 30.2 mm).
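
The reported metric, Mean Per Joint Position Error (MPJPE), is the average Euclidean distance between predicted and ground-truth joint positions. A minimal sketch follows, assuming each pose is a list of `(x, y, z)` tuples in millimetres with joints in the same order; the function name and data layout are illustrative and do not come from the paper's code.

```python
import math

def mpjpe(pred, gt):
    """Mean Euclidean distance between corresponding 3D joints."""
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(p, g)))
        for p, g in zip(pred, gt)
    ]
    return sum(dists) / len(dists)
```

For example, two joints that are off by 3 mm and 4 mm respectively give an MPJPE of 3.5 mm.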

Abstract:
Multi-person action forecasting is an emerging topic in the computer vision field, and it is a pivotal step toward video understanding at a semantic level. This task is difficult due to the complexity of spatial and temporal dependencies. Yet, the state-of-the-art literature does not seem to be adequately responsive to this challenge. Hence, how to better foresee the forthcoming actions per actor has to be further pursued. Toward this end, we put forth a novel RElational Spatio-TEmPoral learning approach (RESTEP) for multi-person action forecasting. Our RESTEP explores the key that inherently characterizes actions from a perspective of incorporating the spatial and temporal information in a single pass (spatio-temporal dependencies) by extending relational reasoning. As a result, the RESTEP enables simultaneously predicting the actions of all actors in the scene. Our proposal significantly differs from mainstream works that heavily rely on independently processing the spatial and temporal dependencies. The proposed RESTEP first perceives a graph building upon the historical observations, then reasons the relational spatio-temporal context to extrapolate future actions. In order to augment the comprehension of individual actions that might vary over time, we further delve deeper into the essence behind this point – the evolution of spatio-temporal dependencies via optimizing the corresponding mutual information. We assess the RESTEP method on the large-scale Atomic Visual Actions (AVA) dataset, Activities in Extended Videos (ActEV/VIRAT) dataset and Joint-annotated Human Motion Data Base (J-HMDB). The experimental outcomes reveal that RESTEP can introduce considerable improvements with respect to recent leading studies.

Abstract:
Recent progress in salient object detection (SOD) mainly depends on the Atrous Spatial Pyramid Pooling (ASPP) module for multi-scale learning. Intuitively, different input images, different pixels, and different network layers may have different preferences for various feature scales. However, ASPP treats all feature scales as equally important by a simple sum operation. To this end, we propose Attentive Atrous Spatial Pyramid Pooling (A2SPP) by adding a new Cubic Information-Embedding Attention (CIEA) module at each branch of ASPP. In this way, each position in the 3D feature map can automatically learn the feature scales it prefers. Specifically, CIEA consists of Spatial-Embedding Channel Attention (SECA) and Channel-Embedding Spatial Attention (CESA). Instead of the previous direct squeeze and ignoring of one dimension when computing the attention for the other dimension, SECA/CESA attempts to embed spatial/channel information into channel/spatial attention, respectively. In addition, CIEA learns SECA and CESA for each 3D position simultaneously rather than previous separate computation of channel and spatial attention for each 2D position. Incorporating A2SPP and CIEA, the proposed A2SPPNet performs favorably against previous state-of-the-art SOD methods.

Abstract:
Blind face inpainting refers to the task of reconstructing visual contents without explicitly indicating the corrupted regions in a face image. Inherently, this task faces two challenges: (1) how to detect various mask patterns of different shapes and contents; (2) how to restore visually plausible and pleasing contents in the masked regions. In this paper, we propose a novel two-stage blind face inpainting method named Frequency-guided Transformer and Top-Down Refinement Network (FT-TDR) to tackle these challenges. Specifically, we first use a transformer-based network to detect the corrupted regions to be inpainted as masks by modeling the relation among different patches. For improved detection results, we also exploit the frequency modality as complementary information and capture the local contextual incoherence to enhance boundary consistency. Then a top-down refinement network is proposed to hierarchically restore features at different levels and generate contents that are semantically consistent with the unmasked face regions. Extensive experiments demonstrate that our method outperforms current state-of-the-art blind and non-blind face inpainting methods qualitatively and quantitatively.

Abstract:
Millions of people post images and texts to express their feelings and points of view on social media every day, especially on short text social media such as Twitter or Weibo. As the images can provide important supplementary information for the text, many multimodal topic models have been developed to mine the topics from the multimodal social media content. We summarize three fundamental characteristics of the short text multimodal social media. The first is that the text of a short social media document generally belongs to only one topic. The second is that the attached images can be relevant to multiple topics due to the rich information expressed in the images. The last is that although in most cases, text and images in social media posts are relevant, it should be noted that in a small number of cases, text and pictures are not relevant. However, most of the current multimodal topic models fail to model these characteristics, and thus may produce low-quality topics. Based on these characteristics, we propose an unsupervised multimodal topic model SMMTM to model the short text multimodal social media documents. In the SMMTM model, only one topic is sampled for the text while an image can belong to different topics. The correlation of the topics between the text and the images in a document is also formulated in an appropriate way. The experiments on three short text social media datasets with four evaluation metrics show the advantages of our model over the existing models.

Abstract:
Deep-learning based watermarking frameworks have been extensively studied recently. The main structure of such a framework is an encoder, a noise layer and a decoder. By training with different distortion sets in the noise layer, the whole network can realize different robustness. However, such a framework has a major drawback: the noise layer must be differentiable, otherwise the network cannot be trained end-to-end. In practical use, many distortions are non-differentiable, so such a framework cannot be applied. To address this limitation, this paper proposes a triple-phase watermarking framework for practical distortions. The proposed framework consists of three phases including a noise-free initial phase, a mask-guided frequency enhancement phase and an adversarial-training phase. Phase 1 aims to initialize an encoder to embed the watermark with high visual quality and a decoder to extract the watermark. In order to generate high-quality watermarked images, we design the just noticeable difference (JND)-mask image loss in phase 1 to guide the encoder. In phase 2, based on the investigation of the encoded features and distortions, we propose a mask-guided frequency enhancement algorithm to enhance the encoded features, which ensures the survival of such features after distortion, so that there will be enough features to be learned in phase 3. Phase 3 aims to train a stronger decoder to extract the watermark from the image after practical distortions. The combination of these three phases can handle the non-differentiability problem well and make the whole network trainable. Various experiments indicate the superior performance of the proposed scheme in terms of robustness to both traditional differentiable image processing distortions and practical non-differentiable distortions.

Abstract:
High-definition (HD) live video streaming has gained significant popularity due to the rapid growth of 4G/5G and social media. However, devices with constrained bandwidth still lack sufficient bandwidth to support HD live video streaming. In this paper, we propose a neural-enhanced HD live video streaming framework called LiveSR to provide universal HD live video streaming for both bandwidth-constrained and bandwidth-rich devices. For bandwidth-constrained devices, LiveSR delivers low-quality video streams and then boosts video quality at the device side with super-resolution (SR) techniques. The difficulty lies in how to train the SR model with low cost and conduct quality enhancement in real time. To address these challenges, we design a crowdsourced online training method by exploiting computation resources and HD video data on bandwidth-rich devices in the same video channel. We also propose an imitation learning-based decision making algorithm to make downloading decisions for video chunks and SR models under limited bandwidth. We implement and evaluate our proposed LiveSR framework using real network traces, and the experiment results show that LiveSR outperforms all the other baseline approaches, with 65.5% improvement in terms of the average QoE and 5.7% in terms of video quality (i.e., PSNR), and the achieved frame rate can be as high as 30 frames per second.
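
Since video quality is reported here as PSNR, a minimal sketch of how that metric is conventionally computed may help. The function name `psnr`, the nested-list frame representation, and the 8-bit peak value of 255 are assumptions for illustration; the abstract does not state how the authors computed or averaged PSNR over frames.

```python
import math


def psnr(reference, distorted, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames.

    Frames are nested lists of pixel intensities; max_value is the peak
    intensity (255 for 8-bit content, an assumption here).
    """
    ref = [p for row in reference for p in row]
    dis = [p for row in distorted for p in row]
    if len(ref) != len(dis):
        raise ValueError("frames must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, dis)) / len(ref)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


hd_frame = [[10, 20], [30, 40]]
sr_frame = [[11, 19], [31, 39]]  # every pixel off by one intensity level
quality_db = psnr(hd_frame, sr_frame)
```

Higher values mean the enhanced frame is closer to the HD reference; an MSE of exactly one intensity level, as in this toy pair, gives roughly 48 dB for 8-bit content.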

Abstract:
Reversible data hiding in encrypted images (RDHEI) can be used to realize privacy protection and management in the image outsourcing scenario. Most existing RDHEI schemes focus on increasing the maximum embedding rate (Max-ER) but pay little attention to improving security under various attacks. In this paper, an RDHEI method based on adaptive bit-plane (ABP) coding is proposed to improve the Max-ER. The order-index extended scrambling (OIES) encryption scheme is also developed to strengthen the RDHEI's ability to thwart various attacks. The effectiveness of ABP coding is achieved by proper selection of the threshold. The OIES enables the design of a novel scramble-key (SK) generation method to greatly reduce the probability of generating the same SK by the same user-key. This significantly improves the ability to resist various attacks, in that the attack on the scrambling encryption is mainly via the SK rather than the user-key estimation. Analysis shows that the probability of OIES obtaining the same SK is reduced from 1.0 to 0.01 for different images and to 1/2α for the same image. Simulation results demonstrate that the proposed ABP coding and OIES schemes outperform the state-of-the-art RDHEI algorithms in terms of the Max-ER and ability against various attacks.

Abstract:
Improving user experience during the delivery of immersive content is crucial for its success for both the content creators and audience. Creators can express themselves better with multisensory stimulation, while the audience can experience a higher level of involvement. The rapid development of mulsemedia devices provides better access for stimuli such as olfaction and haptics. Nevertheless, due to the required manual annotation process of adding mulsemedia effects, the amount of content available with sensorial effects is still limited. This work introduces an innovative mulsemedia-enhancement solution capable of automatically generating olfactory and haptic content based on 360° video content, with the use of neural networks. Two parallel neural networks are responsible for automatically adding scents to 360° videos: a scene detection network (responsible for static, global content) and an action detection network (responsible for dynamic, local content). A 360° video dataset with scent labels is also created and used for evaluating the robustness of the proposed solution. The solution achieves a 69.19% olfactory accuracy and 72.26% haptics accuracy during evaluation using two different datasets.

Abstract:
Webly-supervised fine-grained visual classification (FGVC) has attracted increasing attention in recent years because of the unaffordable cost of obtaining correctly-labeled large-scale fine-grained datasets. However, due to the existence of label noise in web images and the high memorization capacity of deep neural networks, training deep fine-grained (FG) models directly through web images tends to have an inferior recognition ability. In the literature, to alleviate this issue, loss correction methods try to estimate the noise transition matrix, but the inevitable false correction would cause accumulated errors. Sample selection methods identify clean (“easy”) samples based on the fact that small losses can alleviate the accumulated errors. However, “hard” and mislabeled examples that can both boost the robustness of FG models are also dropped. To this end, we propose a certainty-based reusable sample selection and correction approach, termed as CRSSC, for coping with label noise in training deep FG models with web images. Our key idea is to additionally identify and correct reusable samples, and then leverage them together with clean examples to update the network. Furthermore, in order to endow our model with the capability to capture richer and more discriminative feature representations, we propose a cross-layer attention-based feature refinement (CLAR) block. We demonstrate the superiority of the proposed approach from both theoretical and experimental perspectives.

Abstract:
With the rapid development of 3D construction technology, 3D models have been implemented in many applications. In particular, the fields of virtual and augmented reality have created a considerable demand for rapid access to large sets of 3D models in recent years. An effective method for addressing the demand is to search 3D models based on 2D images because 2D images can be easily captured by smartphones or other lightweight vision sensors. In this paper, we propose a novel unsupervised cross-media graph convolutional network (UCM-GCN) for 3D model retrieval based on 2D images. Here, we render views from 3D models to construct a graph model based on 3D model structural information. Then, we utilize the 2D image's visual information to bridge the gap between cross-modality data. Then, the proposed UCM-GCN is utilized to update the feature vector of the 2D image and the 3D model. Here, we introduce correlation loss to mitigate the distribution discrepancy across different modalities, which can fully consider the structural and visual similarities between the 2D image and 3D model to embed the final different modalities into the same feature space. To demonstrate the performance of our approach, we conducted a series of experiments on the MI3DOR dataset, which is utilized in SHREC19. We also compared it with other similar methods on the 3D-FUTURE dataset. The experimental results demonstrate the superiority of our proposed method over state-of-the-art methods.

Affiliations: School of Computer Science and Technology, Xi’an University of Posts and Telecommunications, Xi’an, Shaanxi, China; NPU-VUB Joint AVSP Research Laboratory, Shaanxi Key Laboratory on Speech and Image Information Processing, National Engineering Laboratory for Integrated Aero-Space-Ground-Ocean Big Data Application Technology, School of Computer Science, Northwestern Polytechnical University (NPU), Xi’an, China; Department ETRO, Vrije Universiteit Brussel (VUB), Brussels, Belgium; Department of Electronics & Informatics (ETRO), VUB-NPU joint AVSP Research Lab, Vrije Universiteit Brussel (VUB), Brussels, Belgium

Abstract:
Continuous affective state estimation from facial information is a task which requires the prediction of time series of emotional state outputs from a facial image sequence. Modeling the spatial-temporal evolution of facial information plays an important role in affective state estimation. One of the most widely used methods is Recurrent Neural Networks (RNN). RNNs provide an attractive framework for propagating information over a sequence using a continuous-valued hidden layer representation. In this work, we propose to instead learn rich affective state dynamics. We model human affect as a dynamical system and define the affective state in terms of valence, arousal and their higher-order derivatives. We then pose the affective state estimation problem as a jointly trained state estimator for high-dimensional input images, combining an RNN and a Bayesian Filter, i.e. Kalman filters (KF) and Extended Kalman filters (EKF), so that all weights in the resulting network can be trained using backpropagation. We use a recently proposed general framework for designing and learning discriminative state estimators framed as computational graphs. Such an approach can handle high-dimensional observations and efficiently optimize the state estimator in an end-to-end fashion. In addition, to deal with the asynchrony between emotion labels and input images, caused by the inherent reaction lag of the annotators, we introduce a convolutional layer that aligns features with emotion labels. Experimental results on the RECOLA and SEMAINE datasets for continuous emotion prediction illustrate the potential of the proposed framework compared to recent state-of-the-art models.
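
To make the state-space view concrete, the following is a minimal sketch of a plain linear Kalman filter over an affective state made of valence, arousal, and their first derivatives. It is not the jointly trained RNN and KF/EKF estimator described above: the constant-velocity transition `F`, the observation model `H`, the noise covariances `Q` and `R`, and the synthetic observations (standing in for per-frame network outputs) are all assumptions chosen for illustration.

```python
import numpy as np


def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a linear Kalman filter."""
    # Predict: propagate the state and its covariance through the dynamics.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the new observation z.
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new


dt = 1.0
# State: [valence, arousal, d(valence)/dt, d(arousal)/dt], constant-velocity model.
F = np.array([[1.0, 0.0, dt, 0.0],
              [0.0, 1.0, 0.0, dt],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
# Only valence and arousal are observed (hypothetically, from a per-frame network).
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0]])
Q = 1e-4 * np.eye(4)   # process noise (assumed)
R = 1e-2 * np.eye(2)   # observation noise (assumed)

x = np.zeros(4)
P = np.eye(4)
# Synthetic observations: valence rises by 0.1 per frame, arousal falls by 0.05.
observations = [np.array([0.1 * t, -0.05 * t]) for t in range(1, 21)]
for z in observations:
    x, P = kalman_step(x, P, z, F, H, Q, R)
```

After the loop, `x` holds the filtered valence and arousal together with estimates of their rates of change, which are never observed directly. Every operation in `kalman_step` is differentiable, which is the property that allows such a filter to be trained jointly with a network by backpropagation.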

Abstract:
With the increasing popularity of linear regression approaches, most of the existing minimization problems are related to several convex measurements, e.g., the $\ell_1$/$\ell_2$/$\ell_{2,1}$-norm of a vector and the $L_1$/$L_{2,1}$/Frobenius/nuclear norm of a matrix, where the regularized function and the loss function are usually studied for two objective terms case by case, respectively. To address this issue, this work combines these linear regression problems into a unified expression framework by employing an adaptive and flexible function, in which we need to choose different variable elements and properly adjust an inner parameter. Besides this, they are equipped with some corresponding relationships and their interesting properties. Intuitively speaking, the proposed framework can generalize several traditional linear regression formulations and even more complex ones into an extended representation. For further optimizations, an iteratively re-weighted penalty solution (IRwPS) is devised without any inner loops, making the iteration programming easy to perform. Meanwhile, the theoretical results are provided for guaranteeing that the mathematical convergence analysis is solid and meaningful. Finally, by performing real-world applications in supervised, unsupervised, and semi-supervised tasks, numerical experiments are conducted to validate the theoretical properties and the superiority over some of the state-of-the-art.
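
The general idea of iterative re-weighting without inner loops can be illustrated with classical iteratively re-weighted least squares (IRLS) for an ℓ1-loss regression. This is a textbook stand-in, not the authors' IRwPS, whose update rules are not given in the abstract. The function name `irls_l1_regression`, the smoothing constant `eps`, and the toy data are assumptions.

```python
import numpy as np


def irls_l1_regression(X, y, n_iter=50, eps=1e-6):
    """Approximately minimize ||X w - y||_1 by re-weighted least squares.

    Each iteration solves one weighted least-squares problem in closed
    form, so there is no inner optimization loop.
    """
    w = np.linalg.lstsq(X, y, rcond=None)[0]   # ordinary least-squares start
    for _ in range(n_iter):
        r = y - X @ w
        # Weight 1/|r_i| turns the squared loss into an absolute loss;
        # eps guards against division by zero for exactly fitted points.
        d = 1.0 / np.maximum(np.abs(r), eps)
        XtD = X.T * d                          # equals X.T @ diag(d)
        w = np.linalg.solve(XtD @ X, XtD @ y)
    return w


# Toy data: y = 2 + 3 t with two gross outliers.
t = np.arange(20, dtype=float)
X = np.column_stack([np.ones_like(t), t])
y = 2.0 + 3.0 * t
y[3] += 50.0
y[15] -= 40.0

w_ls = np.linalg.lstsq(X, y, rcond=None)[0]    # pulled away by the outliers
w_l1 = irls_l1_regression(X, y)                # close to the true (2, 3)
```

On this toy problem the least-squares intercept is pulled several units away from 2 by the two outliers, while the re-weighted solution settles near the true coefficients. Swapping the weight formula is what lets one loop serve different norms, which is the kind of unification the abstract describes.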

Abstract:
Panoramic image quality assessment (PIQA) is crucial to the successful application of technologies that can provide immersive visual experience. Stitching distortions are one of the main types of distortions that result in panoramic image degradation. However, most existing PIQA methods are general-purpose ones, which ignore the special characteristics of the stitching distortions caused by imperfect stitching algorithms. This results in unsatisfactory performance. To this end, we propose an effective stitched PIQA method, which consists of an imaginary reference generation (IRG) module and a hierarchical quality prediction (HQP) module. Among them, the IRG module is proposed to mimic the capability of the human visual system in imagining the raw version in the face of a degraded image. For the IRG module learning, we construct a large-scale database. The HQP module is presented to adapt to the particularity and complexity of stitching distortions, which is achieved by the pyramid feature aggregation. Extensive experiments and comparisons have been performed on the stitched PIQA database and the experimental results demonstrate the superiority of the proposed method in evaluating the quality of stitched panoramic images.

Abstract:
A single superimposed image containing two image views causes visual confusion for both human vision and computer vision. Human vision needs a “develop-then-rival” process to decompose the superimposed image into two individual images, which effectively suppresses visual confusion. However, separating individual image views from a single superimposed image has been an important but challenging task in computer vision area for a long time. In this paper, we propose a human vision-inspired framework for single superimposed image decomposition. We first propose a network to simulate the development stage, which tries to understand and distinguish the semantic information of the two layers of a single superimposed image. To further simulate the rivalry activation/suppression process in human brains, we carefully design a rivalry stage, which incorporates the original mixed input (superimposed image), the activated visual information (outputs of the development stage) together, and then rivals to get images without ambiguity. Experimental results show that our novel framework effectively separates the superimposed images and significantly improves the performance with better output quality compared with state-of-the-art methods. The proposed method also achieves state-of-the-art results on related applications including single image reflection removal, single image rain removal, single image shadow removal, and illumination correction, etc., which validates the generalization of the framework.

Abstract:
Pseudo-label-based methods of unsupervised domain adaptation (UDA) can transfer the knowledge learned from a labeled source domain to an unlabeled target domain and have recently achieved significant progress in the application of person reidentification (re-ID). However, these methods suffer from serious label noise problems that downgrade the retrieval performance in UDA person re-ID. The mutual teaching framework (MTF) with dual networks attempts to tackle this problem by generating reliable soft pseudo labels but results in a mutual convergence problem. In this paper, a novel DiveRsity EnlArged Mutual Teaching framework (DREAMT) is proposed to solve the problem mentioned above. Based on the primary mutual-mean-teaching mechanism, two strategies are developed in DREAMT, that is, GAN-based source domain augmentation (GSDA) and cross-branch mutual supervision (CBMS) for dual networks. Specifically, GSDA exploits two GANs to augment source domain datasets in different ways for pre-training to improve the pre-trained models’ performance and enlarge the diversity at the beginning of target domain adaptation. During target adaptation, each network in MTF adopts two branches to extract different features. CBMS, based on hard and soft pseudo labels, operates across branches and networks and can help to maintain the diversity between training peers in the whole training process. Extensive experiments have demonstrated that our proposed DREAMT framework achieves better mAP and CMC performance than the existing mutual teaching methods and outperforms various state-of-the-art methods in UDA person re-ID tasks.

Abstract:
With the rapid increase of large-scale and real-world person datasets, it is crucial to address the problem of long-tailed data distributions, i.e., head classes have a large number of images while tail classes have extremely few samples. We observe that the imbalanced data distribution is likely to distort the overall feature space and impair the generalization capability of trained models. Nevertheless, this long-tailed problem has been rarely investigated in previous person Re-Identification (ReID) works. In this paper, we propose a novel Long-Tailed Re-Identification (LTReID) framework to simultaneously alleviate class-imbalance and hard-imbalance problems. Specifically, each real feature is decomposed into multiple independent components with two decorrelation losses. Then these components are randomly aggregated to generate more fake features for tail classes than head ones, resulting in the class-balance between head and tail classes. For the hard-balance between easy and hard samples, we utilize adversarial learning to generate more hard features than easy ones. The proposed framework can be trained in an end-to-end manner and avoids increasing the space and time complexity of inference models. Moreover, comprehensive experiments are conducted on four ReID datasets so as to validate the effectiveness of the overall framework and the advantage of each module. Our results show that when trained with either balanced or imbalanced datasets, the LTReID achieves superior performance over the state-of-the-art methods.

Abstract:
Reconstructing a 3D human body mesh from a monocular image is a challenging inverse problem because of occlusion and complicated human articulations. Recent deep learning-based methods have made significant progress in single-image human reconstruction. Most of these works are either model-based methods or model-free methods. However, model-based methods always suffer detail losses due to the limited parameter space, and model-free methods struggle to directly recover satisfactory results from images due to the use of a shared global feature for all vertices and the domain gap between 2D regular images and 3D irregular meshes. To resolve these issues, we propose a hybrid model, which combines the advantages of both the model-based approach and the model-free approach to estimate a 3D human mesh in a coarse-to-fine manner. Initially, we utilize a convolutional neural network (CNN) to estimate the parameters of a Skinned Multi-Person Linear Model (SMPL), which allows us to generate a coarse human mesh. After that, the vertex coordinates of the coarse human mesh are further refined by a graph convolutional neural network (GCN). Unlike previous GCN-based methods, whose vertex coordinates are recovered from a shared global feature, we propose a LOcal CorRespondence-Aware (LOCRA) module to extract local special features for each vertex. To make the local features related to the human pose, we also add a keypoint-related loss to supervise the training process of the LOCRA module. Experiments demonstrate that our hybrid model with the LOCRA module outperforms existing methods on multiple public benchmarks.

Abstract:
Human silhouette segmentation, which was originally defined in computer vision, has achieved promising results for understanding human activities. However, physical limitations make existing systems based on optical cameras suffer from severe performance degradation under low illumination, smoke, and/or opaque obstruction conditions. To overcome such limitations, in this paper, we propose to utilize radio signals, which can traverse obstacles and are unaffected by lighting conditions, to achieve silhouette segmentation. The proposed RFMask framework is composed of three modules. It first transforms RF signals captured by millimeter wave radar on two planes into the spatial domain and suppresses interference with the signal processing module. Then, it locates human reflections on RF frames and extracts features from surrounding signals with the human detection module. Finally, the extracted features from RF frames are aggregated with an attention-based mask generation module. To verify our proposed framework, we collect a dataset containing 804,760 radio frames and 402,380 camera frames with human activities under various scenes. Experimental results show that the proposed framework can achieve impressive human silhouette segmentation even under challenging scenarios (such as low light and occlusion scenarios) where traditional optical-camera-based methods fail. To the best of our knowledge, this is the first investigation towards segmenting human silhouettes based on millimeter wave signals. We hope that our work can serve as a baseline and inspire further research that performs vision tasks with radio signals. The dataset and code will be made public.

Abstract:
Recently, blind image quality assessment (BIQA) models based on deep neural networks (DNNs) have achieved impressive performance on existing datasets. However, due to the intrinsic imbalance property of the training set, not all distortions or images are handled equally well. Online hard example mining (OHEM) is a promising way to alleviate this issue. Inspired by the recent finding that network pruning disproportionately hampers the model's memorization of a tractable subset, e.g., atypical, low-quality, long-tailed samples, which are hard-to-memorize during training and easily “forgotten” during pruning, we propose an effective “plug-and-play” OHEM pipeline for generalizable deep BIQA. Specifically, we train two parallel weight-sharing branches simultaneously, where one is the full model and the other is a “self-competitor” generated from the full model online by network pruning. Then, we leverage the prediction disagreement between the full model and its pruned variant (i.e., the self-competitor) to expose easily “forgettable” samples, which are therefore regarded as the hard ones. We enforce the prediction consistency between the full model and its pruned variant to implicitly put more focus on these hard samples, which helps the full model recover the forgettable information lost through pruning. Extensive experiments across multiple datasets and BIQA models demonstrate that the proposed OHEM can improve the model performance and generalizability as measured by correlation numbers and group maximum differentiation (gMAD) competition.

Abstract:
This article focuses on generating object locations in a given image while only using image-level annotations. Towards this end, we present a simple and effective training-free framework, named Dual-Gradients Localization (DGL) framework. The key idea of the proposed DGL framework is to leverage two kinds of gradients to achieve precise localization on any convolutional layer of a classification model during the testing stage. Concretely, the DGL framework is developed based on two branches: 1) Pixel-level Class Selection, leveraging gradients of the target class to identify the correlation ratio of pixels to the target class within any convolutional feature maps, and 2) Class-aware Enhanced Maps, utilizing linear relationship in gradients of the classification loss function to mine entire target object regions. To further polish the details of objects, we apply the skip-layer connections to the classification model, which concatenates the high- and low-level layers to achieve classification. In such a case, DGL with Skip-layer Connections (DGL-SC) can capture more edge information on the high-level layer. In addition, we propose a Localization Maps Selection method to evaluate the quality of the localization map and provide a way for automatically selecting localization maps produced on different layers. Extensive experiments on public ILSVRC and CUB-200-2011 datasets show the effectiveness of the proposed DGL framework. Especially, our DGL-SC obtains a new state-of-the-art gt-known localization error of 27.35% on the ILSVRC benchmark.

Abstract:
Multi-view subspace clustering aims to utilize the comprehensive information of multi-source features to aggregate data into multiple subspaces. Recently, low-rank tensor learning has been applied to multi-view subspace clustering, which explores high-order correlations of multi-view data and has achieved remarkable results. However, these existing methods have certain limitations: 1) The learning processes of low-rank tensor and label indicator matrix are independent. 2) Variable contributions of different views to the consistent clustering results are not discriminated. To handle these issues, we propose a unified framework that integrates low-rank tensor learning and spectral embedding (ULTLSE) for multi-view subspace clustering. Specifically, the proposed model adopts the tensor singular value decomposition (t-SVD) based tensor nuclear norm to encode the low-rank property of the self-representation tensor, and a label indicator matrix via spectral embedding is simultaneously exploited. To distinguish the importance of various views, we learn a quantifiable weighting coefficient for each view. An effective recursion optimization algorithm is also developed to address the proposed model. Finally, we conduct comprehensive experiments on eight real-world datasets with three categories. The experimental results indicate that the proposed ULTLSE is advanced over existing state-of-the-art clustering methods.
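
The t-SVD based tensor nuclear norm mentioned above can be sketched in a few lines of NumPy. This follows one common definition (FFT along the third mode, then the sum of singular values of every frontal slice, scaled by the number of slices); the scaling convention varies across papers and is an assumption here, as are the function name `tensor_nuclear_norm` and the toy tensor. The full ULTLSE optimization is not modeled.

```python
import numpy as np


def tensor_nuclear_norm(T):
    """t-SVD based tensor nuclear norm of a 3-way array T (n1 x n2 x n3).

    Computed in the Fourier domain: sum of the singular values of each
    frontal slice of fft(T, axis=2), divided by n3 (assumed convention).
    """
    Tf = np.fft.fft(T, axis=2)
    n3 = T.shape[2]
    total = 0.0
    for k in range(n3):
        total += np.linalg.svd(Tf[:, :, k], compute_uv=False).sum()
    return total / n3


# Hypothetical example: three views share the same 2x2 self-representation
# matrix, stacked along the third mode.
M = np.diag([3.0, 4.0])
Z = np.stack([M, M, M], axis=2)
tnn = tensor_nuclear_norm(Z)
```

When all views agree, only the zero-frequency slice is non-zero, so the tensor norm reduces to the matrix nuclear norm of the shared slice (3 + 4 = 7 here). Disagreement between views adds energy to the other frequency slices and raises the norm, which is why minimizing it encourages cross-view consistency.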

Abstract:
Just Noticeable Distortion (JND) finds the minimum distortion level perceivable by humans. This can be a natural solution for setting the compression for each video region in perceptual video coding. However, existing JND-based solutions estimate JND levels for each video frame and ignore the fact that different video regions have different perceptual importance. To address this issue, we propose a Block-Level Just Noticeable Distortion-based Perceptual (BL-JUNIPER) framework for video coding. The proposed four-stage framework combines different perceptual information to further improve the prediction accuracy. The JND mapping in the first stage derives block-level JNDs from frame-level information without the need to collect a new block-level JND dataset. In the second stage, an efficient CNN-based model is proposed to predict JND levels for each block according to spatial and temporal characteristics. Unlike existing methods, BL-JUNIPER works on raw video frames and avoids re-encoding each frame several times, making it computationally practical. Third, the visual importance of each block is measured using a visual attention model. Finally, a proposed quantization control algorithm uses both JND levels and visual importance to adjust the Quantization Parameter (QP) for each block. The specific algorithm for each stage of the proposed framework can be changed, as long as the input and output formats of each block are followed, without the need to change other stages, based on any current or future methods, providing a flexible and robust solution. Extensive experimental results demonstrate that BL-JUNIPER achieves a mean bitrate reduction of 27.75% with a Delta Mean Opinion Score (DMOS) close to zero and BD-Rate gains of 25.44% based on MOS, compared to the baseline encoding, and also achieves better performance than competing methods.

Abstract:
Modern crowd counting methods in natural scenes, even when video datasets are available, are mostly based on images. Because of background interference or occlusion in the scene, these methods can easily lead to abrupt changes and instability in density prediction. There has been minimal research on how to exploit the inherent consistency among adjacent frames to achieve high estimation accuracy on video sequences. In this study, we explore the long-term global temporal consistency in the video sequence and propose a novel Global Representation Guided Adaptive Fusion Network (GRGAF) for video crowd counting. The primary aim is to establish a long-term temporal representation among consecutive frames to guide the density estimation of local frames, which can alleviate the prediction instability caused by background noise and occlusions in crowd scenes. Moreover, in order to further enforce the temporal consistency, we apply the generative adversarial learning scheme and design a global-local joint loss, which can make the estimated density maps more temporally coherent. Extensive experiments on four challenging video-based crowd counting datasets (FDST, DroneCrowd, MALL and UCSD) demonstrate that our method makes effective use of the spatio-temporal information of video and outperforms other state-of-the-art approaches.

Abstract:
Visual relationship understanding plays an indispensable role in grounded language tasks like visual question answering (VQA), which often requires precisely reasoning about relations among objects depicted in the given question. However, prior works generally suffer from the deficiencies as follows, (1) spatial-relation inference ambiguity, it is challenging to accurately estimate the distance of a pair of visual objects in 2D space if there is a visual-overlap between their 2D bounding-boxes, and (2) language-visual relational alignment missing, it is insufficient to generate a high-quality answer to the question if there is a lack of alignment in the language-visual relations of objects during fusion, even using a powerful fusion model like Transformer. To this end, we first model the spatial relation of a pair of objects in 3D space by augmenting the original 2D bounding-box with 1D depth information, and then propose a novel model named Depth-aware Semantic Guided Relational Attention Network (DSGANet), to explicitly exploit the formed 3D spatial relations of objects in an intra-/inter-modality manner for precise relational alignment. Extensive experiments conducted on the benchmarks (VQA v2.0 and GQA) demonstrate DSGANet achieves competitive performance compared to pretrained and non-pretrained models, such as 72.7% vs. 74.6% based on the learned grid features on VQA v2.0.
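
As a rough illustration of the depth augmentation described in the abstract above, the following sketch lifts each 2D bounding-box center to a 3D point by appending one depth value and then compares the 2D and 3D distances. The `Box` record, the use of box centers, and the single per-object depth value are assumptions made for illustration only; the abstract does not specify how DSGANet encodes the relation.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical object record: 2D box corners plus one depth value."""
    x1: float
    y1: float
    x2: float
    y2: float
    depth: float  # assumed per-object depth, in the same units as x and y

    def center(self):
        # Center of the 2D bounding box
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)


def distance_2d(a: Box, b: Box) -> float:
    """Euclidean distance between box centers in the image plane."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return math.hypot(ax - bx, ay - by)


def distance_3d(a: Box, b: Box) -> float:
    """Euclidean distance after appending depth as a third coordinate."""
    (ax, ay), (bx, by) = a.center(), b.center()
    return math.sqrt((ax - bx) ** 2 + (ay - by) ** 2 + (a.depth - b.depth) ** 2)


# Two objects whose 2D boxes coincide but which sit at different depths
near = Box(10, 10, 50, 50, depth=1.0)
far = Box(10, 10, 50, 50, depth=4.0)
print(distance_2d(near, far))  # the 2D distance cannot separate them
print(distance_3d(near, far))  # the depth coordinate does
```

In this toy case the two boxes overlap completely, so only the depth coordinate distinguishes the objects, which is the ambiguity the abstract attributes to purely 2D reasoning.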

Abstract:
Most of the existing image selection-based coverless image steganography methods mainly focus on improving the capacity and robustness under the assumption that the corresponding dataset is available. But they ignore how to successfully construct the coverless image dataset, which is the foundation of such methods and has a critical impact on the capacity. In this paper, a coverless image steganography is proposed that considers how to efficiently construct the coverless image dataset. In the proposed method, the CNN-based deep hash is extracted from the image and a specific mapping rule is designed to map the high-dimensional deep hash to the low-dimensional secret message. In addition, an unsupervised clustering algorithm is adopted to construct the coverless image dataset, which makes the construction of the coverless image dataset efficient and improves the robustness of the proposed steganography method. To our best knowledge, this is the first attempt to improve the construction efficiency of the coverless image dataset in the field of coverless image steganography. Experimental results show that the construction of a large coverless image dataset is feasible and reliable, and the proposed method has better robustness and higher dataset utilization rate compared with the state-of-the-art methods.

Abstract:
With the development of computer vision, the semantic segmentation of remote sensing images, which has become an important topic, has been utilized in various applications for image content analysis and understanding, such as urban planning, natural disaster monitoring, and land resource management. Many approaches have been proposed to address these problems. However, due to obvious differences in resolution, spatial structure, and semantics between remote sensing images and ordinary images, the semantic segmentation of remote sensing images is still challenging. In this paper, we propose a novel multiscale image generation network (MIGN) that can efficiently generate high-resolution segmentation results by considering both details and boundary information. In particular, a multi-attention mechanism method for semantic segmentation of remote sensing images is designed. The attention weight is calculated by capturing the interaction of cross dimensions in a two-branch structure, which can learn the underlying feature information and guarantee the performance of each pixel feature for final classification. We also propose an edge supervised module to ensure that the segmentation boundary has a more accurate performance. A multiscale image fusion algorithm based on the Bayes model is proposed to improve the accuracy of the segmentation module. The performance of our model is evaluated on the ISPRS Vaihingen and Potsdam datasets. The results show that our method is superior to the most advanced image segmentation methods in terms of MIoU and pixel accuracy.

Abstract:
In this paper, we propose a novel deep decomposition approach based on Retinex theory for multi-exposure image fusion, termed as DMEF. According to the assumption of Retinex theory, we firstly decompose the source images into illumination and reflection maps by the data-driven decomposition network, among which we introduce the pathwise interaction block that reactivates the deep features lost in one path and embeds them into another path. Therefore, loss of illumination and reflection features during decomposition can be effectively suppressed. And then the high dynamic range illumination map could be obtained by fusing the separated illumination maps in the fusion network. Thus, the reconstructed details in under-exposed and over-exposed regions will be clearer with the help of the fused reflection map which contains complete high-frequency scene information. Finally, the fused illumination and reflection maps are multiplied pixel-by-pixel to obtain the final fused image. Moreover, to retain the discontinuity in the illumination map where gradient of reflection map changes steeply, we introduce the structure-preservation smoothness loss function to retain the structure information and eliminate visual artifacts in these regions. The superiority of our proposed network is demonstrated by applying extensive experiments compared with other state-of-the-art fusion methods subjectively and objectively.
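
The final reconstruction step in the abstract above, where the fused illumination and reflection maps are multiplied pixel by pixel, can be sketched as follows. This is a minimal sketch that assumes both maps are same-sized 2D lists of floats in [0, 1]; the function name `fuse_retinex` is hypothetical, and the decomposition and fusion networks that would produce the maps are not modeled.

```python
def fuse_retinex(illumination, reflection):
    """Multiply an illumination map and a reflection map element by element."""
    if len(illumination) != len(reflection):
        raise ValueError("maps must have the same number of rows")
    fused = []
    for row_l, row_r in zip(illumination, reflection):
        if len(row_l) != len(row_r):
            raise ValueError("maps must have the same number of columns")
        # Retinex model: image = illumination * reflection, per pixel
        fused.append([l * r for l, r in zip(row_l, row_r)])
    return fused


# Toy 2x2 maps standing in for the outputs of the fusion stage
illumination = [[0.5, 1.0], [0.25, 0.0]]
reflection = [[0.8, 0.4], [0.4, 0.9]]
image = fuse_retinex(illumination, reflection)
print(image)
```

A dark illumination value suppresses the pixel regardless of its reflection value, which is why the abstract fuses the illumination maps of differently exposed inputs before this multiplication.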

Abstract:
Object co-segmentation (CSG) is to segment the common objects of the same category in multiple relevant images while the co-saliency detection (CSD) aims to discover the salient and common foreground objects in a group of images. To process both tasks simultaneously, this paper presents an adaptive spatially and high-order semantically modulated deep network framework. A backbone network is first adopted to extract multi-resolution image features. With the multi-resolution features of the relevant images as input, we design an adaptive spatial modulator to learn a spatial representation that can highlight the co-object regions for each image. The adaptive spatial modulator fully captures the rich correlations of all image feature descriptors via unsupervised clustering and a graph aggregation strategy. The learned representation can well localize the common foreground object while effectively suppressing the background signals. For the high-order semantic modulator, we model it as a supervised image classification task. We propose a hierarchical high-order pooling module to learn the rich semantic features for classification use. The outputs of the two modulators manipulate the multi-resolution features by a shift-and-scale operation so that the features focus on segmenting common object regions. The proposed model is trained end-to-end without any intricate post-processing. Extensive experiments on three CSG benchmark datasets (MSRC, i-Coseg, and PASCAL-VOC) and three CSD datasets (Cosal2015, CoCA, and CoSOD3k) demonstrate the superior accuracy of the proposed method compared to state-of-the-art methods on both tasks.

Abstract:
Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. According to their respective characteristics, the scheme of independently designed architecture has been widely used in correspondence to each single task. This may lead to the representation learned by the model being task-specific, and inevitably result in the lack of generalization ability of the feature based on multi-modal modeling. More recent studies have shown that establishing cross-modal relationships between auditory and visual streams is a promising solution for the challenge of audio-visual multi-task learning. Therefore, as a motivation to bridge the multi-modal associations in audio-visual tasks, a unified framework is proposed to achieve target speaker detection and speech enhancement with joint learning of audio-visual modeling in this study. With the assistance of audio-visual channels of videos in challenging real-world scenarios, the proposed method is able to exploit inherent correlations in both audio and visual signals, which are used to further anticipate and model the temporal audio-visual relationships across spatial-temporal space via a cross-modal conformer. In addition, a plug-and-play multi-modal layer normalization is introduced to alleviate the distribution misalignment of multi-modal features. Based on cross-modal circulant fusion, the proposed model is capable of learning all audio-visual representations in a holistic process. Substantial experiments demonstrate that the correlations between different modalities and the associations among diverse tasks can be learned by the optimized model more effectively. In comparison to other state-of-the-art works, the proposed work shows superior performance for active speaker detection and audio-visual speech enhancement on three benchmark datasets, and also generalizes favorably across diverse challenges.

Abstract:
Tone mapping operators (TMO) are functions that map high dynamic range (HDR) images to a standard dynamic range (SDR), while aiming to preserve the perceptual cues of a scene that govern its visual quality. Despite the increasing number of studies on quality assessment of tone mapped images, current subjective quality datasets have relatively small numbers of images and subjective opinions. Moreover, existing challenges in transferring laboratory experiments to crowdsourcing platforms put a barrier for collecting large-scale datasets through crowdsourcing. In this work, we address these challenges and propose the RealVision-TMO (RV-TMO), a large-scale tone mapped image quality dataset. RV-TMO contains 250 unique HDR images, their tone mapped versions obtained using four TMOs and pairwise comparison results from seventy unique observers for each pair. To the best of our knowledge, this is the largest dataset available in the literature for quality evaluation of TMOs by the number of tone mapped images and number of annotations. Furthermore, we provide a content selection strategy to identify interesting and challenging HDR images. We also propose a novel methodology for observer screening in pairwise experiments. Our work does not only provide annotated data to benchmark existing objective quality metrics, but also paves the path to building new metrics for tone mapping quality evaluation.

Abstract:
Multimodal sentiment analysis aims to extract emotions from multiple data sources, usually under the assumption that all modalities are available. In practice, such a strong assumption does not always hold, and most multimodal sentiment analysis methods may fail when some modalities are missing. Some existing works have started to address the missing modality problem, but they consider only the case of a single missing modality and ignore the more general practical cases in which multiple modalities are missing. To this end, in this paper, we propose a Tag-Assisted Transformer Encoder (TATE) network to handle the problem of missing uncertain modalities. Specifically, we design a tag encoding module to cover both the single modality and multiple modalities missing cases, so as to guide the network's attention to those missing modalities. Besides, a new space projection pattern is adopted to align common vectors, taking into account the different importance of each modality. Afterwards, a Transformer encoder-decoder network is utilized to learn the missing modality features, and the outputs of the Transformer encoder are extracted for the final sentiment classification. Extensive experiments and analyses are conducted on CMU-MOSI, IEMOCAP, and MELD datasets, which show that the proposed method can achieve significant improvements compared with several baselines.

Affiliations: Department of Mathematics and Theories, Peng Cheng Laboratory (PCL), Shenzhen, China; Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, Shenzhen, China; Multimedia Laboratory, Beijing Bytedance Technology Company, Ltd., Beijing, China; Division of Information Science and Technology, Tsinghua Shenzhen International Graduate School, Shenzhen, China; Department of Broadband Communication, Peng Cheng Laboratory (PCL), Shenzhen, China; Performance Engineering Laboratory, Dublin City University, Dublin, Ireland

Abstract:
In the context of the latest growing popularity of live video streaming, ensuring high video quality has become one of the most significant challenges faced by all live streaming platforms. Insufficient uplink bandwidth is an important factor that influences these live video transmissions, affecting their bitrate and latency and consequently the associated video streaming quality. This paper proposes a novel flexible super-resolution-based video coding and uploading framework (FlexSRVC) that improves the quality of live video streaming in limited uplink network bandwidth conditions. FlexSRVC includes a flexible video coding scheme, which compresses high-resolution key and non-key video frames to a lower bitrate in order to reduce the upload delay. A new flexible bitrate adaptation algorithm is also proposed to select dynamically the number of frames to be compressed and the compression ratio by jointly considering uplink network conditions and available cloud computing resources. Trace-driven emulations demonstrate that FlexSRVC provides the same quality while reducing up to 25% of the required bandwidth compared to the original encoding method (H.264). FlexSRVC improves users' QoE by at least 50% compared to a super resolution-based method which employs reconstruction of all video frames in uplink bandwidth constrained conditions.

Abstract:
Text style transfer is an important task to render artistic texts from a reference image or style, and is widely desired in many visual creations. Previous works have brought some efficient methods for text style transfer, which facilitate users to design various artistic texts automatically. However, these works mainly focus on relatively simple text effects, and do not perform well on complex reference styles. In this paper, we propose a coarse-to-fine framework to generate exquisite texts with complex texture and structure in an unsupervised way, achieving real-time control of style scales (i.e., text stylistic degree or deformation degree). The key idea is to decouple the overall task into two steps, prototype generation and detail refinement, and explore delicate networks for each step to imitate the features at different levels. Based on this idea, in the first step, we present a novel pro-gen GAN to generate prototypes of artistic texts using the reference style, and develop a deformable module to empower the pro-gen GAN to continuously characterize the multi-scale shape features without network retraining. Furthermore, we propose a mix-attention training scheme for text style transfer, which can avoid artifacts and retain a clear text background. In the second step, we introduce two optimized networks for detail refinements. Experimental results show that the proposed method can synthesize exquisite stylized texts with complex reference styles, and surpass the state of the arts in texture reconstruction, contour imitation, and text image quality drastically.

Abstract:
Fine-grained image classification attempts to accurately classify images that are similar to each other. Multiview information is often used to improve the classification accuracy. Although great progress has been made, fine-grained image classification methods still have two drawbacks. On the one hand, they often treat each image independently without considering image correlations within the same class along with the distinctive characteristics of each image. On the other hand, multiview correlations are often used during classifier training, leaving the correlations of different views unconsidered. To solve these two problems, in this paper, we propose a novel fine-grained image classification method by class and image-specific decomposition with multiviews (CISD-MV). For each view, we treat images of the same class jointly by decomposing the class and image-specific information. Since images of different classes are similar and correlated, we linearly model class correlations of images using decomposed low-rank parts. In addition, for each image, the representations of different views are correlated, and we use linear transformation to model view correlations. We jointly optimise for the class and image-specific components along with the class correlation and view correlation transformation matrices. A testing image is assigned to the class that has the minimum summed reconstruction error. We conduct fine-grained image classification experiments on several public fine-grained image datasets. Experimental results and analysis show the effectiveness of the proposed method.
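
The decision rule stated in the abstract above, assigning a test image to the class with the minimum summed reconstruction error, can be sketched as below. The sketch assumes the per-view, per-class reconstruction errors have already been computed and that the sum runs over views; the function name `assign_class` and the error values are hypothetical.

```python
def assign_class(errors_by_view):
    """Sum reconstruction errors over views and return the best class.

    errors_by_view: list with one dict per view, mapping class label -> error.
    Returns (label with the smallest summed error, dict of summed errors).
    """
    totals = {}
    for view_errors in errors_by_view:
        for label, err in view_errors.items():
            totals[label] = totals.get(label, 0.0) + err
    best = min(totals, key=totals.get)
    return best, totals


# Made-up errors for two views and three candidate classes
errors = [
    {"sparrow": 0.9, "finch": 0.4, "wren": 0.7},  # view 1
    {"sparrow": 0.2, "finch": 0.5, "wren": 0.6},  # view 2
]
label, totals = assign_class(errors)
print(label, totals)
```

Here "sparrow" wins in view 2 alone, but "finch" has the smallest error once both views are summed, which shows how combining views can change the decision.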

Abstract:
Bandwidth prediction is critical in any Real-time Communication (RTC) service or application. This component decides how much media data can be sent in real time. Subsequently, the video and audio encoder dynamically adapts the bitrate to achieve the best quality without congesting the network and causing packets to be lost or delayed. To date, several RTC services have deployed the heuristic-based Google Congestion Control (GCC), which performs well under certain circumstances and falls short in some others. In this paper, we leverage the advancements in reinforcement learning and propose BoB (Bang-on-Bandwidth), a hybrid bandwidth predictor for RTC. At the beginning of the RTC session, BoB uses a heuristic-based approach. It then switches to a learning-based approach. BoB predicts the available bandwidth accurately and improves bandwidth utilization under diverse network conditions compared to the two winning solutions of the ACM MMSys'21 grand challenge on bandwidth estimation in RTC. An open-source implementation of BoB is publicly available for further testing and research.
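
The hybrid behaviour described in the abstract above, a heuristic estimate at session start followed by a switch to a learned estimate, can be sketched as follows. This is not the published implementation: the class `HybridPredictor`, the fixed `warmup_steps` switching rule, and both stand-in estimators are assumptions made for illustration, since the abstract does not say when or how the switch happens.

```python
from typing import Callable, Sequence

Estimator = Callable[[Sequence[float]], float]


class HybridPredictor:
    """Use a heuristic for the first `warmup_steps` calls, then a learned model."""

    def __init__(self, heuristic: Estimator, learned: Estimator, warmup_steps: int):
        self.heuristic = heuristic
        self.learned = learned
        self.warmup_steps = warmup_steps  # assumed switching rule
        self.step = 0

    def predict(self, recent_throughput_kbps: Sequence[float]) -> float:
        use_heuristic = self.step < self.warmup_steps
        self.step += 1
        model = self.heuristic if use_heuristic else self.learned
        return model(recent_throughput_kbps)


def conservative_heuristic(samples: Sequence[float]) -> float:
    # Stand-in for a rule-based estimator: 75% of the latest sample
    return 0.75 * samples[-1]


def stub_learned_model(samples: Sequence[float]) -> float:
    # Stand-in for a trained policy: mean of the recent samples
    return sum(samples) / len(samples)


predictor = HybridPredictor(conservative_heuristic, stub_learned_model, warmup_steps=2)
history = [1000.0, 1200.0, 800.0]
estimates = [predictor.predict(history) for _ in range(3)]
print(estimates)  # two heuristic estimates, then one learned estimate
```

Keeping both estimators behind the same callable signature makes the switch a one-line decision, and a real system could replace the step counter with any other trigger.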

Abstract:
Low-rank tensor completion has been widely used in computer vision and machine learning. This paper develops a novel multimodal core tensor factorization (MCTF) method combined with a tensor low-rankness measure and a better nonconvex relaxation form of this measure (NC-MCTF). The proposed models encode low-rank insights for general tensors provided by Tucker and T-SVD and thus are expected to simultaneously model spectral low-rankness in multiple orientations and accurately restore the data of intrinsic low-rank structure based on few observed entries. Furthermore, we study the MCTF and NC-MCTF regularization minimization problem and design an effective block successive upper-bound minimization (BSUM) algorithm to solve them. Theoretically, we prove that the iterates generated by the proposed models converge to the set of coordinatewise minimizers. This efficient solver can extend MCTF to various tasks such as tensor completion. A series of experiments including hyperspectral image (HSI), video and MRI completion confirm the superior performance of the proposed method.

Abstract:
Parkinson's disease (PD) is a neurodegenerative disease which is prevalent among the elder population and severely affects the life quality of patients and their families. Therefore, it is important to conduct an early diagnosis for potential patients with PD, so as to promote prompt treatment and avoid the aggravation of the disease. Recently, the in-vitro PD diagnosis based on facial expressions has received increasing attention because of its distinguishability (i.e., PD patients always possess the characteristics of “masked face”) and affordability. However, the performance of the existing facial expression-based PD diagnosis approaches is limited by: 1) the small-scale training data on PD patients' facial expressions, and 2) the weak prediction model. To address these two problems, we propose a new facial expression guided PD diagnosis method based on high-quality training data augmentation and deep neural network prediction. Specifically, the proposed method consists of three stages: Firstly, we synthesize virtual facial expression images with 6 basic emotions (i.e., anger, disgust, fear, happiness, sadness, and surprise) based on multi-domain adversarial learning to approximate the premorbid expressions of PD patients. Secondly, we introduce three facial image quality assessment (FIQA) criteria to measure the quality of these synthesized facial expression images and design a fusion screening strategy that shortlists the high-quality ones to augment the training data. Finally, we train a deep neural network prediction model based on the original and synthesized high-quality facial expression images for PD diagnosis. To show real-world impacts and evaluate the proposed method under different facial expressions, we also create a (currently largest) multiple facial expressions-based PD face dataset in collaboration with a hospital. Extensive experiments are performed to demonstrate the effectiveness of the multi-domain adversarial learning-based facial expression synthesis and the fusion screening strategy, particularly the superior performance of the proposed method for PD diagnosis.

Abstract:
Volumetric video provides a more immersive holographic virtual experience than conventional video services such as 360-degree and virtual reality (VR) videos. However, due to ultra-high bandwidth requirements, existing compression and transmission technology cannot handle the delivery of real-time volumetric video. Unlike traditional compression methods and the approaches that extend 360-degree video streaming, we propose AITransfer, an AI-powered compression and semantic-aware transmission method for point cloud video data (a popular volumetric data format). AITransfer targets the semantic-level communication beyond transmitting raw point cloud video or compressed video with two outstanding contributions: (1) designing an integrated end-to-end architecture with two fundamental contents of feature extraction and reconstruction to reduce the bandwidth consumption and alleviate the computational pressure; and (2) incorporating the dynamic network condition into end-to-end architecture design and employing a deep reinforcement learning-based adaptive control scheme to provide robust transmission. We conduct extensive experiments on the typical datasets and develop a case study to demonstrate the efficiency and effectiveness. The results show that AITransfer can provide extremely efficient point cloud transmission while maintaining considerable user experience with more than 30.72x compression ratio under the existing network environments.

Abstract:
With the increasing popularity of 3D objects in industry and everyday life, 3D object security has become essential. While there exist methods for 3D selective encryption, where a clear 3D object is encrypted so that the result has the desired level of visual security, to our knowledge, no method exists for decrypting encrypted 3D objects hierarchically. In this paper, we are the first to propose a method which allows us to hierarchically decrypt an encrypted 3D object according to a generated ring of keys. This ring consists of a set of keys that allow a stronger or weaker decryption of the encrypted 3D object. Each hierarchically decrypted 3D object has a different visual security level, where the 3D object is more or less visually accessible. Based on a master key, these hierarchical keys are generated using our method during the encryption process. Our method is essential when it comes to preventing trade secrets from being leaked from within a company or by exterior attackers. It is also ecologically friendly and more secure than traditional selective encryption methods.

Abstract:
Visual retrieval systems face frequent model updates and deployments. Re-extracting the features of the whole database every time is a heavy workload. Feature compatibility enables the newly learned visual features to be directly compared with the old features stored in the database. In this way, when updating the deployed model, we can bypass the inflexible and time-consuming feature re-extraction process. However, the old feature space that needs to be compatible is not ideal and faces outlier samples. Besides, the new and old models may be supervised by different losses, which further causes a distribution discrepancy between these two feature spaces. In this article, we propose a global optimization Dual-Tuning method to obtain feature compatibility against different networks and losses. A feature-level prototype loss is proposed to explicitly align two types of embedding features, by transferring global prototype information. Furthermore, we design a component-level mutual structural regularization to implicitly optimize the feature intrinsic structure. Experiments are conducted on six datasets, including person ReID datasets, face recognition datasets, and million-scale ImageNet and Place365. Experimental results demonstrate that our Dual-Tuning is able to obtain feature compatibility without sacrificing performance.

Abstract:
Image Retrieval with Text Feedback (IRTF) is an emerging research topic where the query consists of an image and a text expressing a requested attribute modification. The goal is to retrieve the target images similar to the query image as modified by the query text. The existing methods usually adopt feature fusion of the query image and text to match the target image. However, they ignore two crucial issues: overfitting and low diversity of training data, which make the feature fusion based IRTF task not generalizable. Conventional generation based data augmentation is an effective way to alleviate overfitting and improve diversity, but increases the volume of training data and generation model parameters, which is bound to bring huge computation costs. By rethinking the conventional data augmentation mechanism, we propose a plug-and-play Gradient Augmentation (GA) based regularization approach. Specifically, GA contains two items: 1) To alleviate model overfitting on the training set, we deduce an explicit adversarial gradient augmentation from the perspective of adversarial training, which challenges the “no free lunch” philosophy. 2) To improve the diversity of the training set, we propose an implicit isotropic gradient augmentation from the perspective of gradient descent-based optimization, which achieves the goal of big gain but no pain. Besides, we introduce deep metric learning to train the model and provide theoretical insights of GA on generalisation. Finally, we propose a new evaluation protocol called Weighted Harmonic Mean (WHM) to assess the model generalisation. Experiments show that our GA outperforms the state-of-the-art methods by 6.2% and 4.7% on CSS and Fashion200k datasets, respectively, without bells and whistles.

Abstract:
Depth estimation attracts widespread attention in the computer vision community. However, it is still quite difficult to recover an accurate depth map using only one RGB image. We observe that existing methods tend to fail in different cases, owing to differences in network architecture, loss function and so on. In this work, we investigate this phenomenon and propose to integrate the strengths of multiple weak depth predictors to build a comprehensive and accurate depth predictor, which is critical for many real-world applications, e.g., 3D reconstruction. Specifically, we construct multiple base (weak) depth predictors by utilizing different Transformer-based and convolutional neural network (CNN)-based architectures. Transformer establishes long-range correlation while CNN preserves local information ignored by Transformer due to the spatial inductive bias. Therefore, the coupling of Transformer and CNN contributes to the generation of complementary depth estimates, which are essential to achieve a comprehensive depth predictor. Then, we design mixers to learn from multiple weak predictions and adaptively fuse them into a strong depth estimate. We refer to the resultant model as Transformer-assisted depth ensembles (TEDepth). On the standard NYU-Depth-v2 and KITTI datasets, we thoroughly explore how the neural ensembles affect the depth estimation and demonstrate that our TEDepth achieves better results than previous state-of-the-art approaches. To validate the generalizability across cameras, we directly apply the models trained on NYU-Depth-v2 to the SUN RGB-D dataset without any fine-tuning, and the superior results emphasize its strong generalizability.

Abstract:
Blind image quality assessment (BIQA), which directly evaluates image quality without a perfect-quality reference, has been a long-standing research topic. Although the existing BIQA models have achieved very encouraging performance, the lack of explainability and generalization ability limits their real-world applications to a great extent. People usually assess image quality according to semantic attributes, e.g., brightness, color, contrast, noise and sharpness. Furthermore, judgment on image quality is also impacted by the scene presented in the image. Therefore, the inherent relationship between semantic attributes and scenes is crucial for image quality assessment, yet it has rarely been explored. With this motivation, this paper presents a Semantic Attribute Reasoning based image QUality Evaluator (SARQUE). Specifically, we propose a two-stream network to predict semantic attributes and scene categories from distorted images. To investigate the inherent relationship between the semantic attributes and scene category, a semantic reasoning module is further proposed based on the graph convolution network (GCN), producing the final quality score. Extensive experiments conducted on five in-the-wild image quality databases demonstrate the superiority of the proposed SARQUE model over state-of-the-art methods. Furthermore, the proposed model features better explainability and generalization ability due to the use of semantic attributes.

Abstract:
The existing generative adversarial fusion methods generally concatenate source images or deep features, and extract local features through convolutional operations without considering their global characteristics, which tends to limit fusion performance. To address this, we propose a novel interactive compensatory attention fusion network, termed ICAFusion. In particular, in the generator, we construct a multi-level encoder-decoder network with a triple path, and design infrared and visible paths to provide additional intensity and gradient information for the concatenating path. Moreover, we develop the interactive and compensatory attention modules to communicate their pathwise information, and model their long-range dependencies through a cascading channel-spatial model. The generated attention maps focus more on infrared target perception and visible detail characterization, and are used to reconstruct the fusion image. Therefore, the generator takes full advantage of local and global features to further increase the representation ability of feature extraction and feature reconstruction. Extensive experiments illustrate that our ICAFusion obtains superior fusion performance and better generalization ability, surpassing other advanced methods in both subjective visual description and objective metric evaluation.

Abstract:
Image manipulation localization is a technique that can efficiently segment the tampered regions from a suspicious image. Existing work usually trains a detection model by fusing the features from diverse data streams, e.g., noise inconsistency, recompression inconsistency, and local inconsistency. However, it ignores the fact that not all tampered images contain these data streams. As a result, high feature redundancy may cause a large number of false detections of tampered regions. To address this problem, this paper designs an end-to-end high-confidence localization network architecture. First, deep convolutional neural networks are utilized to extract multi-scale feature sets from the RGB streams. We then design a semantic refined bi-directional feature integration module to fully fuse multi-scale adjacent features and significantly enhance feature representation. Subsequently, morphological operations are introduced to extract multi-scale edge information, which can efficiently reduce feature redundancy by generating wider high-resolution edges during image reconstruction. Finally, a deep semantic residual decoder is sequentially re-constructed by spreading deep semantic information into each decoding stage. The proposed method not only improves the manipulation localization accuracy, but also guarantees model robustness. Extensive experiments demonstrate that our method performs effectively in locating forged regions over different large-scale image sets, and outperforms most state-of-the-art methods with higher localization accuracy and stronger robustness.

Abstract:
Describing a video using natural language is an inherently one-to-many translation task. To generate diverse captions, existing VAE-based generative models typically learn factorized latent codes via one-stage training merely from stand-alone video-caption pairs. However, such a paradigm neglects set-level relationships among captions from the same video, not fully capturing the underlying multimodality of the generative process. To overcome this shortcoming, we leverage neighbouring descriptions for the same video that are articulated with noticeable topics and language variations (i.e., paraphrases). To this end, we propose a novel progressive training method by decomposing the learning of latent variables into two stages that are topic-oriented and paraphrase-oriented, respectively. Specifically, the model learns from divergent topic sentences obtained by semantic-based clustering in the first stage. It is then trained again through paraphrases with a cluster-aware adaptive regularization, allowing more intra-cluster variations. Furthermore, we introduce an overall metric DAUM, a Diversity-Accuracy Unified Metric to consider both the precision of the generated caption set and its coverage on the reference set, which has proved to have a higher correlation with human judgment than previous precision-only metrics. Extensive experiments on three large-scale video datasets show that the proposed training strategy can achieve superior performance in terms of accuracy, diversity, and DAUM over several baselines.

Abstract:
Multimodal fake news detection has attracted increasing attention recently. Existing works generally encode multimodal contents into a deterministic point in semantic subspaces, and then fuse multimodal features by simple concatenation or attention mechanisms. However, most methods struggle to adapt to noisy multimodal contents since they neglect the robustness of modality-specific features. Besides, as different modalities usually have varying confidence levels, previous attention-based fusion models that learn modality-independent weights based on the input data features would limit the optimal integration of multimodal contents. To alleviate the above issues, we propose a novel Multimodal Uncertainty Learning Network (MM-ULN) to enhance multimodal fake news detection by modeling both intra- and inter-modality uncertainty. Specifically, we incorporate a novel intra-modality uncertainty learning (EUL) module to better understand noisy multimodal contents. EULs provide feature regularization in a variational way, successfully alleviating the effects of data uncertainty within modalities. We design a new variational attention fusion (VAF) module to adaptively fuse multimodal contents with modality-dependent weights. The VAF module considers the relative confidence between modalities and makes it possible to explore complementary properties for detection. Extensive experiments on two benchmark datasets demonstrate the effectiveness and superiority of MM-ULN on multimodal fake news detection.

Abstract:
To support indoor scene understanding, room layouts have been recently introduced that define a few typical space configurations according to junctions and boundary lines. In this paper, we study camera pose estimation from eight common room layouts with at least two boundary lines that is cast as a PnL (Perspective-n-Line) problem. Specifically, the intersecting points between image borders and room layout boundaries, named image outer corners (IOCs), are introduced to create additional auxiliary lines for PnL optimization. Therefore, a new PnL-IOC algorithm is proposed which has two implementations according to the room layout types. The first one considers six layouts with more than two boundary lines where 3D correspondence estimation of IOCs creates sufficient line correspondences for camera pose estimation. The second one is an extended version to handle two challenging layouts with only two coplanar boundaries where correspondence estimation of IOCs is ill-posed due to insufficient conditions. Thus the powerful NSGA-II algorithm is embedded in PnL-IOC to estimate the correspondences of IOCs. In the last step, the camera pose is jointly optimized with 3D correspondence refinement of IOCs in the iterative Gauss-Newton algorithm. Experiment results on both simulated and real images show the advantages of the proposed PnL-IOC method on the accuracy and robustness of camera pose estimation from eight different room layouts over the existing PnL methods.

Abstract:
This paper presents the Cellular Binary Neural Network (CBNN), which is an efficient deep neural network with binary weights and activations. To address the challenge of performance drop caused by low-precision representation, the CBNN adopts multiple subnets which are connected via learnable global lateral paths. The introduced lateral connections are assumed to be sparse and grouped with respect to different source layers. The inter-network lateral connections and inner-network parameters are simultaneously optimized by the distributional loss, classification loss and the group sparse regularization term. Experiments on the CIFAR-10 and ImageNet datasets showed that, by incorporating optimized group-sparse lateral paths, the CBNN outperformed many state-of-the-art binary neural networks in terms of classification accuracy. Besides, to verify the generalization of the proposed binary model, we extended the CBNN to the semantic segmentation task. The CBNN takes advantage of the multiple subnets to derive more informative feature maps, which are computed by parallel aggregation in the last convolution block. Experiments on the PASCAL VOC segmentation dataset demonstrated that, under the same segmentation settings, the proposed method achieved superior performance over the other compared networks and even the full-precision counterpart.
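
As a rough illustration of why binary weights and activations are efficient, the sketch below shows the generic binary-network trick of replacing a dot product with XOR and a bit count. It is not taken from the CBNN paper: the sign-based `binarize` rule and the helper names `pack_bits` and `binary_dot` are assumptions made for this example only.

```python
def binarize(values):
    """Map real values to {-1, +1} with the sign function (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in values]


def pack_bits(signs):
    """Pack a {-1, +1} vector into an int; bit i is set when signs[i] == +1."""
    bits = 0
    for i, s in enumerate(signs):
        if s == 1:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n.

    Matching positions contribute +1 and mismatching ones -1, so the
    result is n - 2 * (number of mismatching bits).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches


# Hypothetical real-valued weights and activations.
weights = [0.7, -1.2, 0.1, -0.3]
activations = [-0.5, -0.9, 2.0, 0.4]

bw, bx = binarize(weights), binarize(activations)
reference = sum(w * x for w, x in zip(bw, bx))
fast = binary_dot(pack_bits(bw), pack_bits(bx), len(bw))
```

Here `fast` and `reference` agree, which is the property that lets binary networks trade multiply-accumulate operations for bitwise ones.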

Abstract:
Learned video compression has developed rapidly and achieved impressive progress in recent years. Despite efficient compression performance, existing signal fidelity oriented or semantic fidelity oriented video compression methods limit the capability to meet the requirements of both machine and human vision. To address this problem, a task-driven video compression framework is proposed to flexibly support vision tasks for both human vision and machine vision. Specifically, to improve the compression performance, the backbone of the video compression framework is optimized by using three novel modules, including multi-scale motion estimation, multi-frame feature fusion, and reference based in-loop filters. Then, based on the proposed efficient compression backbone, a task-driven optimization approach is designed to achieve the trade-off between signal fidelity oriented compression and semantic fidelity oriented compression. Moreover, a post-filter module is employed for the framework to further improve the performance of the human vision branch. Finally, rate-distortion performance, rate-accuracy performance, and subjective quality are employed as the evaluation metrics, and experimental results show the superiority of the proposed framework for both human vision and machine vision.

Abstract:
To improve users' experience and decrease their likelihood of quitting watching videos, this paper addresses the question of how to encode the videos used in adaptive bitrate (ABR) video streaming. When addressing ABR video streaming, a lot of effort has been put into developing ABR control schemes. However, ways to appropriately encode videos also need to be defined. Unlike previous approaches that focus on coding quality, this paper considers the user quitting ratio. The user quitting ratio is the percentage of users still watching videos at a given time and enables us to address the consequences of quality and stimulus duration on the decision of a user to quit. Considering the value of the user quitting ratio, this paper describes a method that uses content analysis, as well as a network's historical throughput data, to define how video should be encoded to decrease the likelihood of users quitting watching. Unlike previous approaches, the method is independent of the ABR control scheme used by the video player, and the selected ladders perform equivalently across different players with different behaviors. Results of experiments based on real-world network traces demonstrate the usefulness of the proposed method.

Abstract:
Multi-task learning is a successful learning framework which improves the performance of prediction models by leveraging knowledge among related tasks. Referring expression comprehension (REC) and segmentation (RES) are highly relevant tasks, as both are language-guided visual recognition tasks. However, their relations have not yet been fully exploited in previous works. In this paper, a Multiple Relational Learning Network (MRLN) is proposed for multi-task learning of REC and RES. First, a feature-feature interaction learning module is introduced to handle the complicated interactions among features. Moreover, we propose a feature-task dependence learning module, which associates the related features with target tasks. Furthermore, a task-task relationship learning module is designed, which captures the relationships among tasks automatically and guides the REC and RES fine-tuning adaptively. To verify our proposed approach, experiments are conducted on three benchmark datasets, i.e., RefCOCO, RefCOCO+, and RefCOCOg. Extensive experiments demonstrate that learning the multiple relationships is more appealing since it alleviates the prediction inconsistency issue in the multi-task setup. In addition, the experimental results report significant performance gains of MRLN over most existing methods, i.e., up to 83.46% for REC and 63.62% for RES over state-of-the-art methods, which demonstrate the validity and superiority of MRLN.

Abstract:
Generative Adversarial Networks (GANs) have been widely used for image-to-image translation-based facial attribute editing. Existing GANs are likely to generate samples with anomalies, which may be caused by the lack of consistency preservation and by feature entanglement. To preserve image consistency, many studies have resorted to the design of the network framework and loss functions, e.g., the cycle-consistency loss. However, a generator trained with the cycle-consistency loss cannot preserve the attribute-irrelevant features well, and its feature-level noise may cause synthesis abnormalities. For feature disentanglement, previous works were devoted to mining the implicit semantics of feature spaces, but these semantics are not stable and intuitive enough. For consistency preservation, we propose a target consistency loss to complement the cycle-consistency loss, enabling the network to learn to preserve features of the image more directly. Meanwhile, we filter out outlier feature maps to reduce the synthesis abnormalities and propose a dynamic dropout to better preserve the attribute-irrelevant features. For feature disentanglement, we encode the image semantics more stably and intuitively and propose an entropy regularization to decouple these semantics to allow independent editing of different attributes. The proposed modules are general and can be easily integrated with available image-to-image-based GAN models like StarGAN, AttGAN, and STGAN. Extensive experiments on the CelebA dataset show that our strategy can largely reduce the artifacts and better preserve the subtle facial features, and thus significantly improve the facial editing performance of these mainstream GAN models in terms of FID, PSNR and SSIM. Additional experiments on realistic expression editing show that our method outperforms StarGAN on RaFD, and achieves much better generalization performance than the three baselines on the FFHQ, RaFD and LFW datasets.

Abstract:
Image-text matching has become a challenging task in the multimedia analysis field. Many advanced methods have been used to explore local and global cross-modal correspondence in matching. However, most methods ignore the importance of eliminating potentially irrelevant features in the original features of each modality and in the cross-modal common features. Moreover, the features extracted from regions in images and words in sentences contain cluttered background noise and different occlusion noise, which negatively affects alignment. Different from these methods, we propose a novel DCT-Transformer Adversarial Network (DTAN) for image-text matching in this paper. This work can obtain an effective metric based on two aspects: i) DCT-Transformer uses the DCT (Discrete Cosine Transform) method based on a transformer mechanism to extract multi-domain common representations and eliminate irrelevant features from different modalities (inter-modal). Specifically, DCT divides multi-modal content into chunks of different frequencies and quantifies them. ii) The adversarial network introduces an adversarial idea by combining the original features of various single modalities and the multi-domain common representation, alleviating the background noise within each modality (intra-modal). The proposed adversarial feature augmentation method can easily obtain the common representation that is only useful for alignment. Extensive experiments are conducted on the benchmark datasets Flickr30K and MS-COCO, demonstrating the superiority of the DTAN model over the state-of-the-art methods.

Abstract:
Given the rapid growth of user-generated videos, internet traffic has been heavily dominated by online video streaming. Caching videos on edge servers in close proximity to users has been an effective approach to reduce the backbone traffic and the request response time, as well as to improve the video quality on the user side. Video popularity, however, can be highly dynamic over time. The cost of cache replacement at edge servers, particularly that related to service interruption during replacement, is not yet well understood. This paper presents a novel lightweight video caching algorithm for edge servers, seeking to optimize the hit rate with real-time decisions and minimized cost. Inspired by recent advances in deep Q-learning, our DQN-based online video caching (DQN-OVC) makes effective use of the rich and readily available information from users and networks. We decompose the Q-value function as a product of the video value function and the action function, which significantly reduces the state space. We instantiate the action function for cost-aware caching decisions with low complexity so that the cached videos can be updated continuously and instantly with dynamic video popularity. We used video traces from Tencent, one of the largest online video providers in China, to evaluate the performance of our DQN-OVC and to compare it with state-of-the-art solutions. The results demonstrate that DQN-OVC significantly outperforms the baseline algorithms in the edge caching context.

Abstract:
Unsupervised domain adaptation for person re-identification (Re-ID) suffers from severe domain discrepancies between source and target domains. To reduce the domain shift caused by changes of context, camera style, or viewpoint, existing methods in this field fine-tune and adapt the Re-ID model with augmented samples, either translating source samples to the target style or assigning pseudo labels to the target. The former methods may lose identity details but keep redundant source background during translation. In contrast, the latter techniques may give noisy labels when the model meets unseen backgrounds and person poses. We mitigate the domain shift in the former translation direction by cyclically decoupling environment and identity-related features. We propose a novel individual-preserving and environmental-switching cyclic generation network (IPES-GAN). Our network has the following distinct features: 1) Decoupled features instead of fused features: we encode the images into an individual part and an environmental part, which prove beneficial to generation and adaptation; 2) Cyclic generation instead of one-step adaptive generation: we swap source and target environment features to generate cross-domain images whose identity-related features are preserved and conditioned on source (target) background features, and then swap them again to regenerate the input image, so that cyclic generation runs in a self-supervised way. Experiments carried out on two significant benchmarks, Market-1501 and DukeMTMC-Reid, reveal state-of-the-art performance.

Abstract:
The demand for mobile multimedia streaming services has been steadily growing in recent years. Mobile multimedia broadcasting addresses the shortage of radio resources but introduces a network error recovery problem. Retransmitting multimedia segments that are not correctly broadcast can cause service disruptions and increased service latency, affecting the quality of experience perceived by end users. With the advent of networking paradigms based on virtualization technologies, mobile networks have been enabled with more flexibility and agility to deploy innovative services that improve the utilization of available network resources. This paper discusses how mobile multimedia broadcast services can be designed to prevent service degradation by using the computing capabilities provided by multiaccess edge computing (MEC) platforms in the context of a 5G network architecture. An experimental platform has been developed to evaluate the feasibility of a MEC application to provide adaptive error recovery for multimedia broadcast services. The results of the experiments carried out show that the proposal provides a flexible mechanism that can be deployed at the network edge to lower the impact of transmission errors on latency and service disruptions.

Abstract:
This paper proposes an image fusion framework based on separate representation learning, called IFSepR. Based on prior knowledge, we believe that both the co-modal image and the multi-modal image have common and private features, and that exploiting this disentangled representation can help image fusion, especially fusion rule design. Inspired by the autoencoder network and contrastive learning, a multi-branch encoder with contrastive constraints is built to learn the common and private features of paired images. In the fusion stage, based on the disentangled features, a general fusion rule is designed to integrate the private features; the fused private features and the common features are then combined and fed into the decoder to reconstruct the fused image. We perform a series of evaluations on three typical image fusion tasks, including multi-focus image fusion, infrared and visible image fusion, and medical image fusion. Quantitative and qualitative comparison with five state-of-the-art image fusion methods demonstrates the advantages of our proposed model.

Abstract:
Image fusion synthesizes a new image from multiple images of the same scene. The synthesized image should be suitable for human visual perception and follow-up high-level image-processing tasks. However, existing methods focus on fusing low-level features, ignoring high-level semantic perception information. We propose a new end-to-end model to obtain a more semantically consistent image in infrared and visible image fusion, termed semantic-supervised dual-discriminator generative adversarial network (SDDGAN). In particular, we design an information quantity discrimination (IQD) block to guide the fusion process. For each source image, the block determines the weight for preserving each semantic object’s feature. In this way, the generator learns to fuse various semantic objects via different weights to preserve their characteristics. Moreover, the dual discriminator is employed to identify the distribution of infrared and visible information in the fused image. Each discriminator acts on a certain modality (infrared/visible) of different semantic objects in the fused image to preserve and enhance their modality features. Thus, our fused image is more informative: both the thermal radiation in the infrared image and the texture details in the visible image are well preserved. Qualitative and quantitative experiments demonstrate the superiority of our SDDGAN over state-of-the-art methods in terms of visual effects, efficiency, and quantitative metrics.

Affiliations: Department of Applied Mathematics and Theoretical Physics (DAMTP), Hong Kong University of Science and Technology, University of Cambridge, Cambridge, U.K.; College of Computer Science and Technology, Zhejiang University, Shatin, China; Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong SAR, China; Department of Computer Science, Dalian University of Technology, Dalian, China; School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China; School of Science and Technology, Hong Kong Metropolitan University, Ho Man Tin, Hong Kong SAR, China; Department of Applied Mathematics and Theoretical Physics (DAMTP), University of Cambridge, Cambridge, U.K.; School of Computer Science and Engineering, South China University of Technology, Guangzhou, China

Abstract:
RGB-D salient object detection aims to detect visually distinctive objects or regions from a paired RGB image and depth image. State-of-the-art RGB-D saliency detectors are mainly based on convolutional neural networks, but most of them suffer from an intrinsic reliance on labeled data, which degrades detection accuracy in complex cases. In this work, we present a self-supervised self-ensembling network (S^3Net) for semi-supervised RGB-D salient object detection by leveraging the unlabeled data and exploring a self-supervised learning mechanism. To be specific, we first build a self-guided convolutional neural network (SG-CNN) as a baseline model by developing a series of three-layer cross-model feature fusion (TCF) modules to leverage complementary information among depth and RGB modalities and formulating an auxiliary task that predicts a self-supervised image rotation angle. After that, to further explore the knowledge from unlabeled data, we assign SG-CNN to a student network and a teacher network, and encourage the saliency predictions and self-supervised rotation predictions from these two networks to be consistent on the unlabeled data. Experimental results on seven widely-used benchmark datasets demonstrate that our network quantitatively and qualitatively outperforms the state-of-the-art methods.
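
The rotation-prediction auxiliary task mentioned above can be illustrated with a minimal sketch of how self-supervised training pairs are usually generated. The abstract does not state which angles are used, so the four multiples of 90 degrees below are an assumption (the common choice for this pretext task), and `rotate90` and `rotation_pretext_samples` are hypothetical helper names.

```python
def rotate90(image):
    """Rotate a 2D list-of-lists image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_pretext_samples(image):
    """Return (rotated_image, label) pairs for 0, 90, 180 and 270 degrees.

    The label is the number of 90-degree clockwise turns applied, which a
    network can be trained to predict without any human annotation.
    """
    samples = []
    current = [list(row) for row in image]
    for label in range(4):
        samples.append((current, label))
        current = rotate90(current)
    return samples


# Toy 2x2 "image" standing in for an unlabeled training sample.
toy_image = [[1, 2],
             [3, 4]]
pairs = rotation_pretext_samples(toy_image)
```

Each unlabeled image thus yields four labeled training examples for the auxiliary head, while the saliency head is supervised only where ground truth exists.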

Abstract:
Recent works have validated the benefit of integrating spatial information into deep networks to improve pixel-level prediction tasks such as monocular depth estimation. However, how to efficiently and robustly integrate spatial cues remains an open problem. In this paper, we introduce the Side Prediction Aggregation (termed SPA) method to enhance the embedding of scene structural information from low-level to high-level layers. To improve the estimation accuracy, the proposed method is further equipped with a continuous Spatial Refinement Loss (termed SRL) at multiple resolutions with negligible extra computation. Besides, the proposed sequential network can further perform adversarial learning at multiple resolutions. Such an adversarial refinement strategy greatly improves the accuracy of the estimated depth with little extra computation. Without using any pre-trained models, our network achieves state-of-the-art accuracy on the KITTI, NYUD V2, and Cityscapes datasets while performing online depth estimation in real time.

Abstract:
Despite the development of computer vision techniques, the micro-expression (ME) recognition task still remains a great challenge because MEs have very low intensity and short duration. However, the ME recognition is of great significance since it provides important clues for real affective states detection. This paper proposes a novel Block Division Convolutional Network (BDCNN) with the implicit deep features augmentation. In detail, BDCNN learns from four optical flow features computed by the onset and apex frames of each video. It innovatively divides each image into a set of small blocks in the deep learning model, then the convolution and pooling operations are performed on these small blocks in sequence. To handle the small sample size problem in the micro-expression data, this study uses the improved implicit semantic data augmentation algorithm in the deep features space. Experiments are conducted on three publicly available databases, viz, CASME II, SMIC, and SAMM. Experimental results show that our model outperforms the state-of-the-art methods by attaining the accuracy of 84.32% and F1-score of 82.13% on the 3-class datasets, and the accuracy of 81.82% and F1-score of 75.46% on the 5-class datasets, respectively. Our source code is publicly available for non-commercial or research use at https://github.com/MLDMXM2017/BDCNN.

Abstract:
Obscured person re-identification (Re-ID) aims to match an obscured image with a complete image of the same person captured by other cameras. As a major challenge in person identification, occlusion severely affects the effectiveness of most traditional person Re-ID methods. To solve this problem, this study proposes a trajectory association method, which, as a pre-processing technique for person Re-ID, can narrow the search range and reduce the problem of degradation caused by mixing. We investigate the method of converting the fuzzy association between sets into the precise association between elements for M video objects and N phone objects (trajectory information) with fuzzy group association relationships at the crime scene. First, we decompose the M-N precise association problem and analyze the similarity of the video objects in the source point and on the trajectories. Then, we define high-similarity points, study their distribution characteristics in different trajectories, and find that there is a significant difference between the distribution of high-similarity points in correct and incorrect matching trajectories. We simplify the full-path association problem into a partial-path high-similarity point distribution difference problem, which effectively reduces the difficulty in accurate association relationship construction. The association experiments in simple and mixed scenarios as well as Re-ID experiments on the PRPW and Market1501 demonstrate the effectiveness of our method.

Abstract:
Rib fracture is a common type of thoracic skeletal trauma, and its inspection using computed tomography (CT) scans is critical for clinical evaluation and treatment planning. However, it is often challenging for radiologists to quickly and accurately detect rib fractures due to tiny objects and blurriness in large 3D CT images. Previous approaches to automatic rib fracture diagnosis mostly relied on deep learning (DL)-based object detection, which highly depends on label quality and quantity. Moreover, general object detection methods did not take into consideration the typically elongated and oblique shapes of ribs in 3D volumes. To address these issues, we propose a shape-aware method based on DL called SA-FracNet for rib fracture detection and segmentation. First, we design a pixel-level pretext task founded on contrastive learning on massive unlabeled CT images. Second, we fine-tune the rib fracture detection model starting from the pre-trained weights. Third, we develop a fracture shape-aware multi-task segmentation network to delineate the fracture based on the detection result. Experiments demonstrate that our proposed SA-FracNet achieves state-of-the-art rib fracture detection and segmentation performance on the public RibFrac dataset, with a detection sensitivity of 0.926 and segmentation Dice of 0.754. Tests on a private dataset also validate the robustness and generalization of our SA-FracNet.
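
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of their standard definitions: Dice as twice the overlap divided by the total foreground size, and sensitivity as true positives over all actual positives. The function names and toy masks are illustrative assumptions; the RibFrac evaluation protocol may differ in details such as how detections are matched to ground truth.

```python
def dice_coefficient(pred, target):
    """Dice = 2 * |P and T| / (|P| + |T|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def sensitivity(true_positives, false_negatives):
    """Fraction of actual positives that were detected (recall)."""
    denom = true_positives + false_negatives
    return true_positives / denom if denom else 0.0


# Toy flattened masks: one voxel of overlap out of two foreground voxels each.
predicted_mask = [1, 1, 0, 0]
reference_mask = [1, 0, 1, 0]
toy_dice = dice_coefficient(predicted_mask, reference_mask)
```

In this toy case `toy_dice` is 0.5, since the masks share one foreground voxel and each contains two.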

Abstract:
Due to its significant capability of modeling long-range dependencies, vision transformer (ViT) has achieved promising success in both holistic and occluded person re-identification (Re-ID) tasks. However, the inherent problems of transformers such as the huge computational cost and memory footprint are still two unsolved issues that will block the deployment of ViT based person Re-ID models on resource-limited edge devices. Our goal is to reduce both the inference complexity and model size without sacrificing the comparable accuracy on person Re-ID, especially for tasks with occlusion. To this end, we propose a novel attention map guided (AMG) transformer pruning method, which removes both redundant tokens and heads with the guidance of the attention map in a hardware-friendly way. We first calculate the entropy in the key dimension and sum it up for the whole map, and the corresponding head parameters of maps with high entropy will be removed for model size reduction. Then we combine the similarity and first-order gradients of key tokens along the query dimension for token importance estimation and remove redundant key and value tokens to further reduce the inference complexity. Comprehensive experiments on Occluded DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals. For example, our proposed pruning strategy on ViT-Base enjoys 29.4% FLOPs savings with 0.2% drop on Rank-1 and 0.4% improvement on mAP, respectively.
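
The head-pruning criterion described above (entropy along the key dimension, summed over the whole attention map, with high-entropy heads removed) can be illustrated with a minimal NumPy sketch. The function names, the toy attention maps, and the `num_prune` parameter are assumptions made for illustration; this is not the authors' implementation, and the token-pruning step is not covered.

```python
import numpy as np

def head_entropy_scores(attn, eps=1e-12):
    """Score each head by the entropy of its attention map.

    attn: array of shape (heads, queries, keys); each row sums to 1 over keys.
    Returns one score per head: row-wise entropy over keys, summed over queries.
    """
    attn = np.asarray(attn, dtype=float)
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)  # (heads, queries)
    return row_entropy.sum(axis=-1)                           # (heads,)

def heads_to_prune(attn, num_prune):
    """Return the indices of the num_prune heads with the highest entropy."""
    scores = head_entropy_scores(attn)
    order = np.argsort(-scores)  # descending entropy
    return sorted(order[:num_prune].tolist())

# Toy example (hypothetical data): head 0 is sharply peaked, head 1 is uniform.
peaked = np.eye(4)
uniform = np.full((4, 4), 0.25)
attn = np.stack([peaked, uniform])

scores = head_entropy_scores(attn)
pruned = heads_to_prune(attn, num_prune=1)
print(scores, pruned)
```

In this toy case the uniform head carries the least selective attention pattern, so it is the one ranked first for removal.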

Abstract:
Weakly supervised semantic segmentation with only image-level labels aims to reduce annotation costs for the segmentation task. Existing approaches generally leverage class activation maps (CAMs) to locate the object regions for pseudo label generation. However, CAMs can only discover the most discriminative parts of objects, thus leading to inferior pixel-level pseudo labels. To address this issue, we propose a saliency guided Inter- and Intra-Class Relation Constrained (I^2CRC) framework to assist the expansion of the activated object regions in CAMs. Specifically, we propose a saliency guided class-agnostic distance module to pull the intra-category features closer by aligning features to their class prototypes. Further, we propose a class-specific distance module to push the inter-class features apart and encourage the object region to have a higher activation than the background. Besides strengthening the capability of the classification network to activate more integral object regions in CAMs, we also introduce an object guided label refinement module to take a full use of both the segmentation prediction and the initial labels for obtaining superior pseudo-labels. Extensive experiments on PASCAL VOC 2012 and COCO datasets demonstrate well the effectiveness of I^2CRC over other state-of-the-art counterparts.

Affiliations: Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing, China; Digital Content and Media Sciences Research Division, National Institute of Informatics, Tokyo, Japan; Department of Mechanical and Control Engineering, Kyushu Institute of Technology, Kitakyushu, Japan; School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing, China; College of Automation and College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing, China

Abstract:
Representation learning steered robust face image super-resolution (FSR) methods have attracted extensive attention in the past few decades. Most previous methods were devoted to exploiting the local position patches in the training set for FSR. However, they usually overlooked the sufficient usage of the contextual information around the testing patches, which are useful for stable representation learning. In this article, we attempt to utilize the context-patch around the testing patch and propose a method named context-patch representation learning with adaptive neighbor embedding (CRL-ANE) for FSR. On one hand, we simultaneously use the testing position patch and its adjacent ones for stable representation weight learning. This contextual information can compensate for recovering missing details in the target patch. On the other hand, for each input patch set, due to its inherent facial structural properties, we design an adaptive neighbor embedding strategy to elaborately and adaptively choose primary candidates for more accurate reconstruction. These two improvements enable the proposed method to achieve better SR performance than some of the other methods. Qualitative and quantitative experiments on some benchmarks have validated the superiority of the proposed method over some state-of-the-art methods.

Abstract:
Weakly supervised person re-identification (Re-ID) is appealing to handle real-world tasks by using state information that is available without manual annotation. At present, most methods perform unsupervised cross domain (UCD) learning by transferring the knowledge from the labeled source domain to the unlabeled target domain, which results in poor performance due to the severe shift. To address this problem, in this paper, we utilize the tracklet and camera information as weak supervision to propose a distribution discrepancy minimization learning (DDML) model for UCD person Re-ID. In addition to aligning data distributions from the perspective of domain adaptation learning, two losses are developed from the view of neighborhood invariance exploration to optimize matching results. Specifically, to bridge the gap between domains, we propose a camera-distribution-based (CDB) loss to align pair-wise distance distributions. Furthermore, to alleviate the biased search within the target domain, we propose a ranking-confidence-based (RCB) loss to perform the mined neighborhood for intra-camera and inter-camera separately to explore a high degree of confidence neighbor relations. Extensive experiments on three challenging datasets demonstrate that applying our method to unlabeled target domain outperforms current weakly supervised methods for person Re-ID.

Abstract:
Plenoptic point clouds are more complete representations of three-dimensional (3-D) objects than single-color point clouds, as they can have multiple colors per spatial point, representing colors of each point as seen from different view angles. They are more realistic but also involve a larger volume of data in need of compression. Therefore, in this paper, a multiview-video-based framework is proposed to exploit the correlations in color across different viewpoints to compress plenoptic point clouds efficiently. To the best of the authors' knowledge, this is the first work to exploit correlations in color across different viewpoints using a multiview-video-based framework. In addition, it is observed that some unoccupied pixels, which do not have corresponding points in plenoptic point clouds and are of no use to the quality of the reconstructed plenoptic point cloud colors, may cost many bits. To address this problem, a block-based group smoothing and a combined occupancy-map-based rate distortion optimization and four-neighbor average residual padding are further proposed to reduce the bit cost of unoccupied color pixels. The proposed algorithms are implemented in the moving pictures experts group (MPEG) video-based point cloud compression (V-PCC) and multiview extension of High Efficiency Video Coding (MV-HEVC) reference software. Compared with the V-PCC independently applied to each view direction, the proposed algorithms can provide a BD-rate reduction of over 70%.
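
As a rough illustration of the four-neighbor averaging idea, the sketch below fills each unoccupied pixel with the mean of its occupied up/down/left/right neighbors. This is an assumption-laden simplification: it operates on raw pixel values rather than on prediction residuals inside a codec, it runs a single pass, and the function name `pad_unoccupied` is hypothetical.

```python
import numpy as np

def pad_unoccupied(values, occupancy):
    """Fill unoccupied pixels with the average of their occupied 4-neighbors.

    values:    (H, W) array of pixel values.
    occupancy: (H, W) boolean-like array, True where a pixel maps to a real point.
    Unoccupied pixels with no occupied neighbor are left unchanged.
    """
    values = np.asarray(values, dtype=float)
    occ = np.asarray(occupancy, dtype=bool)
    out = values.copy()
    h, w = values.shape
    for y in range(h):
        for x in range(w):
            if occ[y, x]:
                continue
            neighbors = [
                values[ny, nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < h and 0 <= nx < w and occ[ny, nx]
            ]
            if neighbors:
                out[y, x] = sum(neighbors) / len(neighbors)
    return out

# Toy example (hypothetical data): only the center pixel is unoccupied.
values = np.array([[0, 2, 0], [4, 99, 6], [0, 8, 0]])
occupancy = np.ones((3, 3), dtype=bool)
occupancy[1, 1] = False
padded = pad_unoccupied(values, occupancy)
```

Smoothing unoccupied pixels toward their neighbors tends to make the padded region cheaper to encode, which is the motivation stated in the abstract.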

Abstract:
Multimodal Emotion Recognition is challenging because of the heterogeneity gap among different modalities. Due to the powerful ability of feature abstraction, Deep Neural Networks (DNNs) have exhibited significant success in bridging the heterogeneity gap in cross-modal retrieval and generation tasks. In this work, a DNNs-based Multi-channel Weight-sharing Autoencoder with Cascade Multi-head Attention (MCWSA-CMHA) is proposed to generically address the affective heterogeneity gap in MER. Specifically, multimodal heterogeneity features are extracted by multiple independent encoders, and then a scalable heterogeneous feature fusion module (CMHA) is realized by connecting multiple multi-head attention modules in series. The core of the proposed algorithm is to reduce the heterogeneity between the output features of different encoders through the unsupervised training of MCWSA, and then to model the affective interactions between different modal features through the supervised training of CMHA. Experimental results demonstrate that the proposed MCWSA-CMHA achieves outperformance on two publicly available datasets compared with the state-of-the-art techniques. In addition, visualization experiments and approximation experiments are used to verify the effectiveness of each module in the proposed algorithm, and the experimental results show that the proposed MCWSA-CMHA can mine more emotion-related information among multimodal features compared with other fusion methods.

Abstract:
The security of spread spectrum (SS) watermarking largely depends on the difficulty of estimating its secret key. Some estimators have been proposed to estimate the secret key in the known-message attack (KMA) scenario. However, the estimation accuracies of existing estimators are not satisfactory when the number of observations is not large enough. Currently, it is still a challenging and open problem to design more effective estimators. In this paper, we propose an equivalent keys (EK)-based estimator to estimate the secret key for both the traditional and more secure SS watermarking methods. Equivalent keys form an equivalent region, which is the intersection of a unit hypersphere and a hypercone. According to the Monte Carlo simulation, we find that the secret key can be estimated by adding up the equivalent keys uniformly sampled from the equivalent region. Thus, the proposed estimator selects equivalent keys from randomly-generated vectors by exploiting the pairs of watermarked signals and their embedded messages. A theoretical analysis is performed for the proposed estimator to evaluate the estimation accuracy. Experimental results verify the theoretical analysis and show the superiority of the proposed estimator over existing estimation methods. Furthermore, this paper also shows the insecurity of the more secure SS watermarking methods in the KMA scenario from a practical perspective for the first time.

Abstract:
JPEG image encryption aims at effectively converting the original JPEG image into a noise-like image that does not contain any useful information of original image. Existing schemes for JPEG image encryption, however, may not attain a good balance in terms of file size increment and encryption security. To address the problem, we design a novel JPEG image encryption scheme. Different from existing schemes, we first predict DC coefficients by an adaptive prediction method. Subsequently, the histogram of DC coefficient prediction errors is encrypted by combining the prediction errors and random integers to reduce the encoded length, which can ensure a very small increment of file size. Furthermore, we construct the RS (run/size) pairs in each DCT block and then implement the permutation for both RS pairs extracted from the upper left corner of each DCT block and all DCT blocks excluding DC coefficients, which can further distort the image contents. Extensive experiments demonstrate that, compared with existing JPEG image encryption schemes, our scheme can ensure not only the JPEG format compatibility for encrypted image, but also keep a very small file size increment and the superior security performance.

Abstract:
Vehicle Re-Identification (ReID) aims to find the same vehicle in images captured from different views under cross-camera scenarios. Traditional methods focus on depicting the holistic appearance of a vehicle, but they struggle with hard samples that share the same vehicle type and color. Recent works leverage discriminative visual cues to solve this problem, where three challenges remain. First, vehicle features are misaligned and distorted because of the viewpoint variance. Second, the discriminative visual cues are usually subtle and are easily diluted by the large area of non-discriminative regions in subsequent average pooling modules. Third, these discriminative visual cues are dynamic for the same image when it is compared with different vehicle images. To tackle the above problems, we project the vehicle images from 2D to 3D space and rotate them to the same view, and leverage the viewpoint-aligned features to enhance the discriminative parts for vehicle ReID. In detail, our method consists of three sub-modules. 1) The 3D viewpoint alignment module restores the 3D information of the vehicle from a single vehicle image, and then rotates and re-renders it under fixed viewpoints. It enables fine-grained viewpoint alignment and relieves the distortion of the vehicle caused by the viewpoint variation. 2) The discriminative parts enhancement module performs feature enhancement guided by the prior distribution of distinctive parts. 3) The adaptive duplicated parts suppression module guides the network to focus on the most discriminative parts, which not only prevents the dilution of the high responses but also provides explainable evidence. The experimental results show that our method achieves a new state of the art on a large-scale vehicle ReID dataset.

Abstract:
Gait recognition aims to identify people by their walking patterns. Normal human walking is a periodic movement, however, existing gait recognition methods rarely make use of gait periodicity. In this paper, we propose the gait Periodicity-inspired Temporal feature Pyramid aggregator (PTP), which introduces gait periodicity priors into gait feature extraction, resulting in a strong and robust skeleton-based gait recognition method called CycleGait. Specifically, inspired by gait periodicity, PTP adopts multiple parallel temporal convolution operators with pyramid temporal kernel sizes to extract temporal gait features. Then, PTP cooperates with the spatial Graph Convolutional Network (GCN) to form the GCN-PTP network. CycleGait uses this network to extract spatio-temporal gait features from a sequence of skeleton coordinates. In addition, to improve CycleGait's robustness and performance, we feed more gait samples with various gait cycles into CycleGait with the plug-and-play Irregular Pace Converter (IPC), which can automatically convert normal pace into irregular and reasonable paces. Extensive experiments conducted on the CASIA-B dataset and OG RGB+D dataset show that CycleGait has excellent performance in various complex scenarios, namely, cross-view and cross-walking conditions, and becomes one of the best SOTA methods, which not only outperforms the best preexisting gait recognition methods by a large margin but also exhibits a significant level of robustness.
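
The idea of several parallel temporal operators with pyramid kernel sizes can be sketched as follows. This is only a structural illustration: the learned convolution weights are replaced by fixed moving-average kernels, the branches are fused by summation, and the kernel sizes `(3, 5, 7)` and the function name are assumptions rather than details taken from the paper.

```python
import numpy as np

def temporal_pyramid(features, kernel_sizes=(3, 5, 7)):
    """Apply parallel temporal filters of different lengths and sum the branches.

    features: (T, C) array, one feature vector per frame.
    Each branch uses a moving-average kernel as a stand-in for a learned
    temporal convolution; zero padding keeps the output length equal to T.
    """
    features = np.asarray(features, dtype=float)
    branches = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k
        branch = np.stack(
            [np.convolve(features[:, c], kernel, mode="same")
             for c in range(features.shape[1])],
            axis=1,
        )
        branches.append(branch)
    return np.sum(branches, axis=0)  # (T, C)

# Toy example (hypothetical data): 16 frames, 2 channels, constant signal.
sequence = np.ones((16, 2))
out = temporal_pyramid(sequence)
```

Short kernels respond to fast changes within a gait cycle while longer kernels cover a larger share of the period, which is the intuition behind using several kernel sizes in parallel.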

Abstract:
Single image dehazing is a critical problem in computer vision. However, most recently proposed learning-based dehazing methods produce dehazed images of unsatisfactory quality due to inaccurate parametric estimation. These models are also too large to deploy on resource-limited mobile devices. Finally, most models are tailored to image dehazing and transfer poorly to other tasks. Thus, we propose a compact multiscale attention feature fusion network with a model size of 2 MB called MSAFF-Net to perform end-to-end single image dehazing. In the proposed model, we design a simple and powerful feature extraction module to extract complex features from hazy images. We use a channel attention module and a multiscale spatial attention module to consider the regions with haze-relevant features. To our knowledge, this study is the first to directly apply the attention mechanism rather than to embed it into certain modules for single image dehazing. We compare MSAFF-Net with other approaches on the NTIRE18, RESIDE, and Middlebury Stereo datasets. We show that MSAFF-Net achieves comparable or better performance than other models. We also extend MSAFF-Net to single image deraining, and various experiments demonstrate its effectiveness. Results suggest that MSAFF-Net can directly restore clear images using channels with the most useful haze- or rain-relevant features and spatial locations.

Abstract:
With the rapid development of virtual reality (VR) technology, VR headsets, a.k.a. Head-Mounted Displays (HMDs), are widely available, allowing immersive 3D content to be viewed. A natural need for truly immersive VR is to allow bidirectional communication: the user should be able to interact with the virtual world using facial expressions and eye gaze, in addition to traditional means of interaction. The typical application scenario includes VR virtual conferencing and virtual roaming, where ideally users are able to see other users’ expressions and have eye contact with them in the virtual world. In addition, eye gaze also provides a natural means of interaction with virtual objects. Despite significant achievements in recent years for reconstruction of 3D faces from RGB or RGB-D images, it remains a challenge to reliably capture and reconstruct 3D facial expressions including eye gaze when the user is wearing an HMD, because the majority of the face is occluded, especially those areas around the eyes which are essential for recognizing facial expressions and eye gaze. In this paper, we introduce a novel real-time system that is able to capture and reconstruct 3D faces wearing HMDs, and robustly recover eye gaze. We further propose a novel method to map eye gaze directions to the 3D virtual world, which provides a novel and useful interactive mode in VR. We compare our method with state-of-the-art techniques both qualitatively and quantitatively, and demonstrate the effectiveness of our system using live capture.

Abstract:
Generative Adversarial Networks (GANs) have been widely used in image translation, but their high computation and storage costs impede their deployment on mobile devices. Prevalent methods for CNN compression cannot be directly applied to GANs due to the peculiarities of GAN tasks and the unstable adversarial training. To solve these problems, in this paper, we introduce a novel GAN compression method, termed DMAD, by proposing a Differentiable Mask and a co-Attention Distillation. The former searches for a light-weight generator architecture in a training-adaptive manner. To overcome channel inconsistency when pruning the residual connections, an adaptive cross-block group sparsity is further incorporated. The latter simultaneously distills informative attention maps from both the generator and discriminator of a pre-trained model to the searched generator, effectively stabilizing the adversarial training of our light-weight model. Experiments show that DMAD can reduce the Multiply Accumulate Operations (MACs) of CycleGAN by 13× and that of Pix2Pix by 4× while retaining performance comparable to the full model.

Abstract:
Speech emotion recognition has always been a challenging task due to the difference in emotion expression and perception. Currently, in the supervised speech emotion recognition systems, the soft label overcomes the disadvantage of the hard label losing annotations variability and emotion perception subjectivity, but it only considers the emotion perceptions of a few annotators and thus still brings high statistical error. For this issue, this paper redefines the target and designs a novel loss function (denoted as inter-class difference loss), which enables the network to adaptively learn an emotion distribution in all utterances. This not only restricts the negative class probability less than the positive class probability, but also limits the negative class probability close to zero. To make the speech emotion recognition system more efficient, this paper proposes an end-to-end network, called response residual network (R-ResNet), which incorporates the ResNet for features extraction, together with the emotion response module for data augmentation and variable-length data processing. Finally, the experimental results not only demonstrate the advanced performance of our work, but also confirm that the ambiguous utterances contain emotional characteristics. In addition, another interesting finding is that, on the unbalanced dataset, the batch normalization (BN) after addition performs better than BN before addition.

Abstract:
3D human pose estimation using monocular images is an important yet challenging task. Existing 3D pose detection methods exhibit excellent performance under normal conditions; however, their performance may degrade due to occlusion. Recently, some occlusion-aware methods have also been proposed; however, the occlusion handling capability of these networks has not yet been thoroughly investigated. In the current work, we propose an occlusion-guided 3D human pose estimation framework and quantify its occlusion handling capability by using different protocols. The proposed method estimates more accurate 3D human poses using 2D skeletons with missing joints as input. Missing joints are handled by introducing occlusion guidance that provides extra information about the absence or presence of a joint. Temporal information has also been exploited to better estimate the missing joints. A large number of experiments are performed to quantify the occlusion handling capability of the proposed method on three publicly available datasets in various settings, including random missing joints, fixed body parts missing, and complete frames missing, using the mean per joint position error criterion. In addition, the quality of the predicted 3D poses is evaluated using action classification performance as a criterion. 3D poses estimated by the proposed method achieve significantly improved action recognition performance in the presence of missing joints. Our experiments demonstrate the effectiveness of the proposed framework for handling missing joints as well as for quantifying the occlusion handling capability of deep neural networks.
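
Two ingredients mentioned above lend themselves to a short sketch: attaching a presence/absence flag to each 2D joint, and the mean per joint position error (MPJPE) used for evaluation. The input encoding shown here (zeroing missing joints and appending a 0/1 flag) is an assumed illustration of occlusion guidance, not the paper's exact input format; MPJPE follows its standard definition as the average Euclidean distance between predicted and ground-truth joints.

```python
import numpy as np

def occlusion_guided_input(joints_2d, visible):
    """Build a per-joint input of (x, y, visibility flag).

    joints_2d: (J, 2) array of 2D joint coordinates.
    visible:   (J,) array of 0/1 flags, 0 meaning the joint is missing.
    Missing joints are zeroed so that only the flag signals their absence.
    """
    joints_2d = np.asarray(joints_2d, dtype=float)
    visible = np.asarray(visible, dtype=float)
    return np.concatenate([joints_2d * visible[:, None], visible[:, None]], axis=1)

def mpjpe(pred, gt):
    """Mean per joint position error: average Euclidean distance over joints."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

# Toy example (hypothetical data): two joints, the second one occluded.
guided = occlusion_guided_input([[1, 2], [3, 4]], [1, 0])
error = mpjpe([[3, 4, 0], [0, 0, 0]], [[0, 0, 0], [0, 0, 0]])
```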

Abstract:
We introduce Grouping by Center, a novel grouping approach for bottom-up human pose estimation, which detects human joints first and then groups them. The grouping strategy is the critical factor for bottom-up pose estimation. To increase conciseness and accuracy, we propose to use the center of the body as a grouping clue. More concretely, we predict the offsets from the keypoints to the body centers. Keypoints whose shifted positions align will be grouped as one person. However, the multi-scale variance of people can affect the prediction of the grouping clue, which has been neglected in previous research. To resolve the scale variance of the offset, we put forward a Multi-scale Translation Layer and an iterative refinement. Furthermore, we design a greedy grouping strategy with a dynamic threshold to account for the various scales of instances. Through a comprehensive comparison, our framework is validated to be effective and practical. We also report state-of-the-art performance for bottom-up multi-person pose estimation on the MS-COCO and CrowdPose datasets.
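
The grouping clue described above can be sketched in a few lines: each keypoint is shifted by its predicted offset, and keypoints whose shifted positions land close together are assigned to the same person. The greedy loop and the scale-dependent threshold below are assumptions made for illustration (the threshold is taken as a fraction of the group's mean offset length); the Multi-scale Translation Layer and iterative refinement are not modeled.

```python
import numpy as np

def group_by_center(keypoints, offsets, ratio=0.5, min_threshold=1.0):
    """Greedily group keypoints by their predicted body centers.

    keypoints: (N, 2) keypoint locations.
    offsets:   (N, 2) predicted offsets from each keypoint to its body center.
    A keypoint joins the nearest existing group whose center lies within a
    dynamic threshold; otherwise it starts a new group.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    centers = keypoints + offsets               # shifted positions
    scales = np.linalg.norm(offsets, axis=1)    # crude per-keypoint scale proxy
    groups = []                                 # lists of keypoint indices
    for i in range(len(keypoints)):
        best, best_dist = None, np.inf
        for g in groups:
            group_center = centers[g].mean(axis=0)
            threshold = max(min_threshold, ratio * scales[g].mean())
            dist = np.linalg.norm(centers[i] - group_center)
            if dist <= threshold and dist < best_dist:
                best, best_dist = g, dist
        if best is None:
            groups.append([i])
        else:
            best.append(i)
    return groups

# Toy example (hypothetical data): two people with centers (10, 10) and (50, 50).
keypoints = [[10, 5], [50, 40], [8, 12], [45, 55], [12, 12]]
offsets = [[0, 5], [0, 10], [2, -2], [5, -5], [-2, -2]]
groups = group_by_center(keypoints, offsets)
print(groups)
```

Tying the threshold to the offset length means larger instances tolerate larger center disagreements, which is one simple way to realize a dynamic threshold.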

Abstract:
Document object detection is a challenging task due to layout complexity and object diversity. Most existing methods focus mainly on visual information, neglecting the inherent spatial relationships among document objects. To capture structural information and contextual dependencies, we propose a novel document object detector based on spatial-related relation and vision (SRRV). It consists of three parts: a vision feature extraction network, a relation feature aggregation network, and a result refinement network. The vision feature extraction network enhances information propagation in the hierarchical feature pyramid by adopting feature augmentation paths. The relation feature aggregation network then combines a graph construction module and a graph learning module. Specifically, the graph construction module calculates spatial information from the geometric attributes of region proposals to encode relation information, while the graph learning module stacks Graph Convolutional Network (GCN) layers to aggregate relation information at a global scale. Both the vision and relation features are fed into the result refinement network for feature fusion and relational reasoning. Experiments on the PubLayNet, POD and Article Regions datasets demonstrate that spatial relation information improves performance, yielding better accuracy and more precise bounding box prediction.

Abstract:
Attention-based networks have recently become prevalent in visual question answering (VQA) due to their high performance. However, the extensive memory consumption of attention-based models places excessively high demands on the deployment hardware, raising concerns about their future application scenarios. Therefore, designing an efficient and lightweight VQA model is central to expanding possible application areas. Our work presents a novel lightweight attention-based VQA model, namely residual weight-sharing attention network (RWSAN), consisting of residual weight-sharing attention (RWSA) layers cascaded in depth. Each RWSA layer models the textual representation with self residual weight-sharing attention (SRWSA) and captures question features and question-image interactions with self-guided residual weight-sharing attention (SGRWSA). Inside each RWSA layer, the proposed low-rank attention (LRA) units perform residual learning with learned connection patterns and shared parameters, and every stacked RWSA layer also uses the same parameters. Extensive ablation experiments with quantitative and qualitative analysis are conducted to illustrate the effectiveness and generality of RWSA. Experiments on VQA-v2, GQA, and CLEVR datasets show that the RWSAN achieves competitive performance with far fewer parameters than the state-of-the-art methods. We release our code at https://github.com/BrightQin/RWSAN.

Abstract:
Deep learning based visual-to-sound generation systems have been developed that identify and create audio features from video signals. However, these techniques often fail to consider the time-synchronicity of the visual and audio features. In this paper we introduce a novel method for guiding a class-conditioned GAN to synthesize representative audio with temporally-extracted visual information. We accomplish this visual-to-sound generation task by adapting the synchronicity traits between the audio-visual modalities. Our proposed FoleyGAN model is capable of conditioning action sequences of visual events leading to the generation of visually aligned realistic soundtracks. We expanded our previously proposed Automatic Foley data set. We evaluated FoleyGAN’s synthesized sound output through human surveys that show noteworthy (on average 81%) audio-visual synchronicity performance. Our approach outperforms other baseline models and audio-visual data sets in statistical and ablation experiments achieving improved IS, FID and NDB scores. In ablation analysis we showed the significance of our visual and temporal feature extraction method as well as augmented performance of our generation network. Overall, our FoleyGAN model showed sound retrieval accuracy of 76.08% surpassing existing visual-to-audio synthesis deep neural networks.

Abstract:
No-reference bitstream-layer models for point cloud quality assessment (PCQA) use the information extracted from a bitstream for real-time and nonintrusive quality monitoring. We propose a no-reference bitstream-layer model for the perceptual quality assessment of video-based point cloud compression (V-PCC) encoded point clouds. First, we study the relationship between the perceptual coding distortion and the texture quantization parameter (TQP) when geometry encoding is lossless. The results indicate that the perceptual coding distortion depends on the texture complexity (TC). Next, we estimate TC using TQP and the texture bitrate per pixel (TBPP), both of which are extracted from the compressed bitstream without resorting to complete decoding. This allows us to build a texture distortion model as a function of TQP and TBPP. By combining this texture distortion model with a geometry distortion model that depends on the geometry quantization parameter (GQP), we obtain an overall no-reference bitstream-layer PCQA model that we call bitstreamPCQ. Experimental results show that the proposed model markedly outperforms existing models in terms of widely used performance criteria, including the Pearson linear correlation coefficient (PLCC), the Spearman rank order correlation coefficient (SRCC) and the root mean square error (RMSE).
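
The three performance criteria named above have standard definitions, shown in the NumPy sketch below. The helper names and the toy scores are illustrative assumptions; the rank computation assumes there are no tied values, and no nonlinear fitting of predictions to subjective scores is applied before computing PLCC.

```python
import numpy as np

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def _ranks(v):
    """Rank positions of the entries of v (assumes no ties)."""
    v = np.asarray(v, dtype=float)
    order = np.argsort(v)
    ranks = np.empty(len(v))
    ranks[order] = np.arange(len(v))
    return ranks

def srcc(x, y):
    """Spearman rank order correlation: Pearson correlation of the ranks."""
    return plcc(_ranks(x), _ranks(y))

def rmse(x, y):
    """Root mean square error between two score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((x - y) ** 2)))

# Toy example (hypothetical scores): monotonic but nonlinear relationship.
predicted = [1.0, 2.0, 3.0, 4.0]
subjective = [1.0, 4.0, 9.0, 16.0]
p = plcc(predicted, subjective)
s = srcc(predicted, subjective)
r = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because the toy relationship is monotonic but not linear, SRCC reaches 1 while PLCC stays below 1, which shows why both criteria are usually reported together.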

Abstract:
This paper presents a strategic approach to tackling trimap-free natural image matting. Specifically, to address the false detection issue of existing trimap-free matting algorithms when the foreground object is not uniquely defined, we design a novel tangled structure (TangleNet) to handle foreground detection and matting prediction simultaneously. TangleNet enables information exchange between foreground segmentation and alpha prediction, producing high-quality alpha mattes for the most salient foreground object based on RGB inputs alone. TangleNet boosts network performance with a frequency-guided attention mechanism utilizing wavelet data. Additionally, we pretrain for salient object detection to aid in the foreground segmentation. Experimental results demonstrate that TangleNet is on par with the state-of-the-art matting methods requiring additional inputs, and outperforms all previous trimap-free algorithms in terms of both qualitative and quantitative results.

Abstract:
By integrating effective features of multi-modality medical images to provide richer information, multi-modality medical image fusion has been widely used in computer-aided diagnosis applications. However, many existing fusion schemes do not consider how to eliminate the effects of the noise in source medical images and cannot provide enough details and textures for disease diagnosis. To address the problems above, we propose a new fidelity-driven optimization (FDO) reconstruction and detail-preserving guided fusion method for multi-modality medical images. To overcome the influence of noise in multi-modality medical images, a rank coefficient optimization method of low-rank approximation based on weighted mean curvature is proposed to reconstruct multi-modality medical images. Moreover, we propose an iterative detail preserving guided fusion (DPGF) method to integrate more textures and detail information of source multi-modality medical images, while ensuring high signal-to-noise ratios. The experimental results show that the proposed method outperforms some of the state-of-the-art fusion methods. Specifically, the extensive experiments show that our method is highly robust to noisy medical images, which also indicates its application prospects in diagnosis.

Abstract:
Developing an objective quality assessment model for user-generated content (UGC) videos is significant for multimedia applications, and is also a challenge due to the diversity of video content and the unpredictability of distortions. To predict the perceived quality, it is necessary to consider the human visual system, in which attention in visual and memory domains is an essential component. With the idea that the stimulus-driven bottom-up mechanism and cognition-driven top-down mechanism work in synergy to generate quality-aware attention, we propose an end-to-end blind video quality assessment (VQA) algorithm based on visual and memory attention modeling. First, a quality-aware visual attention module is established to obtain spatial-temporal attention-guided representations for frame-level quality perception. Specifically, an attention selection and confluence method is developed by circularly integrating the quality-aware attention information into spatial-temporal content features. Then, with the aid of a quality-aware memory attention module, the video-level attention-guided features are inferred through the dimension and attention reshaping of frame-level representations. The video quality is predicted with the guidance of frame-level visual attention and video-level memory attention in an end-to-end structure. Experimental results on five UGC-VQA databases (CVD2014, LIVE-Qualcomm, KoNViD-1k, LIVE-VQC and YouTube-UGC) demonstrate the effectiveness of our modules.

Abstract:
Most commercial players adopt adaptive bitrate (ABR) algorithms to dynamically decide each chunk's bitrate based on the perceived network bandwidth and buffer occupancy. However, current ABR algorithms are agnostic of audio bitrate selection since they assume it has negligible influence on video bitrate selection due to the small size of audio chunks. Nevertheless, with the development of audio technologies, the bitrate of audio content has increased dramatically in recent years. Thus, inappropriate audio selection can significantly affect video selection and deteriorate the viewing experience. To tackle these inefficiencies, we propose a deep Reinforcement learning-based ABR algorithm that takes Audio and Video quality into account (RAV) to avoid a series of suboptimal behaviors, such as low playback quality, frequent playback interruptions, poor playback smoothness, and undesirable combinations of video and audio chunks. Furthermore, RAV trains a neural network model that automatically outputs the bitrates for future audio and video chunks without relying on any presumptions about the environment, achieving good robustness to a broad spectrum of conditions. By conducting trace-driven and real-world experiments, we demonstrate that RAV significantly improves the average overall viewing quality by 37.96%-118.20% over the state-of-the-art ABR algorithms. In addition, we also conduct subjective experiments by inviting 32 volunteers, and 27 of the 32 users strongly agree that RAV provides them a better viewing experience than existing ABR solutions.

Abstract:
Recently, various view synthesis distortion estimation models have been studied to better serve 3-D video coding. However, they can hardly model the relationship quantitatively among different levels of depth changes, texture degeneration, and view synthesis distortion (VSD), which is crucial for rate-distortion optimization and rate allocation. In this paper, an auto-weighted layer representation based view synthesis distortion estimation model is developed. Firstly, sub-VSD (S-VSD) is defined according to the level of depth changes and their associated texture degeneration. After that, a set of theoretical derivations demonstrates that the VSD can be approximately decomposed into the S-VSDs multiplied by their associated weights. To obtain the S-VSDs efficiently, a layer-based representation method is developed, where all the pixels with the same level of depth changes are represented with a layer. It enables the S-VSD calculation at the layer level. Meanwhile, a nonlinear mapping function is learnt to accurately represent the relationship between the VSD and S-VSDs, automatically providing weights for the S-VSDs during VSD estimation. To learn such a function, a dataset of the VSD and its associated S-VSDs is built, termed VSDSet. Experimental results show that the VSD can be accurately estimated with the weights learnt by the nonlinear mapping function once its associated S-VSDs are available. The proposed method outperforms the relevant state-of-the-art methods in both accuracy and efficiency. The VSDSet and source code of the proposed method will be available at https://github.com/jianjin008/.
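
The decomposition stated in this abstract can be written compactly as follows. The symbols $D_i$ (the S-VSD of the $i$-th depth-change level), $w_i$ (its weight) and $N$ (the number of levels, i.e. layers) are notation assumed here for illustration only; the abstract does not give the paper's own notation or the exact form of the learnt mapping that supplies the weights.

```latex
\mathrm{VSD} \;\approx\; \sum_{i=1}^{N} w_i \, D_i
```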

Abstract:
Top-down pose estimation generally employs a person detector and estimates the keypoints of the detected person. This method assumes that only a single person exists within the bounding box cropped by detection. However, this assumption leads to some challenges in practice. First, a loose-fitted bounding box may include certain body parts of a non-target person. Second, spatial interference between several people exists owing to occlusion, so more than a single person can exist in the cropped image. In such scenarios, the pose estimation may falsely predict the keypoints of two or more persons as those of a single person. To tackle these issues, this paper proposes a human body-aware feature extractor based on global- and local-reasoning features. The global-reasoning feature considers the entire body using the transformer's non-local computation property, and the local-reasoning feature concentrates on the individual body parts using convolutional neural networks. With those two features, we extract corrected features by filtering unnecessary features and supplementing necessary features using our proposed novel architecture. Hence, the proposed method can focus on the target person's keypoints, thereby mitigating the aforementioned concerns. Our method achieves noticeable improvement when applied to state-of-the-art top-down pose estimation networks.

Abstract:
Unsupervised domain adaptation, which transfers knowledge from the source domain to the target domain, remains a challenging problem. Previous domain adaptation methods typically minimize the domain discrepancy by using pseudo target labels. Since the pseudo labels can be noisy, this may cause misalignment and unsatisfactory adaptation performance. To address the above challenges, we propose an information maximization adaptation network with label distribution priors. We revisit feature alignment in unsupervised domain adaptation from the perspective of distribution alignment, and find that learning discriminative feature representations requires minimizing the distribution discrepancy and maximizing the source mutual information between the outputs of the classifier and the feature representations. Due to domain shift, maximizing target mutual information directly may align features to incorrect classes. We propose a weighted target mutual information by re-weighting the estimated mutual information via the mean prediction confidence in a mini-batch, which can eliminate the negative impact of inaccurate estimation. In addition, we introduce a regularization term on the label prior distribution to encourage similarity to the real label distribution. Extensive experimental results on three benchmark datasets show that our proposed method can achieve remarkable results compared with previous methods.
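
The re-weighting idea described above can be illustrated with a minimal sketch. It assumes the common batch estimate of mutual information (entropy of the mean prediction minus the mean per-sample entropy) and takes "mean prediction confidence" to be the average maximum class probability in the mini-batch. Both choices, and the function name `weighted_mutual_information`, are assumptions made for illustration, not details confirmed by the abstract.

```python
import numpy as np

def weighted_mutual_information(probs, eps=1e-12):
    """Confidence-weighted mutual information estimate for one mini-batch.

    probs: (batch, classes) array of softmax outputs on target samples.
    """
    # Entropy of the batch-averaged prediction (marginal label distribution).
    marginal = probs.mean(axis=0)
    h_marginal = -np.sum(marginal * np.log(marginal + eps))
    # Average entropy of the individual predictions (conditional entropy).
    h_conditional = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    mutual_information = h_marginal - h_conditional
    # Assumed weighting: mean of the per-sample maximum class probability.
    confidence = probs.max(axis=1).mean()
    return confidence * mutual_information
```

With this weighting, a batch of uncertain predictions contributes less to the objective than a batch of confident ones, which is one plausible reading of how inaccurate estimates are down-weighted.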

Abstract:
Due to their excellent semantics extraction capabilities, deep learning methods have significantly progressed in salient object detection (SOD). However, these methods often require time-consuming pre-training and large training datasets with ground truth. To address these issues, by referring to the framework known as “deep image prior (DIP),” we propose a SOD method called deep label prior network (DLPNet), which consists of an $\mathcal{A}$-stream and a $\mathcal{B}$-stream. The $\mathcal{A}$-stream includes two cascaded UNets and a simple CNN module to extract the initial saliency map, while the $\mathcal{B}$-stream contains only two cascaded UNets and refines the extracted initial saliency map. Unlike most of the current deep learning methods, DLPNet views the SOD task as a conditional image generation problem, relying on only the internal prior of the input itself to generate the saliency map. Hence, our DLPNet does not require pre-training or large annotated / unannotated datasets. Furthermore, we propose a morphology operation scheme, which creates rich pseudo-labels for facilitating the updating of network weights. Extensive experiments demonstrate that our method outperforms state-of-the-art unsupervised techniques and is even comparable to state-of-the-art supervised and weakly supervised methods on different evaluation metrics.

Abstract:
Text-video retrieval is one of the basic tasks for multimodal research and has been widely harnessed in many real-world systems. Most existing approaches directly compare the global representation between videos and text descriptions and utilize the global contrastive loss to train the model. These designs overlook the local alignment and the word-level supervision signal. In this paper, we propose a new framework, called Align and Tell, for text-video retrieval. Compared to the previous work, our framework contains additional modules, i.e., two transformer decoders for local alignment and one captioning head to enhance the representation learning. First, we introduce a set of learnable queries to interact with both textual representations and video representations and project them to a fixed number of local features. After that, local contrastive learning is performed to complement the global comparison. Moreover, we design a video captioning head to provide additional supervision signals during training. This word-level supervision can enhance the visual representation and alleviate the cross-modal gap. The captioning head can be removed during inference and does not introduce extra computational costs. Extensive empirical results demonstrate that our Align and Tell model can achieve state-of-the-art performance on four text-video retrieval datasets, including MSR-VTT, MSVD, LSMDC, and ActivityNet-Captions.
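
For readers unfamiliar with the contrastive objective mentioned above, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss over a batch of paired text and video features. It is not the Align and Tell loss itself, whose exact form the abstract does not give; the function name and the `temperature` value are illustrative assumptions. The same computation could be applied to global features or to the fixed number of local features produced by the learnable queries.

```python
import numpy as np

def symmetric_contrastive_loss(text_feats, video_feats, temperature=0.07):
    """Generic symmetric contrastive loss; row i of each input is a matched pair."""
    # L2-normalise so that the dot product is cosine similarity.
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    logits = t @ v.T / temperature  # (batch, batch) similarity matrix

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row; targets are the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```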

Abstract:
The ever-growing multimedia traffic has underscored the importance of effective multimedia codecs. Among them, the latest lossy video coding standard, Versatile Video Coding (VVC), has been attracting the attention of the video coding community. However, the gain of VVC is achieved at the cost of significant encoding complexity, which creates the need for a fast encoder with comparable Rate Distortion (RD) performance. In this paper, we propose to optimize the VVC complexity at intra-frame prediction, with a two-stage framework of deep feature fusion and probability estimation. At the first stage, we employ a deep convolutional network to extract the spatial-temporal neighboring coding features. Then we fuse all reference features obtained by different convolutional kernels to determine an optimal intra coding depth. At the second stage, we employ a probability-based model and the spatial-temporal coherence to select the candidate partition modes within the optimal coding depth. Finally, these selected depths and partitions are executed whilst unnecessary computations are excluded. Experimental results on a standard database demonstrate the superiority of the proposed method, especially for High Definition (HD) and Ultra-HD (UHD) video sequences.

Abstract:
Recently, convolutional neural networks (CNNs) have provided a favoured prospect for authentically distorted image quality assessment (IQA). For good performance, most existing CNN-based methods rely on a large amount of labeled data for training, which is time-consuming and cumbersome to collect. By simultaneously exploiting a small amount of labeled data and a large amount of unlabeled data, we make a pioneering attempt to propose a semi-supervised framework (termed SSLIQA) with a consistency-preserving dual-branch CNN for authentically distorted IQA in this paper. The proposed SSLIQA introduces a consistency-preserving strategy and transfers two kinds of consistency knowledge from the teacher branch to the student branch. Concretely, SSLIQA utilizes the sample prediction consistency to train the student to mimic output activations of individual examples represented by the teacher. Considering that subjects often refer to previous analogous cases to make scoring decisions, SSLIQA computes the semantic relation among different samples in a batch and encourages the consistency of sample semantic relation between two branches to explore extra quality-related information. Benefiting from the consistency-preserving strategy, we can exploit numerous unlabeled data to improve the network's effectiveness and generalization. Experimental results on three authentically distorted IQA databases show that the proposed SSLIQA is stably effective under different student-teacher combinations and different labeled-to-unlabeled data ratios. In addition, it points to a new way of achieving higher performance with a smaller network.

Abstract:
This paper proposes a dense fusion transformer (DFT) framework to integrate textual, acoustic, and visual information for multimodal affective computing. DFT exploits a modality-shared transformer (MT) module to extract the modality-shared features by modelling unimodal, bimodal, and trimodal interactions jointly. MT constructs a series of dense fusion blocks to fuse utterance-level sequential features of the multiple modalities from the perspectives of low-level and high-level semantics. In particular, MT adopts local and global transformers to learn modality-shared representations by modelling inter- and intra-modality interactions. Furthermore, we devise a modality-specific representation (MR) module with a soft orthogonality constraint to penalize the distance between modality-specific and modality-shared representations, which are fused by a transformer to make affective predictions. Extensive experiments conducted on five public benchmark datasets show that DFT outperforms the state-of-the-art baselines.

Abstract:
The acquisition of densely-sampled light field (LF) images is costly, which hampers the applications of LF imaging technology in 3D reconstruction, digital refocusing, virtual reality, etc. To mitigate this obstacle, various approaches have been proposed to reconstruct densely-sampled LF images from sparsely-sampled ones. However, most existing methods still suffer from the non-Lambertian effect and large disparity issue. In this paper, we embrace the challenges by introducing a new paradigm for LF angular super-resolution (SR), which first explores the multi-scale spatial-angular correlations on the sparse sub-aperture images (SAIs) and then performs angular SR on macro-pixel features. In this way, we propose an efficient LF angular SR network, termed EASR, with simple 3D (2D) CNNs and reshaping operations. The proposed EASR can extract effective feature representations on SAIs and can handle large disparities well by performing angular SR on macro-pixel features. Extensive comparisons with state-of-the-art methods demonstrate that our method achieves superior performance visually and quantitatively. Furthermore, our method achieves efficient angular SR by providing an excellent tradeoff between reconstruction performance and inference time.
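
The "reshaping operations" mentioned above presumably include converting between the sub-aperture image (SAI) layout and the macro-pixel layout of a light field. The sketch below shows this standard rearrangement on a plain array. The axis order `(U, V, H, W)` and the function names are assumptions for illustration; EASR applies such reshaping to feature maps, and its exact tensor layout is not given in the abstract.

```python
import numpy as np

def sai_to_macropixel(lf):
    """Rearrange a light field from SAI layout to macro-pixel layout.

    lf: array of shape (U, V, H, W), i.e. U x V angular views of size H x W.
    Returns an (H*U, W*V) image in which each spatial position (h, w)
    becomes a U x V block holding all angular samples of that position.
    """
    U, V, H, W = lf.shape
    # Interleave angular and spatial axes as (H, U, W, V), then flatten pairs.
    return lf.transpose(2, 0, 3, 1).reshape(H * U, W * V)

def macropixel_to_sai(mp, U, V):
    """Inverse of sai_to_macropixel for a known angular resolution U x V."""
    H, W = mp.shape[0] // U, mp.shape[1] // V
    return mp.reshape(H, U, W, V).transpose(1, 3, 0, 2)
```

In the macro-pixel layout, angular neighbours of the same scene point sit next to each other, so an ordinary 2D convolution over that image mixes angular information.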

Abstract:
Compression artifacts removal methods based on convolutional neural networks have attracted great attention. However, most existing methods require a specifically trained model for each compression quality factor (QF), which is inevitably resource-consuming. Unfortunately, the QF is unknown in most practical applications, so it is intractable to choose a suitable model. In this work, we experimentally analyze the relationship between compression index estimation and compression artifacts removal. Based on the connection between them, we couple compression index estimation with compression artifacts removal into a unified network. A network named CRESNet is proposed, working for a wide range of QFs by integrating channel regulation with an exit strategy. Specifically, CRESNet adopts a multi-stage progressive structure with an exit strategy embedded to automatically select the optimal exit stage according to the estimated compression index reflecting the difficulty of the input sample. Benefiting from the exit strategy, CRESNet removes artifacts from slightly compressed images through a simple process while applying a more elaborate process to severely compressed images. Furthermore, a compression-information-guided channel regulation (CICR) mechanism is developed to adaptively regulate feature maps based on the estimated compression index. CRESNet achieves a more elegant trade-off between artifacts removal and detail preservation in a resource-efficient manner. Experiments demonstrate that CRESNet achieves state-of-the-art performance.

Abstract:
Audio-visual cross-modal matching aims to explore the intrinsic correspondence between face images and audio clips. Existing methods usually focus on the salient features of identities between visual images and voice clips, while neglecting their subtle differences, which are crucial to distinguishing cross-modal samples. To deal with this problem, we propose a novel Dual-enhanced Siamese Adversarial Network (DSANet), which pursues the adversarial dual enhancement to highlight both salient and subtle features for robust audio-visual cross-modal matching. First, we design a dual enhancement mechanism to enhance potential subtle features by randomly selecting a region feature for salient feature suppression, while enhancing salient features in the corresponding region to ensure the global discriminative ability. Second, to establish the correlation of subtle features in the process of eliminating cross-modal heterogeneity, we design a siamese adversarial structure to perform modal heterogeneity elimination for both enhanced salient and subtle features in a parallel manner. Moreover, we propose an adaptive masked cross-entropy loss to force the network to focus on the feature differences among hard classes. Experiments on public benchmark datasets validate the effectiveness of the proposed algorithm.

Abstract:
Commercial motion-capture systems produce excellent in-studio reconstructions, but offer no comparable solution for acquisition in everyday environments. We present a system for acquiring motions almost anywhere. This wearable system gathers ultrasonic time-of-flight and inertial measurements with a set of inexpensive miniature sensors worn on the garment. After recording, the information is combined using an Extended Kalman Filter to reconstruct joint configurations of a body. Experimental results show that even motions that are traditionally difficult to acquire are recorded with ease within their natural settings. Although our prototype does not reliably recover the global transformation, we show that the resulting motions are visually similar to the original ones, and that the combined acoustic and inertial system reduces the drift commonly observed in purely inertial systems. Our final results suggest that this system could become a versatile input device for a variety of augmented-reality applications.

Abstract:
To understand human behaviors, action recognition based on videos is a common approach. Compared with image-based action recognition, videos provide much more information, reducing the ambiguity of actions. In the last decade, many works focusing on datasets, novel models and learning approaches have improved video action recognition to a higher level. However, there are challenges and unsolved problems, in particular in sports analytics, where data collection and labeling are more complex, requiring people with domain knowledge and even sports professionals to annotate data. In addition, the actions can be extremely fast, which makes them difficult to recognize. Moreover, in team sports like football and basketball, one action can involve multiple players, and to recognize it correctly we need to analyze all players, which is relatively complicated. In this paper, we present a survey on video action recognition for sports analytics. We introduce more than ten types of sports, including team sports, such as football, basketball, volleyball and hockey, and individual sports, such as figure skating, gymnastics, table tennis, tennis, diving and badminton. Then we compare numerous existing frameworks for sports analysis to present the status quo of video action recognition in both team sports and individual sports. Finally, we discuss the challenges and unsolved problems in this area and, to facilitate sports analytics, we develop a toolbox using PaddlePaddle, which supports football, basketball, table tennis and figure skating action recognition.

Abstract:
Few-shot semantic segmentation has been proposed to enhance the generalization ability of traditional models with limited data. Previous works mainly focus on supervised tasks, while only a limited amount of work has explored weakly supervised tasks. Weakly supervised semantic segmentation has become an active research area because weakly supervised labels effectively reduce the annotation cost of visual tasks. To this end, we propose a weakly supervised few-shot semantic segmentation model based on the meta learning framework, which utilizes prior knowledge and adjusts itself according to new tasks. Consequently, the proposed network offers both high efficiency and generalization ability to new tasks. In the pseudo mask generation stage, we develop a WRCAM method with the channel-spatial attention mechanism to refine the coverage size of targets in pseudo masks. In the few-shot semantic segmentation stage, the optimization based meta learning method is used to realize few-shot semantic segmentation by virtue of the refined pseudo masks. The experimental results show that the proposed method not only significantly outperforms weakly supervised SOTA methods, but is also comparable to some supervised SOTA methods.

Abstract:
Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a well-labeled source domain to an unlabeled target domain with a related distribution. Numerous existing approaches tackle this hard problem by directly matching the marginal distributions of the two domains, which suffers from rough alignment and blurred decision boundaries. Recent advances in UDA introduce target pseudo-labels and subdomain adaptation to reduce misalignment and distribution discrepancy. However, they frequently ignore that the production of target pseudo-labels depends heavily on the source-trained classifier, with no reasonable restriction to determine whether a generated pseudo-label is confident. Meanwhile, many subdomain alignment metrics ignore the potential distribution discrepancy between same-class samples within a domain. To address these two issues simultaneously, this paper proposes a Cycle Consistency based Pseudo Label and Fine Alignment (CCPLFA) approach for UDA. First, a novel cycle-consistency based pseudo label module is designed, which is a simple yet effective way to alleviate the noise of pseudo labels and improve their semantic correctness. Second, we develop a Fine-Alignment distribution matching metric, which can maximize the intra-class cross-domain feature distribution density without overlooking the global distribution structure. Comprehensive experimental results on four benchmarks demonstrate the plug-and-play capability and the good generalization performance of our proposed method.

Abstract:
For a natural scene with nonuniform environment light, the captured visible images are always under- or over-exposed because of the limited dynamic range of digital imaging devices. Multi-exposure image fusion (MEF) is a mainstream and effective solution. For a local region that has a pleasing visual effect in one exposure setting but is extremely badly exposed in another, most existing MEF methods are able to transfer the scene detail information to the fused images. However, they are inevitably affected by excessively high or low light, resulting in reduced local visibility. To address this issue, we propose an adaptive clarity evaluation-guided network with illumination correction for MEF in a coarse-to-fine manner, which is termed ACE-MEF. Specifically, our ACE-MEF is mainly composed of two modules: a clarity preservation network (CPN) and an illumination adjustment network (IAN). Based on the adaptive clarity evaluation, CPN can be trained to coarsely preserve the environment light and texture details of the clearer regions in source images. Therefore, the need for labeled reference images, which are time-consuming to obtain, is mitigated. By measuring the parameter maps of the gamma function, IAN is able to refine and correct the locally badly exposed regions so that more details can be revealed. Extensive experiments demonstrate that our method outperforms multiple state-of-the-art algorithms qualitatively and quantitatively.
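
To make the gamma-based illumination correction more concrete, here is a minimal sketch of pixel-wise gamma correction driven by a parameter map. It assumes intensities normalised to [0, 1] and one gamma value per pixel; how IAN actually produces and applies its parameter maps is not specified in the abstract, so the function name `apply_gamma_map` and its inputs are hypothetical.

```python
import numpy as np

def apply_gamma_map(image, gamma_map, eps=1e-6):
    """Pixel-wise gamma correction: out = image ** gamma_map.

    image:     array with intensities in [0, 1].
    gamma_map: array of the same shape; values below 1 brighten a pixel,
               values above 1 darken it.
    """
    # Clip to avoid 0 ** gamma edge cases and out-of-range inputs.
    image = np.clip(image, eps, 1.0)
    return np.power(image, gamma_map)
```

A spatially varying map lets under-exposed regions be brightened and over-exposed regions be darkened within the same image, which a single global gamma cannot do.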

Abstract:
Advances in sensors and data processing technologies have enriched the ways of acquiring 3D point clouds, empowering numerous novel applications such as 3D reconstruction in various scenarios. However, hole defects affect the accuracy and fidelity of the acquired point clouds, hindering further development and application of 3D point clouds. To address hole defects in point clouds, a Bayesian hole inpainting algorithm for the half-organized point cloud is proposed, where the point cloud is obtained by a structured-light section system. The algorithm establishes a Bayesian probability model in the hole region, which adopts specific distributions of the point cloud to estimate the maximum likelihood parameter. Simulation and experimental results show that the proposed approach significantly outperforms other competing algorithms in repairing various types of holes, in both objective and subjective quality. In addition, the proposed algorithm has better scalability in the cases of wrong topology definition and self-intersection confusion. This is the first algorithm specially designed for hole inpainting in half-organized point clouds, which maximizes the comprehensive consideration of local features and global optimization, supplemented by targeted prior knowledge of density, Riemannian manifold, and discrete attributes.

Abstract:
Temporal action localization is a challenging task in computer vision that tries to find the start time and the end time of actions and predict their categories. Compared to temporal action localization, weakly supervised temporal action localization (WTAL) is an even more challenging task due to its poor annotations. With only video-level annotation, some background frames that are similar to actions would be classified as actions and produce inaccurate results. In addition, the two-stream fusion problem, ignored previously, also needs to be further considered. To resolve these issues, we propose a novel action saliency and context-aware network (ASCN) for WTAL tasks. Specifically, the temporal saliency and context module is designed to enhance the global saliency and context information of the RGB and flow features to suppress the backgrounds and enhance the actions. In addition, a hybrid attention mechanism using frame differences and two-stream attention is designed to model the local action context information and further enlarge the scores of the potential action regions and suppress the background regions. Finally, to obtain two-stream consistency and solve the fusion problem, we use the similarity loss and a channel self-attention module to adaptively fuse the enhanced RGB and flow features. Extensive experiments demonstrate that ASCN can outperform all of the SOTA WTAL methods on the THUMOS14 and ActivityNet1.3 datasets, reaching an average mAP of 37.2% on THUMOS14 and 26.3% on ActivityNet1.3. On the ActivityNet1.2 dataset, ASCN can also obtain comparable results.

Affiliations: Financial Intelligence and Financial Engineering Key Laboratory of Sichuan Province, Institute of Digital Economy and Interdisciplinary Science Innovation, School of Computer and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu, China; State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; Information Engineering College, Capital Normal University, Beijing, China; School of Electronic Engineering, Dublin City University, Glasnevin, Dublin, Ireland

Abstract:
Providing premium panoramic livecast services to worldwide viewers considering their ultra-high data rate and delay-sensitivity is a significant challenge in the current network delivery environment. Therefore, it is important to design an efficient way of improving viewer quality of experience while conserving bandwidth resources. In this context, this paper introduces a novel cost-efficient federated transmission framework called FedLive and a set of algorithms to support it. First, a gradient-based clustering method is proposed to group the geo-distributed viewers with similar viewing behavior into content delivery alliances by exploiting the geometric properties of the gradient loss. Next, a Reinforced Variational Inference (RVI) structure-based approach is proposed to assist with the collaborative training of the viewer field of view (FoV) prediction model while also accelerating the tile delivery process. A novel prediction-based asynchronous delivery algorithm is designed in which both high-accuracy FoV prediction and efficient live 360° video transmission are achieved in a decentralized manner. FedLive was implemented for testing and its source code is made openly available. Finally, the proposed solution was evaluated against a benchmark and three alternative state-of-the-art solutions using a real-world dataset. The experimental results show that our approach provides the highest prediction accuracy, better service performance, and saves bandwidth when compared with the other solutions.

Abstract:
Numerous CNN-based algorithms have been proposed to reconstruct high-quality face images. However, the inability of convolution operation to model long-distance relationships limits the performance of the CNN-based methods. Moreover, in the high-resolution (HR) image reconstruction stage, with the well decoded feature representations, more efficient architecture design can be explored to synthesize pixel-level image details. In this work, we propose a spatial attention-guided CNN-Transformer aggregation network (SCTANet) for face image super-resolution (FSR) tasks. The core component in the deep feature extraction stage is the Hybrid Attention Aggregation (HAA) block. The HAA block has two parallel paths, one for the Residual Spatial Attention (RSA) block, the other for the Multi-scale Patch embedding and Spatial-attention Masked Transformer (MPSMT) block. The HAA block combines the strengths of CNN and transformer to effectively exploit both local and global information. For the reconstruction stage, we propose to use the Sub-pixel MLP-based Upsampling (SMU) module instead of the conventional CNN architecture. The SMU module promotes the reconstruction of pixel-level image details and reduces computational complexity. Extensive experiments on both synthetic and real-world face datasets demonstrate the superiority of our proposed SCTANet over state-of-the-art methods.

Abstract:
Image privacy protection and management face many challenges, such as privacy disclosure, copyright dispute, and traceability difficulties, with the development of big data. Reversible data hiding in encrypted images (RDHEI) has been widely considered as an effective means to tackle these challenges. In this paper, an RDHEI method based on a time-varying Huffman coding table (TV-HCT) is proposed to improve security, embedding rate (ER) and efficiency. First, the initial HCT is generated according to the prediction errors of an image, which can improve compression performance. Then, the TV-HCT is obtained by scrambling equal-length codewords in the initial HCT using timestamps. This makes the compression coding stream (CCS) of an image time-varying, since the TV-HCT of an image has a large change space. Analysis shows that the average change space of the TV-HCT in UCID is 3.97×10^327, and the average ER on three databases is more than 0.44 bpp higher than that of the existing algorithms. Finally, the CCS is encrypted using the designed index class scrambling method to balance complexity and security. The proposed method not only strengthens the security against brute force attack and differential attack, but also improves the ER and efficiency of the RDHEI technique. Experimental results and performance analysis demonstrate that the proposed algorithm outperforms the state-of-the-art RDHEI algorithms in terms of security, ER and complexity.
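
The step of scrambling equal-length codewords with a timestamp can be sketched as follows. Permuting codewords only among symbols whose codewords have the same length keeps the code prefix-free and leaves the compressed size unchanged, which is presumably why the scrambling is restricted to equal lengths. The seeding scheme (SHA-256 of the timestamp feeding Python's `random.Random`) and the function name are illustrative assumptions, not the paper's construction, and `random` is not a cryptographically secure generator.

```python
import hashlib
import random
from collections import defaultdict

def scramble_equal_length(table, timestamp):
    """Permute codewords among symbols that share the same codeword length.

    table:     dict mapping symbol (e.g. a prediction error) -> codeword string.
    timestamp: any value whose str() form identifies the time instant.
    """
    # Hypothetical seeding: derive a deterministic seed from the timestamp.
    digest = hashlib.sha256(str(timestamp).encode()).digest()
    rng = random.Random(int.from_bytes(digest, "big"))

    # Group symbols by codeword length, in a fixed (sorted) order.
    groups = defaultdict(list)
    for symbol in sorted(table):
        groups[len(table[symbol])].append(symbol)

    scrambled = {}
    for length in sorted(groups):
        symbols = groups[length]
        codewords = [table[s] for s in symbols]
        rng.shuffle(codewords)  # permutation depends on the timestamp seed
        scrambled.update(zip(symbols, codewords))
    return scrambled
```

A decoder that knows the initial table and the timestamp can regenerate the same permutation and therefore the same time-varying table.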

Abstract:
Various structural relations/dependencies exist among human body joints, which makes it possible to estimate 3D poses from 2D sources. The current research on 3D human pose estimation (3D-HPE for short) mainly focuses on structural information from a specific perspective. However, this information cannot facilitate 2D-to-3D pose lifting. This paper presents a novel and efficient multi-layer perceptron with a joint-coordinate gating (MLP-JCG) model, exploring and utilizing both the local and global structural information to perform 3D pose estimation. Specifically, MLP-JCG contains two independent MLP blocks, i.e., joint-mixing MLP and coordinate-mixing MLP, which act solely on the joint and coordinate axes, respectively, to model their local structural information. For the global structural information, we first explore two kinds of global statistics from the pose matrix embeddings, which are referred to as the dynamics aggregated along the joint/coordinate axis. Then, we propose two kinds of gating units to contextualize, in an element-wise manner, the features learned from the MLP blocks. All the model components are designed based on MLP, making the MLP-JCG easy to implement and train. We conduct experiments on three 3D-HPE benchmarks, and the results demonstrate the superior effectiveness and efficiency of the proposed approach.

Abstract:
3D Multi-Object Tracking (MOT) in dynamic point cloud sequences is a fundamental research problem for several downstream tasks such as motion planning and action recognition. Existing methods usually rely on the traditional tracking-by-detection (TBD) paradigm, which performs tracking based on the results produced by dedicated detectors. However, this two-stage framework usually cannot sufficiently exploit spatial-temporal information and end-to-end optimization, leading to sub-optimal tracking performance, especially when the object is partially or completely occluded. In this article, we propose a joint detection and tracking framework named CenterTube for dynamic point cloud sequences. The key to our approach is to formulate the problem of multiple object trajectory prediction as 4D tubelet detection. In particular, the proposed CenterTube is composed of three head branches, including a center branch, a regression branch, and a movement branch for the estimation of object center, object size, instance movement, and frame interval, respectively. Additionally, a Tube BEV-IoU (TB-IoU) is presented to link the generated clip-level tubelets and form the final tracks. Extensive experiments conducted on the KITTI-MOT and nuScenes datasets demonstrate that our model achieves competitive performance even though no ready-made detection results are adopted.

Abstract:
Considering that the human brain always follows a coarse-to-fine (low-to-high spatial frequency) visual processing and fusion mechanism, we propose a coarse-to-fine feedback guidance based stereo image quality assessment (SIQA) network which considers a coarse-to-fine feedback guidance and adaptive dominant eye mechanism. The proposed network consists of two main sub-network streams, each of which has three branches to extract low, middle and high spatial frequency information in parallel. To better realize the guidance of the high-level features in the low spatial frequency branch to the low-level features in the high spatial frequency branch, an information feedback guidance module (IFGM) is proposed, which realizes a top-down guidance mechanism in each sub-network stream. Simultaneously, according to the theory of ocular dominance in human visual system (HVS), we design an adaptive bi-directional parallax-based binocular fusion module (BPBFM), which synthesizes two types of fusion feature by taking the left and right view features as dominant eye input. Furthermore, in order to obtain the better perceptual quality of stereo images, we design a weighted fusion strategy to weigh the quality scores from the two types of fusion features obtained by using an ensemble model with two multi-layer perceptrons (MLPs). The experimental results on four public stereo image datasets show that the proposed method is superior to the mainstream metrics and achieves an excellent performance.

Abstract:
Egocentric vision has gained increasing popularity recently, opening new avenues for human-centric applications. However, although egocentric fisheye cameras allow wide-angle coverage, they introduce image distortion and strong human body self-occlusion, which impose significant challenges in data processing and model reconstruction. Unlike previous work that leverages only synthetic data for model training, this paper presents a new real-world EgoCentric Human Pose (ECHP) dataset. To tackle the difficulty of collecting 3D ground truth using motion capture systems, we simultaneously collect images from a head-mounted egocentric fisheye camera as well as from two third-person-view cameras, circumventing the environmental restrictions. By using self-supervised learning under multi-view constraints, we propose a simple yet effective framework, namely EgoFish3D, for egocentric 3D pose estimation from a single image in different real-world scenarios. The proposed EgoFish3D incorporates three main modules: 1) the third-person-view module takes two exocentric images as input and estimates the 3D pose represented in the third-person camera frame; 2) the egocentric module predicts the 3D pose in the egocentric camera frame; and 3) the interactive module estimates the rotation matrix between the third-person and the egocentric views. Experimental results on our ECHP dataset and existing benchmark datasets demonstrate the effectiveness of the proposed EgoFish3D, which achieves superior performance to existing methods.

Abstract:
Existing methods for few-shot speaker identification (FSSI) obtain high accuracy, but their computational complexities and model sizes need to be reduced for lightweight applications. In this work, we propose an FSSI method using a lightweight prototypical network, with the final goal of implementing FSSI on intelligent terminals with limited resources, such as smart watches and smart speakers. In the proposed prototypical network, an embedding module is designed to perform feature grouping, which reduces the memory requirement and computational complexity, and feature interaction, which enhances the representational ability of the learned speaker embedding. In the proposed embedding module, the audio feature of each speech sample is split into several low-dimensional feature subsets that are transformed by a recurrent convolutional block in parallel. Then, the operations of averaging, addition, concatenation, element-wise summation and statistics pooling are sequentially executed to learn a speaker embedding for each speech sample. The recurrent convolutional block consists of a bidirectional long short-term memory block and a de-redundancy convolution block, in which feature grouping and interaction are also conducted. Our method is compared to baseline methods on three datasets that are selected from three public speech corpora (VoxCeleb1, VoxCeleb2, and LibriSpeech). The results show that our method obtains higher accuracy under several conditions, and has advantages over all baseline methods in computational complexity and model size.

Abstract:
Limited by objectively poor lighting conditions and hardware devices, low-light images with low visual quality and low visibility are inevitable in the real world. Accurate local details and reasonable global information play their essential and distinct roles in low-light image enhancement: local details contribute to fine textures, while global information is critical for a proper understanding of the global brightness level. In this article, we focus on integrating local and global aspects to achieve high-quality low-light image enhancement by proposing the synchronous multi-scale low-light enhancement network (SMNet). A synchronous multi-scale representation learning structure and a global feature recalibration module are adopted in SMNet. Different from the traditional multi-scale feature learning architecture, SMNet carries out the multi-scale representation learning in a synchronous way: we first calculate the rough contextual representations in a top-down manner and then learn multi-scale representations in a bottom-up way to generate representations with rich local details. To acquire global brightness information, a global feature recalibration module (GFRM) is applied after the synchronous multi-scale representations to perceive and exploit proper global information by global pooling and projection to recalibrate channel weights globally. The synchronous multi-scale representation and GFRM compose the basic local-and-global block. Experimental results on mainstream real-world dataset LOL and synthetic dataset MIT-Adobe FiveK show that the proposed SMNet not only leads the way on objective metrics (0.41/2.31 improvement of PSNR on two datasets) but is also superior in subjective comparisons compared with typical SoTA methods.
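
The idea of recalibrating channel weights from globally pooled statistics can be sketched as follows. This is a generic squeeze-and-scale illustration, assuming global average pooling, a single linear projection, and a sigmoid; the actual GFRM design is not specified in the abstract, and the function names are hypothetical.

```python
import math


def global_average_pool(feature_maps):
    """Return one mean value per channel; feature_maps is indexed [C][H][W]."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def recalibrate(feature_maps, weight, bias):
    """Scale each channel by sigmoid(W @ pooled + b) (assumed projection)."""
    pooled = global_average_pool(feature_maps)
    logits = [
        sum(w * p for w, p in zip(w_row, pooled)) + b
        for w_row, b in zip(weight, bias)
    ]
    scales = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    return [
        [[value * s for value in row] for row in channel]
        for channel, s in zip(feature_maps, scales)
    ]


# Two 2x2 channels: one dark, one bright (toy values).
features = [[[0.1, 0.1], [0.1, 0.1]], [[0.9, 0.9], [0.9, 0.9]]]
weight = [[0.0, 0.0], [0.0, 0.0]]  # zero projection, so every scale is 0.5
bias = [0.0, 0.0]
out = recalibrate(features, weight, bias)
```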

Affiliations: Key Laboratory of Complex Systems Modeling and Simulation, School of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou, China; Zhejiang International Studies University, Hangzhou, China; Department of Medical Oncology, Sir Run Run Shaw Hospital, College of Medicine, Zhejiang University, Hangzhou, China; Regional Medical Center for National Institute of Respiratory Diseases, Sir Run Run Shaw Hospital, College of Medicine, Zhejiang University, Hangzhou, China; School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, China

Abstract:
Medical image report generation (MeIRG) aims at generating associated diagnosis descriptions with natural language sentences from medical images, which is essential in computer-aided diagnosis systems. Nevertheless, this task remains challenging because medical images and linguistic expressions must be understood jointly, yet they show great discrepancies in modality. To fill this visual-to-semantic gap, we propose a novel framework that follows the encoder-decoder pipeline. Our framework is characterized by encoding both deep visual and semantic embeddings through a triple-branch network (TriNet) during the encoding phase. The visual attention branch captures attended visual embeddings from medical images with the soft-attention mechanism. The medical report (MeRP) embedding branch predicts semantic report embeddings. The embedding branch of medical subject headings (MeSH) obtains semantic embeddings of related medical tags as complementary information. Then, the outputs of these branches are fused and fed into a decoder for report generation. Experimental results on two benchmark datasets demonstrate the excellent performance of our method. Related code is available at https://github.com/yangyan22/Medical-Report-Generation-TriNet.

Abstract:
Watching 360° videos using Virtual Reality (VR) head-mounted displays (HMDs) provides interactive and immersive experiences, where videos can evoke different emotions. Existing emotion self-report techniques within VR however are either retrospective or interrupt the immersive experience. To address this, we introduce the Continuous Physiological and Behavioral Emotion Annotation Dataset for 360° Videos (CEAP-360VR). We conducted a controlled study (N=32) where participants used a Vive Pro Eye HMD to watch eight validated affective 360° video clips, and annotated their valence and arousal (V-A) continuously. We collected (a) behavioral (head and eye movements; pupillometry) signals; (b) physiological (heart rate, skin temperature, electrodermal activity) responses; (c) momentary emotion self-reports; (d) within-VR discrete emotion ratings; and (e) motion sickness, presence, and workload. We show the consistency of continuous annotation trajectories and verify their mean V-A annotations. We find high consistency between viewed 360° video regions across subjects, with higher consistency for eye than head movements. We furthermore run baseline classification experiments, where Random Forest classifiers with 2s segments show good accuracies for subject-independent models: 66.80% (V) and 64.26% (A) for binary classification; 49.92% (V) and 52.20% (A) for 3-class classification. Our open dataset allows further experiments with continuous emotion self-reports collected in 360° VR environments, which can enable automatic assessment of immersive Quality of Experience (QoE) and momentary affective states.

Abstract:
In this paper, we introduce a novel 6-D representation of plenoptic point clouds, enabling joint, non-separable transform coding of plenoptic signals defined along both spatial and angular (viewpoint) dimensions. This 6-D representation, which is built in a global coordinate system, can be used in both multi-camera studio capture and video fly-by capture scenarios, with various viewpoint (camera) arrangements and densities. We show that both the Region-Adaptive Hierarchical Transform (RAHT) and the Graph Fourier Transform (GFT) can be extended to the proposed 6-D representation to enable the non-separable transform coding. Our method is applicable to plenoptic data with either dense or sparse sets of viewpoints, and to complete or incomplete plenoptic data, while the state-of-the-art RAHT-KLT method, which is separable in spatial and angular dimensions, is applicable only to complete plenoptic data. The “complete” plenoptic data refers to data that has, for each spatial point, one colour for every viewpoint (ignoring any occlusions), while “incomplete” data has colours only for the visible surface points at each viewpoint. We demonstrate that the proposed 6-D RAHT and 6-D GFT compression methods are able to outperform the state-of-the-art RAHT-KLT method on 3-D objects with various levels of surface specularity, and captured with different camera arrangements and different degrees of viewpoint sparsity.

Abstract:
In recent years, Light Field (LF) video has grabbed much attention as an emerging form of immersive media. LF collects, through a lens matrix, light information emanating in every direction, and obtains rich information about the scene, providing users with an immersive 6 Degrees of Freedom (DoF) experience. The visual content between different viewpoints is highly homogenized, suggesting the possibility of good compression and encoding. However, most fixed-structure LF coding schemes are difficult to adapt to the real-time requirements of different LF applications and best-effort network conditions causing packet loss. In this paper, we propose a dynamic adaptive LF video transmission scheme that can achieve high compression and yet provide near-distortion-free LF video when the network condition is stable. Additionally, for unstable network conditions a description scheduling algorithm is proposed, which can decode the LF video with the highest possible quality even if partial data cannot be received completely and/or timely. We achieve this by designing a Multiple Description Coding (MDC) based solution to transport the LF video compressed by a Graph Neural Network (GNN) model. Experimental results show that the scheduling algorithm can improve the quality of the decoding results by 3% to 15%. Compared with other similar schemes, our system greatly improves the reliability of the video streaming system against packet loss/error and supports heterogeneous receivers.

Abstract:
An image-based virtual try-on system transfers an in-shop garment to the corresponding garment region of a reference person, which has huge application potential and commercial value in online clothing shopping. Existing methods have difficulty preserving garment texture and body details because of rough garment alignment and imperfect detail-retention strategies. To address this problem, we propose a virtual try-on network based on semantic constraints and flow alignment. The key ideas of the framework are as follows: 1) a global-local semantic predictor (GLSP) is proposed to generate a reasonable target semantic map, which clearly guides the correct alignment of the in-shop garment with the body and the generation of the try-on result; 2) a novel appearance flow-based garment alignment network (AFGAN) is proposed to align the in-shop garment with the body, which is important to preserve maximum garment detail and ensure natural and realistic warping; and 3) a synthesis strategy is proposed to integrate the aligned garment and the human body, preserving maximum body detail to generate a realistic result and preventing cross-occlusion and pixel confusion between different body parts. Experiments on the existing benchmark dataset demonstrate that the proposed method achieves the best qualitative and quantitative performance among state-of-the-art virtual try-on techniques.

Abstract:
Recognition of emotions in user-generated videos has attracted considerable research attention. Most existing approaches focus on learning frame-level features and fail to consider frame-level emotion intensities which are critical for video representation. In this research, we aim to extract frame-level features and emotion intensities through transferring emotional information from an image emotion dataset. To achieve this goal, we propose an end-to-end network for joint emotion recognition and intensity learning with unsupervised adversarial adaptation. The proposed network consists of a classification stream, an intensity learning stream and an adversarial adaptation module. The classification stream is used to generate pseudo intensity maps with the class activation mapping method to train the intensity learning subnetwork. The intensity learning stream is built upon an improved feature pyramid network in which features from different scales are cross-connected. The adversarial adaptation module is employed to reduce the domain difference between the source dataset and target video frames. By aligning cross domain features, we enable our network to learn on the source data while generalizing to video frames. Finally, we apply a weighted sum pooling method to frame-level features and emotion intensities to generate video-level features. We evaluate the proposed method on two benchmark datasets, i.e., VideoEmotion-8 and Ekman-6. The experimental results show that the proposed method achieves improved performance compared to previous state-of-the-art methods.
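
The final pooling step can be sketched as an intensity-weighted average of frame-level features. The normalization by the sum of intensities and the function name `weighted_sum_pool` are assumptions made for illustration; the abstract states only that a weighted sum pooling over frame-level features and emotion intensities produces the video-level feature.

```python
def weighted_sum_pool(frame_features, intensities):
    """Combine frame-level feature vectors into one video-level vector.

    frame_features: list of equal-length feature vectors, one per frame.
    intensities: one non-negative emotion intensity per frame.
    """
    total = sum(intensities)
    if total <= 0:
        raise ValueError("at least one frame must have positive intensity")
    dim = len(frame_features[0])
    # Each dimension is the intensity-weighted mean over frames.
    return [
        sum(w * f[d] for w, f in zip(intensities, frame_features)) / total
        for d in range(dim)
    ]


# Three frames with 2-D features; the last frame is the most intense.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
video_feature = weighted_sum_pool(frames, [1.0, 1.0, 2.0])
```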

Abstract:
Quantizing the floating-point weights and activations of deep convolutional neural networks to fixed-point representation yields reduced memory footprints and inference time. Recently, efforts have been made towards zero-shot quantization that does not require original unlabelled training samples of a given task. The best published works rely heavily on the learned batch normalization (BN) parameters to infer the range of the activations for quantization. In particular, these methods are built upon either an empirical estimation framework or the data distillation approach for computing the range of the activations. However, the performance of such schemes severely degrades when presented with a network that does not contain BN layers. In this line of thought, we propose a generalized zero-shot quantization (GZSQ) framework that neither requires original data nor relies on BN layer statistics. We utilize the data distillation approach and leverage only the pre-trained weights of the model to estimate enriched data for range calibration of the activations. To the best of our knowledge, this is the first work that utilizes the distribution of the pre-trained weights to assist the process of zero-shot quantization. The proposed scheme significantly outperforms the existing zero-shot works, e.g., an improvement of ~33% in classification accuracy for MobileNetV2 and several other models with and without BN layers, for a variety of tasks. We also demonstrate the efficacy of the proposed work across multiple open-source quantization frameworks. Importantly, our work is the first attempt towards the post-training zero-shot quantization of futuristic unnormalized deep neural networks.
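
To show why the activation range matters, the sketch below applies standard uniform affine quantization given a calibrated range `[lo, hi]`. It does not reproduce the GZSQ data distillation procedure; it only illustrates how a range, however it is estimated, maps floating-point values to fixed-point integers. The function names and the example range are hypothetical.

```python
def quantize(values, lo, hi, bits=8):
    """Map floats in [lo, hi] to integers in [0, 2**bits - 1], clipping outliers."""
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale


def dequantize(codes, lo, scale):
    """Recover approximate floats from integer codes."""
    return [lo + c * scale for c in codes]


# Activations with one outlier beyond the calibrated range [-1, 1].
activations = [-1.0, 0.3, 1.0, 5.0]
codes, scale = quantize(activations, lo=-1.0, hi=1.0)
restored = dequantize(codes, lo=-1.0, scale=scale)
```

Values inside the range are recovered to within half a quantization step, while the outlier is clipped to the range boundary. A range that is too narrow clips useful activations, and a range that is too wide wastes integer levels, which is why range calibration without data is the central difficulty.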

Abstract:
Recently, crowd counting under occlusion and complex perspective has become a hot but difficult topic. Existing methods mainly construct counters under a parallel perspective, but they fail to achieve good accuracy under complex perspectives affected by height differences and heavy occlusions. To alleviate these problems, this work proposes a novel framework, NOOMP (Need Only One More Point), for perspective-adaptive crowd counting in complex natural scenes. First, common scenes in daily life usually contain height differences, which bring complex perspectives to crowd counting. Therefore, a new labeling method, Absolute-geometry Gaussian Generation, is proposed, which needs only one more point for each person in an image and achieves better accuracy. Second, the NOOMP framework adopts a meta-learning structure and trains the counting model in a few-shot manner, which implements perspective adaptation effectively and addresses the problem of high labeling cost. Third, to fit the characteristics of few-shot learning, this work proposes a new Multi-head Parallel Network (MPNet) for NOOMP. Crowd features are extracted by MPNet, a hybrid structure composed of a shallow network and a deep network. This network effectively preserves the features of both the shallow and the deep networks, which makes MPNet perform well in NOOMP. In addition, this work collects a new dataset for NOOMP, named Multiple Height Differences in Mall (MHDM), which contains images of different views and height differences from shopping malls and supermarkets. Experiments on MHDM and other benchmarks show that NOOMP achieves good accuracy and handles perspective changes well.

Abstract:
Head pose estimation is an important step for many human-computer interaction applications such as face detection, facial recognition, and facial expression classification. Accurate head pose estimation benefits these applications that require face images as the input. Most head pose estimation methods suffer from perspective distortion because the users do not always align their face perfectly with the camera. This paper presents a new approach that uses image rectification to reduce the negative effect of perspective distortion and a lightweight convolutional neural network to obtain highly accurate head pose estimations. The proposed method calculates the angle between the optical axis of the camera and the projection vector of the center of the face. The face image is rectified using this estimated angle through perspective transformation. A lightweight network that is only 0.88 MB in size is designed to take the rectified face image as the input to perform head pose estimation. The output of the network, the head pose estimation of the rectified face image, is transformed back to the camera coordinate system as the final head pose estimation. Experiments on public benchmark datasets show that the proposed image rectification method and the newly designed lightweight network improve the accuracy of head pose estimation remarkably. Compared with state-of-the-art methods, our approach achieves both higher accuracy and faster processing speed.
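
The angle between the optical axis and the ray through the face center can be sketched under a pinhole camera model. The intrinsics (`fx`, `fy`, `cx`, `cy`), the function name `offset_angles`, and the split into horizontal and vertical components are assumptions made for this example; the abstract does not state the camera model or angle convention used.

```python
import math


def offset_angles(u, v, fx, fy, cx, cy):
    """Angles (radians) between the optical axis and the ray through pixel (u, v).

    Assumes an ideal pinhole camera with focal lengths fx, fy and
    principal point (cx, cy), all in pixels.
    """
    # Back-project the pixel to a ray direction (x, y, 1) in the camera frame.
    x = (u - cx) / fx
    y = (v - cy) / fy
    horizontal = math.atan2(x, 1.0)
    vertical = math.atan2(y, 1.0)
    total = math.atan2(math.hypot(x, y), 1.0)
    return horizontal, vertical, total


# Face centered on the principal point: no offset from the optical axis.
centered = offset_angles(960.0, 540.0, fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
# Face one focal length to the right of the principal point: 45 degrees off-axis.
shifted = offset_angles(1960.0, 540.0, fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
```

An angle of this kind would then parameterize the perspective transformation that rectifies the face image, and its inverse would map the predicted pose back to the camera coordinate system.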

Abstract:
Referring Expression Comprehension (REC) and Generation (REG) have become two of the most important tasks in visual reasoning, since they are essential steps for many vision-and-language tasks such as visual question answering and visual dialogue. However, they have not been widely used in downstream tasks, mainly for the following reasons: 1) mainstream two-stage methods rely on additional annotations or off-the-shelf detectors to generate proposals, which heavily degrades the generalization ability of models and leads to inevitable error accumulation; 2) although one-stage strategies for REC have been proposed, these methods depend on many hyper-parameters (such as anchors) to generate bounding boxes. In this paper, we present a proposal-free one-stage (PFOS) framework that can directly regress the region of interest from the image or generate unambiguous descriptions in an end-to-end manner. Instead of using the dominant two-stage fashion, we take the dense grid of images as input for a cross-attention transformer that learns multi-modal correspondences. The final bounding box or sentence is directly predicted from the image without anchor selection or the computation of visual difference. Furthermore, we extend the traditional two-stage listener-speaker framework to joint training under a one-stage learning paradigm. Our model achieves state-of-the-art performance in both accuracy and speed for comprehension, and competitive results for generation.

Abstract:
Streaming of live 360-degree video allows users to follow a live event from any viewpoint and has already been deployed on some commercial platforms. However, current systems can only stream the video at relatively low quality because the entire 360-degree video is delivered to the users under limited bandwidth. Streaming only the video falling into the user's field of view (FoV) can improve the bandwidth efficiency of 360-degree video delivery. In this paper, we propose to use the idea of “flocking” to simultaneously improve the accuracy of user FoV prediction and video delivery efficiency for live 360-degree video streaming. By assigning variable playback latencies to users in a streaming session based on their network conditions, a “streaming flock” is formed and led by “strong” users with low playback latencies in the front of the flock. We propose a long short-term memory (LSTM) based collaborative FoV prediction scheme where the FoV traces of users in the front of the flock are utilized to predict the FoV of users behind them. Given a predicted FoV, we develop an optimal rate allocation strategy to maximize the perceptual quality. By conducting experiments using real-world user FoV traces and LTE/5G network bandwidth traces, we evaluate the gains of the proposed strategies over several benchmarks. Our experimental results demonstrate that the proposed streaming system can increase the overall quality dramatically, by about 10 dB compared with a heuristic FoV prediction strategy. In addition, the network-aware flocking formation can further reduce video freezes without influencing video quality.

Abstract:
The problem of video-text retrieval, which searches videos via natural language descriptions or vice versa, has attracted growing attention due to the explosive scale of videos produced every day. The dominant approaches to this problem follow a pipeline that first learns compact feature representations of videos and texts, and then jointly embeds them into a common feature space where matched video-text pairs are close and unmatched pairs are far apart. However, most of them neither consider the structural similarities among cross-modal samples in a global view, nor leverage useful information from other relevant retrieval processes. We argue that both kinds of information have great potential for video-text retrieval. In this paper, we treat the relevant retrieval processes as auxiliary tasks and extract useful knowledge from them by exploiting structural similarities via Graph Neural Networks (GNNs). We then progressively transfer the knowledge from auxiliary tasks in a general-to-specific manner to assist the main task of the current retrieval process. Specifically, for the retrieval of a given query, we first construct a sequence of query-graphs whose central queries are chosen from distant to close to the given query. Then we conduct knowledge-guided message passing in each query-graph to exploit regional structural similarities and gather knowledge of different levels from the updated query-graphs with a knowledge-based attention mechanism. Finally, we transfer the extracted useful knowledge from general to specific to assist the current retrieval process. Extensive experimental results show that our model outperforms state-of-the-art methods on four benchmarks.

Abstract:
Image and text are dual modalities of our semantic interpretation. Changing images based on text descriptions allows us to imagine and visualize the world (a.k.a. text-based image manipulation (TIM)). In this paper, we introduce a framework that combines TIM with change captioning (CC) and utilizes the benefits of co-training. CC aims to describe what has changed in a scene and can be regarded as the inverse version of TIM where both tasks rely on generative networks. These generative networks can be regarded as data producers of each other and unlike previous methods, we discover that integrating their learning procedures can benefit both. Since the CC module describes differences between two images as text, the CC module can be used as evaluation criteria and provide feedback. Furthermore, we utilize a shared attention mechanism in TIM and CC modules to localize towards prominent regions as well as enabling a change-aware discriminator. In the opposite direction, the output image synthesized by the TIM module can be assessed with the CC module, by checking whether the ground truth text description can be redescribed. Following this insight, not only do we boost the training of the TIM module, but we also utilize the TIM module as additional supervision for the CC training. Experimental results show that our framework outperforms existing TIM methods on several datasets substantially and we achieve marginal improvements in the CC module. To our best knowledge, this is the first study dedicated to the joint training of TIM and CC tasks.

Abstract:
Multimodal abstractive summarization for videos is an emerging task that aims to generate a summary from multi-source information (i.e., video, audio transcript). The challenge is how to merge multimodal long sequences to capture rich semantic information without allowing possible noise from either lengthy modal sequence to degrade the other modality and thus hurt the entire model. To address the issues, we propose a multistage fusion network with forget gate (MFFG), which selectively integrates multi-source information through the cross-fusion in encoding and hierarchical fusion in decoding between modalities, and design a fusion forget gate module to suppress the potential multimodal noise flow of multi-source long sequence. Meanwhile, considering that the source text in this task is lengthy and has the same distribution as the output summary text, we inherit the partial structure of the MFFG model and again propose its variant, single-stage fusion network with forget gate (SFFG), which simplifies the fusion schema, and leverages the long source text to enhance the representation of the target summary. Experimental results on How2 dataset and How2-300 dataset demonstrate the superiority of the two multimodal fusion methods. Further, we provide a version of ASR transcription data of How2 dataset to evaluate model performance under noisy scenarios, and experimental results show obvious advantages of our proposed models over prior systems.

Abstract:
In this study, we investigate language-level video object segmentation, where first-frame language annotation is used to describe the target object. Because a language label is typically compatible with all frames in a video, the proposed method can choose the most suitable starting frame to mitigate initialization failure. Apart from extracting the visual feature from a static video frame, a motion-language score based on optical flow is also proposed to describe moving objects more accurately. Scores of multiple standards are then aggregated using an attention-based mechanism to predict the final result. The proposed method is evaluated on four widely-used video object segmentation datasets, including the DAVIS 2017, DAVIS 2016, SegTrack V2 and YouTubeObject datasets, and a novel accuracy measured as mean region similarity is obtained on both the DAVIS 2017 (67.2%) and DAVIS 2016 (83.5%) datasets. The code will be published.

Abstract:
Scene parsing is a fundamental task in computer vision. Various RGB-D (color and depth) scene parsing methods based on fully convolutional networks have achieved excellent performance. However, color and depth information are different in nature, and existing methods cannot optimize the cooperation of high-level and low-level information when aggregating modal information, which introduces noise or loses key information in the aggregated features and generates inaccurate segmentation maps. Moreover, the features extracted from the depth branch are weak because of the low quality of the depth map, which results in unsatisfactory feature representation. To address these drawbacks, we propose a progressive guided fusion and depth enhancement network (PGDENet) for RGB-D indoor scene parsing. First, high-quality RGB images are used to improve depth data through a depth enhancement module, in which the depth maps are strengthened in terms of channel and spatial correlations. Then, we integrate information from the RGB and enhanced depth modalities using a progressive complementary fusion module, in which we start with high-level semantic information and move down layerwise to guide the fusion of adjacent layers while reducing hierarchy-based differences. Extensive experiments are conducted on two public indoor scene datasets, and the results show that the proposed PGDENet outperforms state-of-the-art methods in RGB-D scene parsing.

Abstract:
In today's Internet, bandwidth dynamics are inevitable, and hence, the bitrate for live streaming applications should also be dynamically adjusted. However, in existing HTTP-based adaptive streaming (HAS), bitrate switching can only be performed at segment boundaries, making decisions unresponsive and often inaccurate. In this paper, we start from a close investigation on the impact of the segment length in HAS and accordingly present VHAS, an extension towards intelligent variable-length segmentation, which makes client-side decisions based on the massive amount of real-time information from the network and viewers. VHAS implements a smart trigger mechanism that balances accuracy and overhead for variable-length segmentation. We further develop an adaptive bitrate switching algorithm with data-driven I-frame prediction, which is tailored to individual viewers to minimize bitrate mismatches. We evaluate VHAS via extensive trace-driven simulations, and our results demonstrate that compared with state-of-the-art solutions, VHAS achieves 15%–49% gains in QoE, with a noticeable bandwidth reduction of 37%–57%.

Abstract:
Estimating 3D human body shapes and poses from videos is a challenging computer vision task. The intrinsic temporal information embedded in adjacent frames is helpful in making accurate estimations. Existing approaches learn temporal features of the target frames simply by aggregating features of their adjacent frames, using off-the-shelf deep neural networks. Consequently these approaches cannot explicitly and effectively use the correlations between adjacent frames to help infer the parameters of the target frames. In this paper, we propose a novel framework that can measure the correlations amongst adjacent frames in the form of an estimated confidence metric. The confidence value will indicate to what extent the adjacent frames can help predict the target frames’ 3D shapes and poses. Based on the estimated confidence values, temporally aggregated features are then obtained by adaptively allocating different weights to the temporal predicted features from the adjacent frames. The final 3D shapes and poses are estimated by regressing from the temporally aggregated features. Experimental results on three benchmark datasets show that the proposed method outperforms state-of-the-art approaches (even without the motion priors involved in training). In particular, the proposed method is more robust against corrupted frames.
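
The confidence-weighted aggregation described above can be illustrated with a toy example. This is a minimal sketch, assuming that per-frame confidence values are turned into weights with a softmax; the abstract only states that weights are allocated adaptively, so the softmax and the function name are assumptions rather than the paper's design.

```python
import math


def aggregate_features(features, confidences):
    """Fuse per-frame feature vectors with weights derived from confidence values.

    features:    list of equal-length feature vectors, one per adjacent frame
    confidences: list of scalar confidence values, one per adjacent frame
    The softmax weighting is an assumption made for this sketch.
    """
    peak = max(confidences)
    exps = [math.exp(c - peak) for c in confidences]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
    return fused, weights
```

With this weighting, a frame with a much lower confidence (for example, a corrupted frame) contributes almost nothing to the fused feature, which matches the robustness argument made in the abstract.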

Abstract:
Efficiently compressing color images in the encrypted domain remains a major challenge. In this paper, we present a novel deep-learning-based approach to encryption-then-lossy-compression (ETC) of color images by incorporating the domain knowledge of the encrypted image reconstruction process. Specifically, a simple yet effective uniform down-sampling is utilized for lossy compression of images encrypted with a modulo-256 addition, and the task of image reconstruction from an encrypted down-sampled image is then formulated as a problem of constrained super-resolution (SR) reconstruction. A customized residual dense spatial network (RDSN) is proposed to solve the formulated constrained SR task by taking advantage of a spatial attention mechanism (SAM), a global skip connection (GSC), and a uniform down-sampling constraint (UDC) that is specific to an ETC system. Extensive experimental results show that the proposed ETC scheme achieves significant performance improvement compared with other state-of-the-art ETC methods, indicating the feasibility and effectiveness of the proposed deep-learning-based ETC scheme.
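
The encryption and compression steps named above are simple enough to sketch. The example below assumes a single 8-bit channel stored as nested lists and a key stream drawn from a seeded pseudo-random generator; the helper names, the seed-based key stream, and the single-channel simplification are assumptions of this sketch. The learned reconstruction network (RDSN) is not shown.

```python
import random


def keystream(height, width, seed):
    """Hypothetical key stream: one pseudo-random byte per pixel."""
    rng = random.Random(seed)
    return [[rng.randrange(256) for _ in range(width)] for _ in range(height)]


def encrypt(img, key):
    """Modulo-256 addition of the key stream to each pixel."""
    return [[(p + k) % 256 for p, k in zip(row_p, row_k)]
            for row_p, row_k in zip(img, key)]


def decrypt(enc, key):
    """Inverse of encrypt: modulo-256 subtraction of the key stream."""
    return [[(c - k) % 256 for c, k in zip(row_c, row_k)]
            for row_c, row_k in zip(enc, key)]


def downsample(img, factor):
    """Uniform down-sampling: keep every factor-th row and column."""
    return [row[::factor] for row in img[::factor]]
```

Because both the modulo-256 addition and the uniform down-sampling act pixel by pixel, decrypting the down-sampled ciphertext with the identically down-sampled key yields exactly the down-sampled plaintext. Under the assumptions of this sketch, that leaves the receiver with a super-resolution problem whose retained pixels are known exactly.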

Abstract:
How to recommend outfits has gained considerable attention in both academia and industry in recent years. Many studies have been carried out regarding fashion compatibility learning, to determine whether the fashion items in an outfit are compatible or not. These methods mainly focus on evaluating the compatibility of existing outfits and rarely consider applying such knowledge to ‘design’ new fashion items. We propose the new task of generating complementary and compatible fashion items based on an arbitrary number of given fashion items. In particular, given some fashion items that can make up an outfit, the aim of this paper is to synthesize photo-realistic images of other, complementary, fashion items that are compatible with the given ones. To achieve this, we propose an outfit generation framework, referred to as COutfitGAN, which includes a pyramid style extractor, an outfit generator, a UNet-based real/fake discriminator, and a collocation discriminator. To train and evaluate this framework, we collected a large-scale fashion outfit dataset with over 200 K outfits and 800 K fashion items from the Internet. Extensive experiments show that COutfitGAN outperforms other baselines in terms of similarity, authenticity, and compatibility measurements.

Abstract:
Filter pruning is a technique that reduces computational complexity, inference time, and memory footprint by removing unnecessary filters in convolutional neural networks (CNNs) with an acceptable drop in accuracy, consequently accelerating the network. Unlike traditional filter pruning methods utilizing zeroing-out filters, we propose two techniques to achieve the effect of pruning more filters with less performance degradation, inspired by the existing research on centripetal stochastic gradient descent (C-SGD), wherein the filters are removed only when the ones that need to be pruned have the same value. First, to minimize the negative effect of centripetal vectors that gradually make filters come closer to each other, we redesign the vectors by considering the effect of each vector on the loss-function using the Taylor-based method. Second, we propose an adaptive gradient learning (AGL) technique that updates weights while adaptively changing the gradients. Through AGL, performance degradation can be mitigated because some gradients maintain their original direction, and AGL also minimizes the accuracy loss by perfectly converging the filters, which require pruning, to a single point. Finally, we demonstrate the superiority of the proposed method on various datasets and networks. In particular, on the ILSVRC-2012 dataset, our method removed 52.09% FLOPs with a negligible 0.15% top-1 accuracy drop on ResNet-50. As a result, we achieve the most outstanding performance compared to those reported in previous studies in terms of the trade-off between accuracy and computational complexity.

Abstract:
Recent years have witnessed the explosion of virtual reality (VR) videos and applications. This new form of media grants us the precious freedom we never had before to look in any direction within the video content. With such privilege, our desires for higher video resolution and frame rate, better visual quality, and a smoother watching experience have remarkably risen. However, VR videos occupy an astronomical amount of data, which poses unprecedented challenges to the computation efficiency, deployment cost, and user experience of the system. In this paper, we propose RealVR, an end-to-end tile-based VR video system that faces and tackles such challenges. With all essential procedures from the initial capturing to the ultimate rendering included, the system is designed, implemented, and configured specifically to achieve the best balance among efficiency, economy, and quality-of-experience (QoE). It leverages the promising international VR standard, MPEG OMAF, to process and deliver 8K 60 fps VR video, but is also improved for much better performance and user experience than the pristine standard, especially for reducing the motion-to-high-quality (MtHQ) latency. It does not rely on additional expensive edge servers to offer an immersive user experience, which makes it more economical than many other works. Through extensive experiments, it is shown that RealVR can significantly improve the MtHQ experience and save bandwidth consumption without compromising on encoding efficiency or application cost, maximizing the user’s freedom.

Abstract:
In this paper, a general reversible data hiding (RDH) framework for joint photographic experts group (JPEG) images with multiple two dimensional histograms (2DHs) is proposed. Regardless of whether zero alternating current (AC) coefficients are included to join data embedding or only non-zero AC coefficients are applied, the performance in terms of visual quality and file size increment is improved by using the proposed framework. This framework is mainly composed of the following three parts: histogram generation, adaptive 2DH mapping selection, and improved discrete particle swarm optimization (IDPSO). Unlike existing 2DH-based JPEG RDH methods, in which a uniform threshold is utilized to construct multiple histograms, in histogram generation, thresholds for different histograms are adaptively assigned according to the local properties of histogram coefficients. As a result, as many coefficients in complex regions as possible are excluded from the construction of each histogram. We subtly design multiple 2DH mappings, and adaptively select 2DH mappings for different 2DHs based on their distribution characteristics. Through slight adjustments, each 2DH mapping can be employed in cases where either zero AC coefficients or only non-zero AC coefficients are used for data embedding. Adaptive threshold and 2DH mapping selection provide a better image quality at a given embedding capacity but inevitably cause considerable complexity cost. To significantly reduce the computational cost, we propose IDPSO by combining differential evolution. IDPSO has the advantages of rapid convergence speed as well as satisfactory qualities of the best solutions. With the help of differential evolution, IDPSO expands the diversity of particles and efficiently avoids local optimal trapping problems. The experimental results also demonstrate the effectiveness of the proposed method in terms of visual quality, file size increment and complexity cost.

Abstract:
Research in texture feature approximation is still in the embryonic stage because of difficulties in developing a sound theoretical model to express the unique pattern in the intensity-variation of pixels in the neighbourhood of the pixel-of-interest so that it can sufficiently discriminate different textures. Local texture descriptors are widely used in image segmentation as they comprise pixel-wise features. The Weber local descriptor (WLD) with differential excitation and gradient orientation components, inspired by Weber's Law, has been leveraged in the state-of-the-art iterative contraction and merging (ICM) image segmentation technique. However, WLD has inherent drawbacks in the formulation of the components that limit its discriminatory capability. This paper introduces a novel texture descriptor by directly modelling the distribution of intensity-variation in the parametric space of the Weibull distribution using its shape and scale parameters. A unified ‘joint scale’ texture property is introduced, which can discriminate textures better than the individual parameters while keeping the length of the descriptor shorter. Additionally, the accuracy of WLD's gradient orientation component is improved by using an extended Sobel operator and expressing gradients in the range [-π/2, π/2). When incorporated in ICM, the proposed texture descriptor has consistently outperformed WLD and a recent enhancement with radial mean WLD (RM-WLD) on three benchmark datasets. It has also outperformed two other texture segmentation techniques and their deep learning based improvements.

Abstract:
Composed image retrieval (CIR) aims at fusing a reference image and text feedback to search for the desired images. Compared to general image retrieval, it can model the users' search intent more comprehensively and search the target images more accurately, which has a significant impact on various real-world applications, such as e-commerce and Internet search. However, because of the heterogeneous semantic gap, jointly understanding and fusing image and text is difficult. In this work, to tackle this problem, we propose an end-to-end framework, MCR, which uses text and images as retrieval queries. The framework mainly includes four pivotal modules. Specifically, we introduce the Relative Caption-aware Consistency (RCC) constraint to align text pieces and images in the database, which can effectively bridge the heterogeneous gap. The Multi-modal Complementary Fusion (MCF) and Cross-modal Guided Pooling (CGP) modules are constructed to mine multiple interactions between image local features and text word features and learn the complementary representation of the composed query. Furthermore, we develop a plug-and-play Weak-text Semantic Augment (WSA) module for datasets with short or incomplete query texts, which can supplement the weak-text features and is conducive to modeling an augmented semantic space. Extensive experiments on several benchmarks demonstrate superior performance over existing state-of-the-art algorithms.

Abstract:
Parkinson's disease (PD) is a neurodegenerative disease with a high incidence rate. Effective early diagnosis of PD is critical to prevent further deterioration of a patient's condition, where gait abnormalities are important factors for doctors to diagnose PD. Deep learning (DL)-based methods for PD detection using gait information recorded by non-invasive sensors have emerged to assist doctors in accurate and efficient disease diagnosis. However, most existing DL-based PD detection models neglect information in the frequency domain and do not adaptively model the correlation of signals among sensors. Moreover, different people have different gait patterns. Therefore, the generalization capabilities of PD detection models on diversities of individuals' gaits are essential. This work proposes a novel robust frequency-domain-based graph adaptive network (RFdGAD) for PD detection from gait information (i.e., vertical ground reaction force signals recorded by foot sensors). Specifically, the RFdGAD first learns the frequency-domain features of signals from each foot sensor by a frequency representation learning block. Then, the RFdGAD utilizes a graph adaptive network block taking frequency-domain features as input to adaptively learn and exploit the interconnection between different sensor signals for accurate PD detection. Moreover, the RFdGAD is trained by minimizing the proposed Jensen-Shannon divergence-based localized generalization error to improve the generalization performance of RFdGAD on unseen subjects. Experimental results show that the RFdGAD outperforms existing DL-based models for PD detection on three widely used datasets in terms of three metrics, including accuracy, F1-score, and geometric mean.

Abstract:
In recent years, joint detection and embedding (JDE) has become the research focus in multi-object tracking (MOT) due to its fast inference speed. JDE models are designed and widely utilized to train the detection task and the re-identification (Re-ID) task jointly. However, there exists a severe issue overlooked by previous JDE models, i.e., the detection task requires category-level features but the Re-ID task requires instance-level features. This could lead to feature conflict, which would hurt the performance of JDE models. Furthermore, inaccurate detection results can degrade the final tracking accuracy even when discriminative Re-ID features are provided. In this article, we propose a new balancing method for training JDE models, which monitors the training process of the detection task and adjusts the weights of the detection task and Re-ID task in the training phase. Our proposed balancing method ensures a well-trained detection model and a good trade-off between the detection task and Re-ID task. Comprehensive experiments on two public MOT benchmarks demonstrate the effectiveness and superiority of our proposed balancing method. In particular, our proposed balancing method could achieve new state-of-the-art results on MOT challenges without additional training data.

Abstract:
With the explosive increase of multimodal data, cross-modal correlation classification has become an important research topic and is in great demand in many cross-modal applications. A variety of classification schemes and predictive models have been built based on the existing cross-modal correlation categorization. However, these classification schemes typically follow the prior assumption that the paired cross-modal samples are strictly related, and thus pay great attention to the fine-grained relevant types of cross-modal correlation, ignoring the high volume of implicitly relevant data, which are often wrongly classified into irrelevant types. Moreover, previous predictive models fall short of reflecting the essence of cross-modal correlation according to their definitions, especially in the modeling of network structure. Thus, in this paper, by comprehensively investigating the current image-text correlation classification research, we define a new classification scheme for cross-modal correlation based on implicit and explicit relevance. To predict the types of image-text correlation based on our proposed definition, we further devise the Association and Alignment Network (namely AnANet) to model the implicit and explicit relevance, which captures both the implicit association of global discrepancy and commonality between image and text and the explicit alignment of cross-modal local relevance. Experimental studies on our newly constructed image-text correlation dataset verify the effectiveness of our proposed model.

Abstract:
Cross-domain adaptation aims to improve the performance of the target domain model by making full use of information-rich source domain samples. However, as information becomes richer, the noise also increases. In order to improve the reliability of cross-domain adaptation, we propose a novel method based on deep robust low-rank correlation. Borrowing from the traditional idea of Canonical Correlation Analysis (CCA), we developed a robust correlation model to maximize the correlation between source and target domains. Also, the low-rank characteristics of cross-domain data can effectively reduce the negative influence of noisy data. Furthermore, so that the cross-domain data can share a unifying clustering structure, we introduced a common Laplacian affinity structure. The learned features can then be smoothed and aligned to the unifying structure. In this way, we obtain a deep robust low-rank correlation model with the help of the unifying clustering structure, which can effectively reduce the influence of noise and improve the performance of cross-domain adaptation. Experimental results on three datasets, Office-31, ImageCLEF-DA and Office-Home, show that our model significantly outperforms state-of-the-art cross-domain adaptation methods.

Abstract:
View-based methods have achieved state-of-the-art performance in 3D object retrieval. However, view-based methods still encounter two major challenges. The first is how to leverage the inter-view correlation to enhance view-level visual features. The second is how to effectively fuse view-level features into a discriminative global descriptor. To address these two challenges, we propose a multi-range view aggregation network (MRVA-Net) with a vision transformer based feature fusion scheme for 3D object retrieval. Unlike existing methods, which aggregate only neighboring or adjacent views and may therefore bring in redundant information, we propose a multi-range view aggregation module that enhances individual view representations by aggregating not only neighboring views but also views at different ranges. Furthermore, to generate the global descriptor from view-level features, we propose to employ the multi-head self-attention mechanism introduced by the vision transformer to fuse the view-level features. Extensive experiments conducted on three public datasets, ModelNet40, ShapeNet Core55 and MCB-A, demonstrate the superiority of the proposed network over state-of-the-art methods in 3D object retrieval.

Abstract:
String prediction (SP) is a very efficient screen content coding (SCC) tool that has been adopted in the third generation of the Audio Video Standard (AVS3). It is observed that two special types of strings occur frequently. To further improve the coding efficiency for SCC on top of the original SP, a new variation of SP named Equal-value-string and Copy-above-string based SP (ECSP) is proposed. An ECSP coding unit uses only three types of strings: Equal-value-string, Copy-above-string, and Unpredictable-pixel-string. Compared with the AVS3 reference software HPM9.0 with ECSP disabled, using the AVS3 SCC Common Test Condition and YUV 4:2:0 test sequences, the proposed technique achieves an average Y BD-rate reduction of 5.54% and 3.01% for the All Intra and Low Delay configurations, respectively, with low additional encoding and decoding complexity. The proposed ECSP has been adopted in the AVS3 standard.

Abstract:
The visual quality of a single image captured by a digital camera usually suffers from limited spatial resolution and low dynamic range (LDR) due to sensor constraints. To address these problems, recent works have independently applied convolutional neural networks (CNNs) to super-resolution (SR) and high dynamic range (HDR) imaging and made significant improvements in visual quality. However, directly connecting SR and HDR networks is an inefficient way to enhance image quality, because these two tasks share most of the same processing steps. To this end, we propose a deep neural network for the joint task of SR and HDR imaging, termed Deep SR-HDR, which reconstructs a high-resolution (HR) HDR image from a set of differently exposed low-resolution (LR) LDR images of a dynamic scene. Specifically, we merge the shared processing steps, including feature extraction and alignment of these two tasks. In particular, to handle large-scale complex motions, we design a multi-scale deformable module (MSDM) that estimates the sampling location offsets in a coarse-to-fine manner and then flexibly integrates useful information to compensate for the missing content in the motion regions. Then, we divide the fusion stage into two branches for HDR generation and high-frequency information extraction. With the cooperation and interactions of these modules, the proposed network reconstructs high-quality HR HDR images. Extensive qualitative and quantitative experimental results demonstrate the superiority and high efficiency of the proposed network.

Abstract:
Semi-Supervised Learning (SSL) with mismatched classes deals with the problem that the classes of interest in the limited labeled data are only a subset of the classes in massive unlabeled data. As a result, classical SSL methods would be misled by the classes that appear only in the unlabeled data. To solve this problem, some recent methods divide unlabeled data into useful in-distribution (ID) data and harmful out-of-distribution (OOD) data, and the latter in particular are down-weighted. As a result, the potential value contained in OOD data is largely overlooked. To remedy this defect, this paper proposes a “Transferable OOD data Recycling” (TOOR) method which properly utilizes ID data as well as the “recyclable” OOD data to enrich the information for conducting class-mismatched SSL. Specifically, TOOR treats the OOD data that have a close relationship with ID data and labeled data as recyclable, and employs adversarial domain adaptation to project them to the space of ID data and labeled data. In other words, the recyclability of an OOD datum is evaluated by its transferability, and the recyclable OOD data are transferred so that they are compatible with the distribution of the known classes of interest. Consequently, TOOR extracts more information from unlabeled data than existing methods, and thus achieves improved performance, as demonstrated by experiments on typical benchmark datasets.

Abstract:
User preference music transfer (UPMT) is a new problem in music style transfer that can be applied to many scenarios but remains understudied. Transferring an arbitrary song to fit a user’s preferences increases musical diversity and improves user engagement, which can greatly benefit individuals’ mental health. Most music style transfer approaches rely on data-driven methods. In general, however, constructing a large training dataset is challenging because users can rarely provide enough of their favorite songs. To address this problem, this paper proposes a novel hybrid method called User Preference Transformer (UP-Transformer) which uses prior knowledge of only one piece of a user’s favorite music. Based on the distribution of music events in the provided music, we propose a new favorite-aware loss function to fine-tune the Transformer-based model. Two steps are proposed in the transfer phase to achieve UPMT based on the extracted music pattern in a user’s favorite music. Additionally, to alleviate the problem of evaluating melodic similarity in music style transfer, we propose a new concept called pattern similarity (PS) to measure the similarity between two pieces of music. Statistical tests indicate that the results of PS are consistent with the similarity score in a qualitative experiment. Furthermore, experimental results on subjects show that the transferred music achieves better performance in musicality, similarity, and user preferences.

Abstract:
Recently, most dehazed image quality assessment (DQA) methods have focused on estimating the remaining haze while omitting the distortion introduced as a side effect of dehazing algorithms, which limits their performance. To address this problem, we propose a no-reference (NR) dehazed image quality assessment method that learns both visibility- and distortion-aware features (VDA-DQA). Visibility-aware features are exploited to characterize clarity optimization after dehazing, including the brightness-, contrast-, and sharpness-aware features extracted by the complex contourlet transform (CCT). Then, distortion-aware features are employed to measure the distortion artifacts of images, including the normalized histogram of the local binary pattern (LBP) from the reconstructed dehazed image and the statistics of the CCT subbands corresponding to the chroma and saturation map. Finally, all the above features are mapped into quality scores by support vector regression (SVR). Extensive experimental results on six public DQA datasets verify the superiority of the proposed VDA-DQA method in terms of consistency with subjective visual perception and show that it outperforms state-of-the-art methods.

Abstract:
Depth provides complementary information for salient object detection (SOD). However, the performance of RGB-D SOD methods is usually hindered by low-quality depth maps, the cross-modality semantic gap, and the intrinsic gap between multi-level features. Although recent RGB-D SOD methods have incorporated depth quality assessment, these methods do not consider the inconsistency of the depth format across datasets. In this paper, we propose an interpretable and effective mechanism called interference degree (ID) to assess depth quality and reweight the contribution of single-modality features without extra annotation. Then, a cross-modality interaction block (CMIB) is designed to reduce the semantic gap between RGB and depth features with the help of the ID mechanism, and a mutually guided cross-level fusion (MGCF) module is designed to reduce the intrinsic gap among multi-level features. Finally, a refinement branch is proposed to enhance the salient regions and suppress the non-salient regions of fused features. Extensive experiments on six benchmark datasets show that the proposed depth-induced gap-reducing network (DIGR-Net) outperforms 20 recent state-of-the-art methods.

Abstract:
Modeling a sequence of video frames as a linear subspace on a Grassmann manifold has recently become increasingly attractive in multiple computer vision applications. The success of such algorithms largely depends on a good distance measure, and learning an appropriate metric on the Grassmann manifold remains a key challenge. Existing works address this by learning a discriminative mapping from the original Grassmann manifold to a Hilbert space or to a lower-dimensional, more discriminative Grassmann manifold. However, these approaches rely heavily on nearest neighbor matching of samples in the projected space, which is sensitive to noise and errors. In contrast, this paper proposes a Grassmann Reconstruction Metric Learning (GRML) algorithm guided by a sparse representation-based classifier (SRC) for image set classification. SRC selects the coefficients associated with each class to reconstruct training samples, and we then employ it as a criterion to direct the design of a discriminant metric on the Grassmann manifold. Specifically, GRML attempts to jointly maximize the inter-class reconstruction residual and minimize the intra-class reconstruction residual in the lower-dimensional but more discriminative Grassmann manifold. To further explore the intrinsic geometry distance, we present a Grassmann Reconstruction Multiple Kernel Metric Learning (GRMKML) algorithm, which aims to jointly learn a metric and the corresponding kernel from a family of kernels for the Grassmann manifold. Extensive experiments on eight benchmark datasets demonstrate that the proposed algorithms perform favorably against the state-of-the-art methods.
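
To make the notion of a distance between subspaces concrete, the sketch below computes the projection metric, one commonly used fixed distance on the Grassmann manifold. It is not the learned GRML or GRMKML metric. It assumes each image set has already been reduced to a small list of spanning vectors, and the function names are illustrative.

```python
import math


def orthonormal_basis(vectors):
    """Modified Gram-Schmidt: orthonormal basis of the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            proj = sum(x * y for x, y in zip(w, b))
            w = [x - proj * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm > 1e-10:  # skip vectors that are (numerically) dependent
            basis.append([x / norm for x in w])
    return basis


def projection_distance(basis_a, basis_b):
    """Projection metric between two k-dimensional subspaces.

    For orthonormal bases A and B this equals sqrt(k - ||A^T B||_F^2),
    which is 0 when the subspaces coincide.
    """
    if len(basis_a) != len(basis_b):
        raise ValueError("subspaces must have the same dimension")
    k = len(basis_a)
    overlap = sum(sum(x * y for x, y in zip(a, b)) ** 2
                  for a in basis_a for b in basis_b)
    return math.sqrt(max(0.0, k - overlap))
```

The distance depends only on the subspaces, not on the particular basis chosen for them, which is the property that makes the Grassmann representation suitable for comparing image sets.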

Abstract:
Prohibited item segmentation has a wide range of applications in the security check field, such as computer-aided screening, threat image projection and material discrimination. However, the severe object overlapping in X-ray baggage images greatly restricts the performance of common CNN-based segmentation methods. Worse, no public dataset can be used to promote research in this challenging and promising area. In this paper, to cope with these problems, we present the first Prohibited Item X-ray segmentation dataset, named PIXray. PIXray comprises 5,046 X-ray images, in which 15 classes of 15,201 prohibited items are annotated as instance-level masks. In addition, we contribute a dense de-overlap attention snake (DDoAS) in the context of deep learning for automated and real-time prohibited item segmentation. DDoAS mainly includes a dense de-overlap module (DDoM) and an attention deforming module (ADM). Specifically, DDoM is designed to infer prohibited item information accurately from extreme background overlaps through dense reversed connections. ADM aims to improve the low learning efficiency introduced by large variations in shapes and sizes among different prohibited items. Comprehensive evaluation on PIXray shows the effectiveness and superiority of DDoM and ADM. DDoM recognizes prohibited items against complex backgrounds better than other in-domain methods and achieves consistent performance gains over various network backbones, extending the idea of tackling overlapping image data. ADM can ease the model training and further refine the mask quality. Furthermore, out-of-domain experiments show that DDoAS can also be applied to natural images and achieves performance comparable to the state-of-the-art methods, which implies its potential applications in other fields. The dataset and source code are available at https://github.com/Mbwslib/DDoAS.

Abstract:
Prohibited items inspection using X-ray screening is essential for reducing the risk of crime and terrorist attacks. The difficulty in prohibited items inspection lies in accurately detecting prohibited items in complex X-ray images and in the limited access to X-ray images containing prohibited items. Few-shot segmentation aims at learning from limited examples and assigning a category label to each image pixel. However, current few-shot methods are mostly fully supervised and less robust to prohibited item categories that did not appear during the training process. In this paper, we propose a method for few-shot prohibited items segmentation that utilizes unlabeled data and better leverages the representation of input samples during model training. Specifically, a patch-based self-supervised embedding network is first devised as the base learner to learn an abstract representation of the observation from unlabeled samples. Then we apply few-shot learning and generate an abstract representation related to prohibited items from the support sample within the embedding space, followed by obtaining the corresponding class-specific prototype representations via masked average pooling. The distances between each pixel of the query sample and the prototypes are calculated to predict the label of each pixel. Moreover, a prototype reverse validation (PRV) strategy is proposed to further exploit the support representation to assist training. Extensive experiments show that our proposed method outperforms the state of the art by delivering higher accuracy on automated prohibited items inspection while requiring fewer labeled samples.
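
The prototype step described above (masked average pooling followed by distance-based pixel labeling) can be sketched in a few lines. The example assumes feature maps stored as nested lists of shape H x W x C and uses Euclidean distance. The distance choice, the function names, and the class names used as dictionary keys are assumptions of this sketch, and the self-supervised embedding network and the PRV strategy are not shown.

```python
import math


def masked_average_pooling(features, mask):
    """Average the feature vectors of all pixels selected by a binary mask.

    features: H x W x C nested lists, mask: H x W nested lists of 0/1.
    Returns a C-dimensional class prototype.
    """
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no pixels")
    return [t / count for t in total]


def label_pixels(query, prototypes):
    """Label each query pixel with the name of its nearest prototype."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return [[min(prototypes, key=lambda name: dist(vec, prototypes[name]))
             for vec in row]
            for row in query]
```

In use, one prototype would be pooled per class from the support features and its mask, collected in a dictionary such as `{"item": ..., "background": ...}`, and passed to `label_pixels` together with the query features.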

Abstract:
High dynamic range (HDR) image recovery from a low dynamic range (LDR) image aims to estimate the HDR image by decompressing the luminance range and enhancing the details of the LDR input. In practice, state-of-the-art prediction methods cannot adequately handle over-exposed, under-exposed or low-light images. To address this, a light adaptation HDR recovery framework (LA-HDR) is proposed, which includes multi-images generation for adaptive details amplification in different light ranges, followed by multi-details fusion. To create the multi-images, first, the designed bit-depth enhancement network (EnhanceNet) produces a high bit-depth result with enhanced contrast. This result can be further processed by a user-defined denoising method to suppress low-light noise. Meanwhile, the proposed exposure bias network (EBNet) estimates the global exposure bias of the input for rectifying the mid-range details. With the enhanced result and the exposure bias, the designed transfer functions adaptively create three multi-images containing the enhanced details in different light ranges, and they are fused by the designed multi-images fusion network (FuseNet) for the final HDR prediction. The amplification and fusion scheme ensures robust HDR recovery under different light conditions, eliminating the high-light recovery artifacts of previous methods. The proposed fusion masks generation (FMG) and global feature embedding (GFE) modules in FuseNet help eliminate fusion artifacts. Experimental results show that LA-HDR achieves the best average performance under various light conditions and is only slightly affected by the input light conditions compared with the tested state-of-the-art HDR recovery methods.

Abstract:
The exponential demand for multimedia services is one reason behind the substantial growth of mobile data traffic. Video traffic patterns have significantly changed in the past two years due to the coronavirus disease (COVID-19). The worldwide pandemic has caused many individuals to work from home and use various online video platforms (e.g., Zoom, Google Meet, and Microsoft Teams). As a result, overloaded macrocells are unable to ensure high Quality of Experience (QoE) to all users. Heterogeneous Networks (HetNets) consisting of small cells (femtocells) and macrocells are a promising solution to mitigate this problem. A critical challenge with the deployment of femtocells in HetNets is the interference management between Macro Base Stations (MBSs) and Femto Base Stations (FBSs), and among FBSs themselves. Indeed, the dynamic deployment of femtocells can lead to co-tier interference. With the rollout of the 5G mobile network, it becomes imperative for mobile operators to maintain network capacity and manage different types of interference. Machine Learning (ML) is considered a promising solution to many challenges in 5G HetNets. In this paper, we propose a Machine Learning Interference Classification and Offloading Scheme (MLICOS) to address the problem of co-tier interference between femtocells for video delivery. Two versions of MLICOS, namely, MLICOS1 and MLICOS2, are proposed. The former uses conventional ML classifiers while the latter employs advanced ML algorithms. Both versions of MLICOS are compared with the classic Proportional Fair (PF) scheduling algorithm, the Variable Radius and Proportional Fair scheduling (VR + PF) algorithm, and a Cognitive Approach (CA). The ML models are assessed based on prediction accuracy, precision, recall and F-measure. Simulation results show that MLICOS outperforms the other schemes by providing the highest throughput and the lowest delay and packet loss ratio. A statistical analysis was also carried out to depict the degree of interference faced by users when different schemes are employed.
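The four classifier metrics named in the abstract above (accuracy, precision, recall, F-measure) can be computed from predicted and true labels as sketched below. This is a generic illustration in plain Python; the function name and the binary encoding (1 standing for a hypothetical "interfered" class) are assumptions, not details from the paper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f_measure) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-measure is the harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_measure


# Toy usage: label 1 marks the hypothetical "interfered" class.
acc, prec, rec, f1 = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0],
    y_pred=[1, 1, 0, 1, 0, 0],
)
```

With two true positives, one false positive and one false negative, precision, recall and F-measure all equal 2/3 in this toy case.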

Abstract:
It has been well recognized that fusing the complementary information from depth-aware LiDAR point clouds and semantic-rich stereo images would benefit 3D object detection. Nevertheless, it is non-trivial to explore the inherently unnatural interaction between sparse 3D points and dense 2D pixels. To ease this difficulty, recent approaches generally project the 3D points onto the 2D image plane to sample the image data and then aggregate the data at the points. However, these approaches often suffer from the mismatch between the resolution of point clouds and RGB images, leading to sub-optimal performance. Specifically, taking the sparse points as the multi-modal data aggregation locations causes severe information loss for high-resolution images, which in turn undermines the effectiveness of multi-sensor fusion. In this paper, we present VPFNet, a new architecture that aligns and aggregates the point cloud and image data at "virtual" points. In particular, with their density lying between that of the 3D points and the 2D pixels, the virtual points can bridge the resolution gap between the two sensors and thus preserve more information for processing. Moreover, we investigate data augmentation techniques that can be applied to both point clouds and RGB images, as data augmentation has made a non-negligible contribution to 3D object detectors to date. We have conducted extensive experiments on the KITTI dataset and have observed good performance compared to the state-of-the-art methods. Remarkably, our VPFNet achieves 83.21% moderate AP_3D and 91.86% moderate AP_BEV on the KITTI test set. The network design also takes computation efficiency into consideration: it runs at 15 FPS on a single NVIDIA RTX 2080Ti GPU.
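The baseline fusion step that the abstract above attributes to recent approaches (project 3D points onto the image plane, then sample image data at the projected locations) can be sketched as follows. This is not VPFNet itself. The pinhole camera model, the intrinsic matrix `K`, the nearest-neighbour sampling and all function names are assumptions made for illustration.

```python
import numpy as np


def project_points(points_cam, K):
    """Project (N, 3) camera-frame points (z > 0) to (N, 2) pixel coordinates (u, v)."""
    uvw = points_cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]


def sample_image_at_points(image, uv):
    """Nearest-neighbour sampling of an (H, W, C) image at (N, 2) pixel coordinates.

    Returns (N, C) features and an (N,) boolean mask of points inside the image.
    Points outside the image keep an all-zero feature vector.
    """
    h, w = image.shape[:2]
    cols = np.rint(uv[:, 0]).astype(int)
    rows = np.rint(uv[:, 1]).astype(int)
    valid = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    feats = np.zeros((len(uv), image.shape[2]), dtype=image.dtype)
    feats[valid] = image[rows[valid], cols[valid]]
    return feats, valid


# Toy usage with a hypothetical intrinsic matrix and a one-channel 80x100 image.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 40.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 10.0],     # projects to the principal point
                   [1.0, 0.5, 10.0],     # projects inside the image
                   [-10.0, 0.0, 10.0]])  # projects outside the image
image = np.zeros((80, 100, 1))
image[45, 60, 0] = 7.0
uv = project_points(points, K)
feats, valid = sample_image_at_points(image, uv)
```

Because only one image sample is taken per 3D point, a sparse point cloud touches few pixels of a high-resolution image, which is the information loss the abstract describes.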

Abstract:
Endowing machines with eyes and brains to effectively understand and analyze crowd scenes is of paramount importance for building a smart city that serves people. It is of far-reaching significance for guiding dense crowds and preventing accidents such as crowding and stampedes. As a typical multimodal scene understanding task, image captioning has always attracted widespread attention. However, captioning for crowd scene understanding is rarely studied due to the lack of related datasets. Therefore, it is difficult to know what happens in crowd scenes. To fill this research gap, we propose a crowd scene caption dataset named CrowdCaption, which has the advantages of crowd-topic scenes, comprehensive and complex caption descriptions, typical relationships and detailed grounding annotations. The complexity and diversity of the descriptions and the specificity of the crowd scenes make this dataset extremely challenging for most current methods. Thus, we propose a Multi-hierarchical Attribute Guided Crowd Caption Network (MAGC) based on crowd objects, actions, and status (such as position, dress, posture, etc.), aiming to generate crowd-specific detailed descriptions. We conduct extensive experiments on our CrowdCaption dataset, and our proposed method reaches state-of-the-art (SoTA) performance. We hope the CrowdCaption dataset can assist future studies related to crowd scenes in the multimodal domain.

Abstract:
Challenging motion, which tends to cause artifacts, is a key problem in the video denoising task. Recent video denoising methods have attempted to address this problem. However, they usually provide general performance evaluation on the overall dataset and cannot provide a comprehensive analysis for the influence of different motion levels. Thus, we questioned whether these methods can effectively deal with different scene motions. To this end, we synthesize a dataset containing videos with different motion levels and capture a new dataset that consists of videos involving large-scale motion. Then, we provide a comprehensive analysis on the elaborately collected datasets and find that, as the motion level increases, the performance of the denoising models based on implicit motion estimation (IME) declines sharply, while explicit motion estimation (EME) contributes to a more robust denoising quality. Therefore, in this work, we present an EME-embedded progressive denoising framework that fully considers the relationship between the noise removal and motion estimation. Specifically, we decouple video denoising into spatial denoising, EME-based frame reconstruction, and temporal refining processes. Spatial denoising improves the accuracy of EME process in the case of videos suffering from heavy noise, while the temporal refining process refines the denoised frame by utilizing temporal redundancy of the reconstructed motion-free frames. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art methods, especially for videos containing large-scale motion.

Abstract:
The emerging technologies of Virtual Reality (VR) and 360° video introduce new challenges for state-of-the-art video communication systems. Enormous data volume and spatial user navigation are unique characteristics of 360° videos that necessitate a space-time effective allocation of the available network streaming bandwidth over the 360° video content to maximize the Quality of Experience (QoE) delivered to the user. Towards this objective, we investigate a framework for viewport-driven rate-distortion optimized 360° video streaming that integrates the user view navigation patterns and the spatiotemporal rate-distortion characteristics of the 360° video content to maximize the delivered user viewport video quality, for the given network/system resources. The framework comprises a methodology for assigning dynamic navigation likelihoods over the 360° video spatiotemporal panorama, induced by the user navigation patterns; an analysis and characterization of the 360° video panorama's spatiotemporal rate-distortion characteristics that leverage preprocessed spatial tiling of the content; and an optimization problem formulation and solution that capture and aim to maximize the delivered expected viewport video quality, given a user's navigation patterns, the 360° video encoding/streaming decisions, and the available system/network resources. We formulate a Markov model to capture the navigation patterns of a user over the 360° video panorama and simultaneously extend our actual navigation datasets by synthesizing additional realistic navigation data. Moreover, we investigate the impact of using two different tile sizes for equirectangular tiling of the 360° video panorama. Our experimental results demonstrate the advantages of our framework over the conventional approach of streaming a monolithic uniformly-encoded 360° video and a state-of-the-art navigation-speed based reference method. Considerable average and instantaneous viewport video quality gains of up to 5 dB are demonstrated in the case of five popular 4K 360° videos. In addition, we explore the impact of two different popular 360° video quality metrics applied to evaluate the streaming performance of our system framework and the two reference methods. Finally, we demonstrate that by exploiting the unequal rate-distortion characteristics of the different spatial sectors of the 360° video panorama, we can enable spatially more uniform and temporally higher 360° video viewport quality delivered to the user, relative to monolithic streaming.
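The idea of a Markov model over navigation patterns, used both to describe a user's tile-to-tile movement and to synthesize additional navigation data, can be sketched as below. The abstract does not give the model's state space or order, so a first-order chain over discrete tile indices is an assumption here, as are the function names and the uniform fallback for tiles never visited in the trace.

```python
import numpy as np


def estimate_transition_matrix(trace, num_tiles):
    """Estimate a first-order transition matrix from a sequence of tile indices."""
    counts = np.zeros((num_tiles, num_tiles))
    for current_tile, next_tile in zip(trace[:-1], trace[1:]):
        counts[current_tile, next_tile] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    # Rows with no observed transitions fall back to a uniform distribution
    # (an assumption of this sketch).
    return np.where(row_sums > 0, counts / safe_sums, 1.0 / num_tiles)


def synthesize_trace(P, start, length, seed=0):
    """Sample a synthetic navigation trace of the given length from the chain P."""
    rng = np.random.default_rng(seed)
    trace = [start]
    for _ in range(length - 1):
        trace.append(int(rng.choice(len(P), p=P[trace[-1]])))
    return trace


# Toy usage: a short recorded trace over 4 tiles (tile 3 is never visited).
recorded = [0, 1, 0, 1, 2, 0]
P = estimate_transition_matrix(recorded, num_tiles=4)
synthetic = synthesize_trace(P, start=0, length=10)
```

Each row of `P` gives the likelihood of moving from one tile to every other tile, which is one simple way to obtain navigation likelihoods over the panorama.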

Abstract:
Virtual standard patients (VSPs) are in high demand for efficiently training medical students' diagnostic ability. Unlike the traditional conversation system in medical dialogue generation, a VSP needs a novel conversation paradigm in which it acts as the patient instead of the doctor. However, existing conversation techniques still have limited ability to generate the symptoms exhibited by patients with personalized and knowledge-centered expressions. To alleviate these problems, we propose to construct a novel oral knowledge graph, which provides sufficient medical clues for a given disease. Accordingly, the VSP can accurately interpret the dentists' underlying intentions and express the symptom characteristics in a natural style. To efficiently retrieve the related disease clues, the symptom descriptions of oral diseases are encoded into the oral knowledge graph, which organizes the disease-centered symptom entities and speaking styles. Moreover, to transfer common-sense knowledge from existing large-scale medical knowledge graphs to the specific oral knowledge graph, a coupled pre-trained BERT model is further designed to learn the related medical knowledge hierarchically, from the coarse level to the fine level. Finally, a series of well-designed personalized templates are proposed to generate plausible and realistic answers conditioned on a given disease. We also conduct extensive user studies to demonstrate that the VSP satisfies the requirements of medical students' diagnosis practice in terms of naturalness, realism, and topic relevance.

Abstract:
In hyperspectral target detection, conventional metric learning-based algorithms offer unique advantages because they do not require specific assumptions and adapt to limited training samples. Nevertheless, they usually learn a linear transformation for the metric space, which cannot capture the nonlinear mappings that hyperspectral imagery exhibits, especially under spectral variability and nonlinear mixing. To alleviate this limitation, this study investigates a new spatial-spectral adaptive sample generation and deep metric learning-based method for hyperspectral target detection (denoted as DMLTD). The proposed DMLTD employs a spatial-spectral adaptive sample generation strategy and a subpixel synthetic method for background sample generation and target sample augmentation, respectively. With sufficient samples, the proposed DMLTD trains a deep discriminative metric learning network to learn hierarchical nonlinear mappings, so as to address the spectral variability and nonlinear mixing problems and thus exploit discriminative information between targets and backgrounds for detection. Experiments and analyses conducted on three real-world hyperspectral datasets indicate that our DMLTD yields competitive performance in hyperspectral image target detection.

Abstract:
Object detection for aerial images has achieved remarkable progress in recent years. Nevertheless, most existing studies do not differentiate oriented object detection from horizontal detection. Certain schemes ignore the ambiguity of oriented object representation and directly leverage label assignment designed for horizontal object detection. Consequently, training becomes unstable and performance degrades, because high-quality samples surrounding the oriented bounding boxes cannot be leveraged effectively. To address this problem, we propose a highly efficient gliding Free, orientation Free, and anchor Free Network (Free$^3$Net) for oriented object detection. Specifically, we propose an unambiguous oriented object representation scheme, named FreeGliding, by gliding the projection points of samples on each edge of horizontal bounding boxes. It makes the detection largely free from representation ambiguity and multi-task dependency. To overcome the restrictions of label assignment, we put forward a novel Loss-aware Outer Sample Selection (LOSS) scheme, which takes into consideration spatial information and localization capability to retain high-quality samples surrounding the objects. Moreover, we introduce an Oriented Feature Fusion (OFF) scheme to tackle feature alignment by adjusting the receptive field and fusing oriented features dynamically. Experimental results on two large-scale remote sensing datasets, HRSC2016 and DOTA, demonstrate that Free$^3$Net outperforms the state-of-the-art schemes by a large margin. We hope our work can inspire rethinking the design of anchor-free detectors and serve as a strong baseline for oriented object detection.

Abstract:
Deep learning technologies have been applied in various computer vision tasks in recent years. However, deep models suffer performance decay when some unforeseen data are contained in the testing dataset. Although data enhancement techniques can alleviate this dilemma, the diversity of real data is too tremendous to simulate. To tackle this challenge, we study a scheme for improving the robustness and efficiency of the deep network training process in visual tasks. Specifically, first, we build positive and negative sample pairs based on a class-sensitive strategy. Then, we construct a feature-consistent learning strategy based on contrastive learning to constrain the representations of interclass features while paying attention to the intraclass features. To extend the effect of the consistent strategy, we propose a novel contrastive Jensen–Shannon divergence consistency loss (JS loss) to restrict the probability distributions of different sample pairs. The proposed scheme successfully enhances the robustness and accuracy of the utilized model. We validated our approach by conducting extensive experiments in the domains of model robustness and few-shot object detection (FSOD). The results showed that the proposed method achieved remarkable gains over state-of-the-art (SOTA) methods. We obtained a 3.2% average improvement over the best-performing FSOD method.
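The quantity underlying the JS loss mentioned above, the Jensen-Shannon divergence between two probability distributions, can be sketched as follows. This shows only the standard divergence. How the paper combines it into a contrastive consistency loss over sample pairs is not stated in the abstract, so the function names and the clipping constant `eps` are assumptions of this sketch.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q), with clipping to avoid log(0)."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by log(2) in nats."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy usage: predicted class distributions for two views of one sample.
view_a = [0.7, 0.2, 0.1]
view_b = [0.6, 0.3, 0.1]
consistency_penalty = js_divergence(view_a, view_b)
```

Unlike the KL divergence, the JS divergence is symmetric in its arguments and bounded, which makes it a common choice when two predictions are to be pulled towards each other rather than one towards a fixed target.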

Abstract:
Cloud virtual reality (Cloud VR) services usually introduce high latency in rendering and streaming, resulting in a mismatch between the visual and vestibular systems, causing user sickness and dizziness during the service. To solve this problem, asynchronous rendering technology is usually used to provide smooth viewing. The asynchronous solution, on the other hand, will introduce another “black edge” (BE) artifact in the service, which frequently appears at the viewport's boundary with a black area when users turn their heads. This unwanted BE artifact also has an impact on the user's quality of experience (QoE). In this paper, we investigated the impact of the BE artifact on the user's QoE of the field of view (FOV) Cloud VR gaming services. The appearance of the BE artifact during the playing period was regarded as a series of BE events and the impact of BE artifact on the users’ QoE was evaluated by accumulating the influence of the BE events during the whole playing period. More specifically, the user's QoE affected by a single BE event was first evaluated by combining the area ratio and duration of the BE artifact. Then, the QoE affected by multiple BE events was analyzed, where the cumulative influence of the previous BE events on the user's current QoE was evaluated. Finally, a unified event-based evaluation model was proposed to predict the user's time-varying QoE at any point in time. Experimental results showed that the proposed model performed exceptionally well in predicting the impact of BE artifact on the user's QoE.

Abstract:
Text-to-image synthesis is a challenging problem, in which a complex scene contains diverse objects of various sizes and sub-images of objects belonging to the same class have diverse forms from different perspectives. Thus, synthesis models have difficulty in capturing varied objects in the complex scene. To alleviate these problems, we devise an independent object-level decomposing and enhancing generative adversarial networks, denoted as InDecGAN, to synthesize complex images and capture varied objects in a complex scene. Specifically, InDecGAN fully utilizes the independent object-level information, bounding boxes and high-resolution images of objects in training, by employing independent object-level pathways to synthesize varied objects. The independent object-level pathway integrates an independent object-level adversarial loss and the bounding box information to learn the visual features of objects independently, then, the main pathway exploits the features provided by the object-level pathway to compose the full scene and synthesize images. In addition, we analyze the generalization properties of the proposed InDecGAN and demonstrate the improvement from the perspective of the model architecture. Moreover, extensive experiments conducted on a widely used dataset are presented to demonstrate that the proposed model with an independent object-level pathway produces synthesized images of significantly improved quality.

Abstract:
Semantic Scene Completion (SSC) aims to reconstruct complete 3D scenes with precise voxel-wise semantics from the single-view incomplete input data, a crucial but highly challenging problem for scene understanding. Although SSC has seen significant progress due to the introduction of 2D semantic priors in recent years, the occluded parts, especially the rear-view of the scenes, are still poorly completed and segmented. To ameliorate this issue, we propose a novel deep learning framework for 3D SSC, named Planar Convolution and Attention-based Network (PCANet), to effectively extend high-precision predictions of the front-view surface to the rear-view occluded areas. Specifically, we decompose the traditional convolutional layer into three successive planar convolutions to form a Planar Convolution Residual (PCR) block, which maintains the planar features of the 3D scene. Afterward, the Planar Attention Module (PAM) is proposed to capture three different planar attentions and harvest the global context from the front surface to the rear occluded areas to improve the overall accuracy. Extensive experiments on the real NYU and NYUCAD datasets and the synthetic SUNCG-RGBD dataset demonstrate that our proposed framework can generate high-quality SSC results in both front and rear views and outperforms the state-of-the-art approaches trained in an end-to-end manner without additional data.

Abstract:
Reversible data hiding (RDH) based on joint photographic experts group (JPEG) images has been extensively studied to enhance embedding performance in terms of visual quality and file size preservation at the desired payload. In this paper, an efficient adaptive RDH method for JPEG images with multiple two-dimensional (2D) histogram modification is proposed. Firstly, the proposed method introduces a block smoothness estimator and a band smoothness estimator, and then combines the two estimators to reduce the embedding distortion as much as possible at the desired payload. Instead of adopting a fixed 2D mapping or choosing one from several empirically-designed mappings for each 2D histogram, the proposed method designs an adaptive 2D mapping generation strategy that generates a large number of mappings while considering the local characteristics of the histogram distribution. Since exhaustively searching for the optimal mapping that achieves the highest embedding performance for each 2D histogram is time-consuming, an improved discrete particle swarm optimization is utilized in the proposed method to speed up the optimization process. Extensive experimental results demonstrate the effectiveness of the proposed method in terms of visual quality and file size increment of the stego image.

Abstract:
Recent years have witnessed the great success of blind image quality assessment (BIQA) in various task-specific scenarios, which present invariable distortion types and evaluation criteria. However, due to the rigid structure and learning framework, they cannot apply to the cross-task BIQA scenario, where the distortion types and evaluation criteria keep changing in practical applications. This paper proposes a scalable incremental learning framework (SILF) that could sequentially conduct BIQA across multiple evaluation tasks with limited memory capacity. More specifically, we develop a dynamic parameter isolation strategy to sequentially update the task-specific parameter subsets, which are non-overlapped with each other. Each parameter subset is temporarily settled to Remember one evaluation preference toward its corresponding task, and the previously settled parameter subsets can be adaptively reused in the following BIQA to achieve better performance based on the task relevance. To suppress the unrestrained expansion of memory capacity in sequential tasks learning, we develop a scalable memory unit by gradually and selectively pruning unimportant neurons from previously settled parameter subsets, which enable us to Forget part of previous experiences and free the limited memory capacity for adapting to the emerging new tasks. Extensive experiments on eleven IQA datasets demonstrate that our proposed method significantly outperforms the other state-of-the-art methods in cross-task BIQA.

Abstract:
Video generation has achieved rapid progress benefiting from high-quality renderings provided by powerful image generators. We regard the video synthesis task as generating a sequence of images sharing the same contents but varying in motions. However, most previous video synthesis frameworks based on pre-trained image generators treat content and motion generation separately, leading to unrealistic generated videos. Therefore, we design a novel framework to build the motion space, aiming to achieve content consistency and fast convergence for video generation. We present MotionVideoGAN, a novel video generator synthesizing videos based on the motion space learned by pre-trained image pair generators. Firstly, we propose an image pair generator named MotionStyleGAN to generate image pairs sharing the same contents and producing various motions. Then we manage to acquire motion codes to edit one image in the generated image pairs and keep the other unchanged. The motion codes help us edit images within the motion space since the edited image shares the same contents with the other unchanged one in image pairs. Finally, we introduce a latent code generator to produce latent code sequences using motion codes for video generation. Our approach achieves state-of-the-art performance on the most complex video dataset ever used for unconditional video generation evaluation, UCF101.

Abstract:
Self-supervised learning methods for 3D skeleton-based action recognition via contrastive learning have achieved results competitive with classical supervised methods. Recent research shows that adding a Multilayer Perceptron (MLP) on top of the base encoder can extract high-level and global positive representations. Using a negative memory bank to store negative samples dynamically can balance ample storage and feature consistency. However, these methods need to consider that the MLP lacks accurate encoding of fine-grained local features, and that a memory bank needs rich and diverse negative sample pairs to match positive representations from different encoders. This paper proposes a new method called Cross Momentum Contrast (CrossMoCo), composed of three parts: an ST-GCN encoder, an ST-GCN encoder with an MLP encoder (ST-MLP encoder), and two independent negative memory banks. The two encoders encode the input data into two positive feature pairs. Learning the cross representations of the two positive pairs helps the model extract both global and local information. The two independent negative memory banks update the negative samples according to the different positive representations from the two encoders, diversifying the negative samples' distribution and making negative representations close to the positive features. The increased classification difficulty improves the model's contrastive learning ability. In addition, a spatiotemporal occlusion mask data augmentation method is used to enhance the information diversity of positive samples. This method takes the adjacent skeleton joints that can form a skeleton bone as a mask unit, which can reduce the information redundancy after data augmentation since adjacent joints may carry similar spatiotemporal information. Experiments on the PKU-MMD Part II dataset, the NTU RGB+D 60 dataset, and the NW-UCLA dataset show that the CrossMoCo framework with spatiotemporal occlusion mask data augmentation achieves comparable performance.

Abstract:
Long-term visual localization has to overcome the problem of matching images with dramatic photometric changes caused by different seasons, natural and man-made illumination changes, etc. Visual localization at night plays a vital role in many applications like autonomous driving and augmented reality, for which extracting keypoints and descriptors that are robust to day-night illumination changes has become the bottleneck. This paper proposes an adversarial learning based solution that exploits the weak domain labels of day and night images, along with the point-level correspondences among daytime images, to achieve robust local feature extraction and description across day-night images. The key idea is to learn a discriminator to distinguish whether a feature map is generated from a day or a night image, and simultaneously to adjust the parameters of the feature extraction network so as to fool the discriminator. After adversarial training of the discriminator and the feature extraction network, the feature extraction network reaches a stable state in which the extracted feature maps are robust to day-night photometric changes, based on which day-night domain invariant keypoints and descriptors can be extracted. Compared to existing local feature learning methods, it only requires an additional set of easily captured night images to improve the domain invariance of the learned features. Experiments on two challenging benchmarks show the effectiveness of the proposed method. In addition, this paper revisits the widely used image matching metrics on HPatches and finds that the recall of different methods is highly related to their relative localization performance.

Abstract:
For people who ardently love painting but unfortunately have visual impairments, holding a paintbrush to create a work is a very difficult task. People in this special group are eager to pick up the paintbrush, like Leonardo da Vinci, to create and make full use of their own talents. Therefore, to bridge this gap as far as possible, we propose a painting navigation system called “Angle’s Eyes” to assist blind people in artistic creation. The proposed system is composed of a cognitive system and a guidance system. The system adopts drawing board positioning based on QR codes, brush navigation based on target detection, and real-time brush positioning. Meanwhile, we design a simple yet efficient position information coding rule to remind the user of the current brush tip position. In addition, we design a criterion to efficiently judge whether the brush has reached the target. Numerous experiments are conducted to optimize and test the performance of the system. The results of real-world scenario experiments demonstrate that the developed system has great potential to help blind people with painting. This work also demonstrates that it is practicable for blind people to feel the world through the brush in their hands. In the future, we plan to deploy “Angle’s Eyes” on the phone to make it more portable. The demo video of the proposed painting navigation system is available at https://doi.org/10.6084/m9.figshare.9760004.v1.

Abstract:
Video traffic has increased exponentially in recent years due to the growing ubiquity of mobile equipment and constant network improvement. Most commercial players employ adaptive bitrate (ABR) algorithms to dynamically choose the bitrate for each chunk based on perceived network capacity and buffer occupancy. Unfortunately, even though improving the quality of chunks with dynamic scenes can achieve more QoE gain than doing so for static scenes, current ABR algorithms usually strive to maximize the average bitrate instead of perceptual quality, leading to QoE degradation. To overcome this obstacle, we introduce a dynamic-chunk quality-aware adaptive bitrate algorithm based on apprenticeship learning, called DAVS (Dynamic-chunk quality Aware Video Streaming), in which higher quality is selected for the dynamic chunks without excessively reducing the quality of static chunks. Furthermore, we take the user’s viewing preference into account to make DAVS adapt to QoE diversity. The experimental results demonstrate that DAVS improves the quality of dynamic chunks and significantly enhances the QoE compared with several representative ABR algorithms.

Abstract:
The heavy computation of deep neural networks (DNNs) limits their practical deployment on resource-constrained mobile multimedia devices. To deploy DNNs on devices with limited computing resources, model compression techniques are leveraged to accelerate the networks, where network pruning can improve the inference efficiency of DNNs by removing redundant weights and structures. As one of the important components of DNNs, the feature maps (FMs) can be leveraged to evaluate the importance of network structures for DNN pruning. However, previous methods do not fully explore the characteristics of FMs in network pruning. In this paper, we draw on the high-capacity and resource-efficient analogy-ventral dual-pathway primate visual system (PVS) to propose a hierarchical pruning framework (dubbed HPSE). In an efficient PVS, the analogy pathway analyzes low-frequency information to facilitate the inference of high-frequency information in the ventral stream. In HPSE, we extract the low-frequency shape information and high-frequency edge information from FMs to present a novel pruning pipeline that resembles the analysis mechanism of the PVS. In particular, we first imitate the analogy pathway to group different FMs in each layer by calculating the shape-feature overlap. Secondly, we leverage the edge information modulated by the grouping results of the first step to prune the network. The effectiveness of HPSE is verified by pruning various DNNs on different benchmarks. For example, for ResNet-56 on CIFAR-10, HPSE reduces FLOPs by 52.9% with a slight accuracy improvement; for ResNet-50 on ImageNet, we achieve a 54.3% FLOPs reduction with only 0.49% Top-1 accuracy loss.

Abstract:
Moving object detection is critical for automated video analysis in many vision-related tasks, such as surveillance tracking and video compression coding. Robust Principal Component Analysis (RPCA), as one of the most popular moving object modelling methods, aims to separate the temporally-varying (i.e., moving) foreground objects from the static background in video, assuming the background frames to be low-rank and the foreground to be spatially sparse. Classic RPCA imposes sparsity of the foreground component using the $\ell_1$-norm, and minimizes the modeling error via the $\ell_2$-norm. We show that such assumptions can be too restrictive in practice, which limits the effectiveness of classic RPCA, especially when processing videos with dynamic backgrounds, camera jitter, camouflaged moving objects, etc. In this paper, we propose a novel RPCA-based model, called Hyper RPCA, to detect moving objects on the fly. Different from classic RPCA, the proposed Hyper RPCA jointly applies the maximum correntropy criterion (MCC) for the modeling error, and a Laplacian scale mixture (LSM) model for foreground objects. Extensive experiments have been conducted, and the results demonstrate that the proposed Hyper RPCA performs competitively with state-of-the-art algorithms for foreground detection on several well-known benchmark datasets.
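
For reference, the classic RPCA model described above can be written in its commonly used relaxed form as follows. This is a sketch in standard notation rather than the exact objective of Hyper RPCA, which the abstract does not give; the symbols $D$ (matrix whose columns are vectorized frames), $L$ (low-rank background), $S$ (sparse foreground), $\lambda$, $\mu$, and the kernel width $\sigma$ are assumptions introduced here.

```latex
% Classic RPCA with a nuclear-norm background prior, an l1 foreground
% prior, and a squared l2 (Frobenius) penalty on the modeling error:
\min_{L,\,S}\; \|L\|_{*} \;+\; \lambda \|S\|_{1}
  \;+\; \frac{\mu}{2}\, \|D - L - S\|_{F}^{2}

% Empirical correntropy of the residual E = D - L - S under a Gaussian
% kernel, which the maximum correntropy criterion (MCC) maximizes in
% place of minimizing the squared error:
\hat{V}_{\sigma}(E) \;=\; \frac{1}{N} \sum_{i=1}^{N}
  \exp\!\left( -\frac{e_{i}^{2}}{2\sigma^{2}} \right)
```

Because the Gaussian kernel saturates for large residuals $e_i$, outlying entries contribute less to the correntropy objective than they would to a squared-error term, which is the usual motivation for MCC under non-Gaussian noise.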

Abstract:
Deep convolutional neural networks have achieved fairly high accuracy for single online handwritten Chinese character recognition (SOLHCCR). However, in real application scenarios, users usually write multiple characters to form a complete sentence, and the preceding contextual information holds significant potential for improving the accuracy, robustness and efficiency of recognition. In this work, we first propose a simple and straightforward model named the vanilla compositional network (VCN) by coupling a convolutional neural network with a sequence modeling architecture (i.e., a recurrent neural network or Transformer), which exploits the handwritten character’s previous contextual information. Although VCN performs much better than the previous state-of-the-art SOLHCCR models, it is a two-stage architecture in nature. Because it relies heavily on contextual information, it is fragile when confronted with poorly written characters, such as sloppy writing and missing or broken strokes. To improve the robustness of the OLHCCR model, we further propose a novel deep spatial & contextual information fusion network (DSCIFN). It utilizes an autoregressive framework pre-trained on a large-scale sentence corpus as the backbone component, and tightly integrates the spatial features of handwritten characters and their previous contextual information in a multi-layer fusion module. To verify the effectiveness of the models, we reorganize a new dataset of online handwritten Chinese characters with their previous context, named OHCCC. Extensive experimental results demonstrate that DSCIFN achieves state-of-the-art performance and is substantially more robust than VCN and previous SOLHCCR models. The in-depth empirical analysis and case study indicate that DSCIFN can significantly improve the efficiency of handwriting input because it does not need complete strokes to recognize a handwritten Chinese character precisely.

Abstract:
Image dehazing is an important task since it is the prerequisite for many downstream high-level computer vision tasks. Previous dehazing methods depend on either hand-designed priors/assumptions or supervised learning with plenty of data, which are not easy to implement in practice. Meanwhile, synthesizing hazy images is also significant in many scenarios, such as multi-weather image generation. In this paper, we recast this task as image translation and develop a weakly supervised framework to achieve it. Instead of simply considering the hazy image as the source domain and the haze-free image as the target domain for translation, we design a feature representation scheme that generates a domain indicator, and embed it into the decoder to achieve both hazing and dehazing within one network. This design significantly reduces the complexity of the network and can be more easily extended to multi-domain translation tasks than the previous methods, which need one generator-discriminator pair for each direction of the translation. Meanwhile, to address the haze-relevant task, we design a haze attention module, which takes the local entropy map as the input. Unlike the previous weakly supervised dehazing methods, our approach requires only unpaired hazy and haze-free images, without any intermediate supervising data such as the transmission map or atmospheric light defined in the atmospheric scattering model. Experimental results on synthetic datasets show that our method achieves competitive results compared with the state-of-the-art methods and yields more appealing dehazing and hazing results on real-world images.
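
As an illustration of the input to the haze attention module, the sketch below computes a local entropy map as the Shannon entropy of the grey-level histogram in a small window around each pixel. The window shape, the `radius` parameter, and the function name `local_entropy` are assumptions; the abstract does not say how the paper computes or normalizes the map.

```python
import math
from collections import Counter
from typing import List, Sequence


def local_entropy(image: Sequence[Sequence[int]], radius: int = 1) -> List[List[float]]:
    """Shannon entropy (bits) of grey levels in a square window per pixel.

    `image` is a 2D grid of integer grey levels. Windows are clipped at
    the image border rather than padded (an assumption of this sketch).
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            n = len(window)
            # Entropy of the empirical grey-level distribution in the window.
            out[y][x] = sum(
                -(c / n) * math.log2(c / n) for c in Counter(window).values()
            )
    return out


if __name__ == "__main__":
    flat = [[128] * 4 for _ in range(4)]      # uniform region
    textured = [[0, 255], [255, 0]]           # two equally frequent levels
    print(local_entropy(flat)[0][0], local_entropy(textured)[0][0])
```

Uniform regions give zero entropy while textured regions give higher values, so the map offers a per-pixel cue about local detail that an attention module could weight.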

Abstract:
Event cameras are bio-inspired cameras that can measure the intensity change asynchronously with high temporal resolution. One of the advantages of event cameras is that they suffer less from motion blur than traditional frame cameras when recording daily scenes with fast-moving objects. In this paper, we formulate the deblurring task on traditional cameras directed by events to be a residual learning one, and propose corresponding network architectures for effective learning of deblurring and high frame rate video generation tasks. We first train a modified U-Net network to restore a sharp image from a blurry image using the corresponding events. Then we train another similar network by replacing the downsampling blocks with blocks of the convolutional long short-term memory (Conv-LSTM) to recurrently generate high frame rate video using the restored sharp image and part of the events. Benefitting from the blur-free events and the proposed learning strategy, the experimental results show that the proposed method outperforms state-of-the-art methods for generating sharp images and high frame rate videos.

Abstract:
Subjective responses from Multimedia Quality Assessment (MQA) experiments are conventionally analyzed with methods not suitable for the data type these responses represent. Furthermore, obtaining subjective responses is resource intensive. Thus, a method that allows the reuse of existing responses would be beneficial. Applying improper data analysis methods leads to difficulty in interpreting results. This increases the probability of drawing erroneous conclusions. Building upon existing subjective responses is resource friendly and helps develop machine learning (ML) based visual quality predictors. In this work, we show that using a discrete model for analyzing responses from MQA subjective experiments is feasible. We indicate that our proposed Generalized Score Distribution (GSD) properly describes response distributions observed in typical MQA experiments. We also highlight interpretability of GSD parameters and indicate that the GSD outperforms the approach based on sample empirical distribution when it comes to bootstrapping. Furthermore, we provide evidence that the GSD outcompetes the state-of-the-art model both in terms of goodness-of-fit and bootstrapping capabilities. To accomplish the aforementioned objectives, we analyze more than one million subjective responses from over 30 subjective experiments.

Abstract:
Acoustic Scene Classification (ASC) aims to recognize the unique acoustic characteristics of an environment. Recently, Convolutional Neural Networks (CNNs) have boosted the accuracy of ASC algorithms. However, the focus of ASC system designers has shifted from improving accuracy to incorporating real-world considerations like device robustness and model complexity. In this paper, we address the problem of developing a low complexity system for ASC which can generalize across multiple recording devices. We propose to employ residual quaternion CNNs for low complexity, device-robust ASC. The proposed model, RQNet, uses quaternion encoding to increase the accuracy with fewer parameters. To further enhance the performance of RQNet, we employ a variant of the log-mel spectrogram called the multi-scale mel spectrogram (ms2) to represent the acoustic signal. Experiments on two benchmark ASC datasets indicate that RQNet outperforms a log-mel spectrum-based baseline by more than twofold. In addition, it has a good measure of separability between the individual classes, as indicated by AUC (Area Under the ROC Curve) scores of 0.906 and 0.994. Furthermore, it reduces the model size by 82.19% and floating-point operations by 23.25%. Consequently, RQNet is suitable for deployment in context-aware devices.
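
The parameter saving of quaternion networks generally comes from replacing real-valued multiplication with the Hamilton product, in which one quaternion weight (four real numbers) mixes all four components of a quaternion input. The sketch below shows only this product; how RQNet arranges it into residual convolutional layers is not stated in the abstract, and the names used here are illustrative.

```python
from typing import Tuple

# A quaternion a + b*i + c*j + d*k stored as (a, b, c, d).
Quaternion = Tuple[float, float, float, float]


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Return the Hamilton product p * q (non-commutative)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


if __name__ == "__main__":
    i = (0.0, 1.0, 0.0, 0.0)
    j = (0.0, 0.0, 1.0, 0.0)
    print(hamilton_product(i, j))  # i * j equals k
    print(hamilton_product(j, i))  # j * i equals -k
```

A real-valued layer mapping four input channels to four output channels needs sixteen independent weights, whereas the Hamilton product reuses four, which is where the roughly fourfold reduction per layer in quaternion models comes from.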

Abstract:
Although recent emotion recognition methods (based on facial expression cues) achieve excellent performance in controlled scenarios, the recognition of emotion in the wild remains a challenging problem because of occlusion, large head poses, illumination variations, etc. Recent advances in deep learning show that combining an ensemble of deep learning models can considerably outperform the approach of using only a single deep learning model for challenging recognition problems. This paper presents a novel ensemble deep learning method, “deep convolutional neural network (DCNN) ensemble classifier”, for improved facial expression recognition (FER) in the wild. Our proposed DCNN ensemble classifier is novel in terms of the following aspects: (1) the process of finding ensemble weights for combining DCNN decision outputs is formulated as a stochastic optimization problem (via simulated annealing) in which the energy to be minimized represents the generalized (test) classification error of the DCNN ensemble and (2) for the creation of DCNN ensemble members, we propose the combined use of different types of face representations and bagging (T. G. Dietterich, 2000), which is quite useful in increasing the diversity of the DCNN ensemble. Extensive and comparative experiments on three wild FER datasets, namely FER2013, SFEW2.0, and RAF-DB, show that the proposed DCNN ensemble classifier achieves competitive FER performance when compared with other recently developed methods: 76.69%, 58.68%, and 87.13% FER accuracy under the FER2013, SFEW2.0, and RAF-DB evaluation protocols, respectively.
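
The weight search in aspect (1) can be sketched as a standard simulated annealing loop in which the energy is the classification error of the weighted ensemble on held-out data. The proposal distribution, cooling schedule, and all names below (`ensemble_error`, `anneal_weights`, `steps`, `t0`, `cooling`) are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random
from typing import List, Sequence, Tuple

# probs[m][i][c]: probability that model m assigns to class c for sample i.
Probs = Sequence[Sequence[Sequence[float]]]


def ensemble_error(weights: Sequence[float], probs: Probs, labels: Sequence[int]) -> float:
    """Fraction of samples misclassified by the weighted-sum ensemble."""
    errors = 0
    n_classes = len(probs[0][0])
    for i, y in enumerate(labels):
        scores = [
            sum(w * probs[m][i][c] for m, w in enumerate(weights))
            for c in range(n_classes)
        ]
        if max(range(n_classes), key=scores.__getitem__) != y:
            errors += 1
    return errors / len(labels)


def anneal_weights(
    probs: Probs,
    labels: Sequence[int],
    steps: int = 2000,
    t0: float = 1.0,
    cooling: float = 0.995,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Search non-negative weights summing to 1 that minimize the error."""
    rng = random.Random(seed)
    n = len(probs)
    current = [1.0 / n] * n
    cur_e = ensemble_error(current, probs, labels)
    best, best_e = current[:], cur_e
    t = t0
    for _ in range(steps):
        # Propose a Gaussian perturbation, clip at zero, and renormalize.
        cand = [max(0.0, w + rng.gauss(0.0, 0.1)) for w in current]
        total = sum(cand)
        if total == 0.0:
            continue
        cand = [w / total for w in cand]
        e = ensemble_error(cand, probs, labels)
        # Metropolis rule: always accept improvements, sometimes accept
        # worse candidates, less often as the temperature drops.
        if e <= cur_e or rng.random() < math.exp((cur_e - e) / t):
            current, cur_e = cand, e
            if e < best_e:
                best, best_e = cand[:], e
        t *= cooling
    return best, best_e
```

In practice the error passed to the annealer would be measured on a validation split kept separate from the final test data, so that the reported accuracy is not optimized directly.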

Abstract:
To reduce the embedding distortion and improve the tampering location precision of reversible watermarking for authenticating three-dimensional (3D) models, a semi-fragile reversible watermarking scheme based on virtual polygon projection and a double modulation strategy is proposed. During embedding, it first constructs virtual adjacent vertices for each vertex and obtains a corresponding virtual polygon, and then a watermark is generated according to the projection value of the current vertex on the corresponding polygon. For each vertex, double modulation is used to move the vertex to realize watermark embedding. For verification, it first obtains the vertex position and extracts the watermark, and then regenerates a watermark according to the restored vertex. If the extracted watermark is consistent with the regenerated one, the vertex has not been tampered with, and the 3D model can be losslessly recovered; otherwise, the vertex has been tampered with. Experimental results and analysis show that the proposed scheme outperforms the existing methods in embedding distortion and tampering location precision. It has potential application in the integrity authentication of 3D models.

Abstract:
Occluded pedestrian detection is very challenging in computer vision, because pedestrians are frequently occluded by various obstacles or other persons, especially in crowded scenarios. In this article, an occluded pedestrian detection method is proposed under a basic DEtection TRansformer (DETR) framework. Firstly, Dynamic Deformable Convolution (DyDC) and Gaussian Projection Channel Attention (GPCA) mechanisms are proposed and embedded into the low layer and high layer of ResNet50, respectively, to improve the representation capability of features. Secondly, a Cascade Transformer Decoder (CTD) is proposed, which aims to generate high-score queries and avoid the influence of low-score queries in the decoder stage, further improving the detection accuracy. The proposed method is verified on three challenging datasets, namely CrowdHuman, WiderPerson, and TJU-DHD-pedestrian. The experimental results show that, compared with the state-of-the-art methods, it obtains superior detection performance.

Abstract:
The traditional fashion industry is heavily dependent on designers whose talent and vision have a significant impact on their innovative designs. Through taking advantage of recent advances in image-to-image translation by generative adversarial networks (GANs), marked improvement in designers’ efficiency is now possible. Considering both randomness and controllability in the design process, this article presents a novel artificial intelligence (AI)-based framework for fashion design. Under this framework, a sketch-generation module which is based on latent space is firstly introduced for designing various sketches. Secondly, a rendering-generation module is proposed to learn mapping between textures and sketches to complete the task of fashion design. In order to achieve effectiveness in synthesizing semantic-aware textures on sketches, a multi-conditional feature interaction module is developed in the rendering-generation model. Moreover, two different training schemes are introduced to optimize both the sketch-generation module and the rendering-generation module. In order to evaluate the performance of our proposed models, we built a large-scale dataset which consists of 115,584 pairs of fashion item images. Experimental results demonstrate the effectiveness of our proposed method, and indicate that our model can facilitate designers’ design process by taking full advantage of the controllability of different conditions (e.g., sketch and texture) and the randomness of latent space.

Abstract:
This paper presents a new user experience for online apartment search using functionality and comfort as query items. Specifically, it has three technical contributions. First, we present a new dataset on the perceived functionality and comfort scores of residential floor plans using nine question statements about the level of comfort, openness, privacy, etc. Second, we propose an algorithm to predict the scores from the floor plan images. Lastly, we implement a new apartment search system and conduct a large-scale usability study using crowdsourcing. The experimental results show that our apartment search system can provide a better user experience. To the best of our knowledge, this is the first work to propose a highly accurate machine learning model for predicting the subjective functionality and comfort of apartments.

Abstract:
In general, manipulated videos will eventually undergo recompression. Video transcoding occurs when the recompression standard differs from the original standard. Therefore, as a special sign of recompression, video transcoding can also be considered evidence of forgery in video forensics. In this paper, we focus on the detection and localization of video transcoding from AVC to HEVC (AVC-HEVC). There are two probable cases of AVC-HEVC transcoding: whole video transcoding and partial frame transcoding. However, the existing forensic methods only consider the detection of whole video transcoding, and they do not consider partial frame transcoding localization. In view of this, we propose a framewise scheme based on a convolutional neural network. First, we show that the essential difference between AVC-HEVC and HEVC is reflected in the high-frequency components of decoded frames. Then, the partition and location information of prediction units (PUs) is introduced to generate frame-level PU maps to make full use of the local artifacts of PUs. Finally, taking the decoded frames and PU maps as inputs, a dual-path network including specific convolutional modules and an adaptive fusion module is proposed. Through it, the artifacts on a single frame can be better extracted, and the transcoded frames can be detected and localized. Coupled with a simple voting strategy, the results of whole transcoding detection can be easily obtained. A large number of experiments are conducted to verify the performance. The results show that the proposed scheme outperforms or rivals the state-of-the-art methods in AVC-HEVC transcoding detection and localization.
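
The step from framewise predictions to a video-level decision and to localized segments can be sketched as follows. The abstract does not define the voting rule, so the majority threshold of 0.5 and the function names `video_is_transcoded` and `transcoded_segments` are assumptions.

```python
from typing import List, Sequence, Tuple


def video_is_transcoded(frame_flags: Sequence[bool], threshold: float = 0.5) -> bool:
    """Vote over per-frame decisions (assumed rule: fraction above threshold)."""
    if not frame_flags:
        raise ValueError("frame_flags must not be empty")
    return sum(frame_flags) / len(frame_flags) > threshold


def transcoded_segments(frame_flags: Sequence[bool]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index ranges of consecutive flagged frames."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, flag in enumerate(frame_flags):
        if flag and start is None:
            start = i                        # a flagged run begins
        elif not flag and start is not None:
            segments.append((start, i - 1))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(frame_flags) - 1))
    return segments


if __name__ == "__main__":
    flags = [False, True, True, False, True]
    print(video_is_transcoded(flags), transcoded_segments(flags))
```

The first function covers whole video transcoding detection, while the second reports where partial frame transcoding occurs, using the same per-frame outputs for both.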