ACMMM2020

Abstract:
The image inpainting task requires filling the corrupted image with content coherent with the context. This research field has achieved promising progress through neural image inpainting methods. Nevertheless, guessing the missing content from the context pixels alone remains a critical challenge. The goal of this paper is to fill in the semantic information in corrupted images according to the provided descriptive text. Unlike existing text-guided image generation works, inpainting models are required to compare the semantic content of the given text with the remaining part of the image, and then determine the semantic content that should be filled in for the missing part. To fulfill such a task, we propose a novel inpainting model named Text-Guided Dual Attention Inpainting Network (TDANet). Firstly, a dual multimodal attention mechanism is designed to extract the explicit semantic information about the corrupted regions, which is done by comparing the descriptive text and complementary image areas through reciprocal attention. Secondly, an image-text matching loss is applied to maximize the semantic similarity of the generated image and the text. Experiments are conducted on two open datasets. Results show that the proposed TDANet model reaches a new state of the art on both quantitative and qualitative measures. Result analysis suggests that the generated images are consistent with the guidance text, enabling the generation of various results by providing different descriptions. Code is available at https://github.com/idealwhite/TDANet
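The reciprocal attention between descriptive text and image areas mentioned above can be illustrated with a minimal sketch. This is not the TDANet implementation: the function names, the plain dot-product similarity, and the toy feature vectors are all assumptions made for illustration. It only shows the general idea of attending from words to regions and from regions to words over one shared similarity matrix.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def reciprocal_attention(words, regions):
    """Return (word_to_region, region_to_word) attention weights.

    words and regions are lists of feature vectors that are assumed
    (hypothetically) to live in the same embedding space.
    """
    # Similarity of every word to every visible image region.
    sim = [[dot(w, r) for r in regions] for w in words]
    # Each word distributes its attention over the regions.
    word_to_region = [softmax(row) for row in sim]
    # Each region distributes its attention over the words.
    region_to_word = [
        softmax([sim[i][j] for i in range(len(words))])
        for j in range(len(regions))
    ]
    return word_to_region, region_to_word
```

Words that receive little attention from any visible region are, under this toy view, candidates for describing the missing area; how the paper actually uses the attention maps is not specified here.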

Abstract:
Inspired by the human ability to recognize the relations between visual scenes and sounds, many cross-modal learning methods have been developed for modeling images or videos and associated sounds. In this work, for the first time, a Look, Listen and Infer Network (LLINet) is proposed to learn a zero-shot model that can infer the relations of visual scenes and sounds from novel categories that have never appeared before. LLINet is mainly designed for two tasks, i.e., image-audio cross-modal retrieval and sound localization in each image. To this end, it is designed as a two-branch encoding network that builds a common space for images and audio. In addition, a cross-modal attention mechanism is proposed in LLINet to localize sound objects. To evaluate LLINet, a new dataset, named INSTRUMENT-32CLASS, is collected in this work. Besides zero-shot cross-modal retrieval and sound localization, a zero-shot image recognition task based on sounds is also conducted on this dataset. All experimental results on these tasks demonstrate the effectiveness of LLINet, indicating that zero-shot learning for visual scenes and sounds is feasible. The project page for LLINet is available at https://llinet.github.io/.

Abstract:
Learning spatiotemporal features is very effective but challenging for video understanding, especially action recognition. In this paper, we propose Multi-Group Multi-Attention, dubbed MGMA, which pays more attention to "where and when" the action happens, for learning discriminative spatiotemporal representations in videos. The contribution of MGMA is three-fold: First, by devising a new spatiotemporal separable attention mechanism, it can learn temporal attention and spatial attention separately for fine-grained spatiotemporal representation. Second, through a novel multi-group structure, it can better capture multi-attention rendered spatiotemporal features. Finally, our MGMA module is lightweight and flexible yet effective, so it can be easily embedded into any 3D Convolutional Neural Network (3D-CNN) architecture. We embed multiple MGMA modules into a 3D-CNN to train an end-to-end, RGB-only model and evaluate it on four popular benchmarks: UCF101, HMDB51, and Something-Something V1 and V2. Ablation studies and experimental comparisons demonstrate the strength of our MGMA, which achieves superior performance compared to state-of-the-art methods. Our code is available at https://github.com/zhenglab/mgma.

Abstract:
While machine learning approaches to visual recognition offer great promise, most of the existing methods rely heavily on the availability of large quantities of labeled training data. However, in the vast majority of real-world settings, manually collecting such large labeled datasets is infeasible due to the cost of labeling data or the paucity of data in a given domain. In this paper, we present a novel Adversarial Knowledge Transfer (AKT) framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier on a given visual recognition task. The proposed adversarial learning framework aligns the feature space of the unlabeled source data with the labeled target data such that the target classifier can be used to predict pseudo labels on the source data. An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task, unlike some existing approaches. Extensive experiments well demonstrate that models learned using our approach hold a lot of promise across a variety of visual recognition tasks on multiple standard datasets. Project page is at https://agupt013.github.io/akt.html.

Abstract:
Videos have become preferred over images in recent years. However, during video recording, the cameras are inevitably occluded by objects or persons passing in front of them, which greatly increases the workload of video editors who must search for such occlusions. In this paper, to relieve the burden on video editors, a frame-level video occlusion detection method is proposed, which is a fundamental component of automatic video editing. The proposed method enhances the extraction of spatial-temporal information based on C3D while using only around half the number of parameters, and includes an occlusion correction algorithm for correcting the prediction results. In addition, a novel loss function is proposed to better extract the characteristics of occlusion and improve the detection performance. For performance evaluation, this paper builds a new large-scale dataset containing 1,000 video segments from seven different real-world scenarios, which is available at: https://junhua-liao.github.io/Occlusion-Detection/. All occlusions in the video segments are annotated frame by frame with bounding boxes so that the dataset can be used for both frame-level occlusion detection and precise occlusion localization. The experimental results show that the proposed method achieves good performance on video occlusion detection compared with state-of-the-art approaches. To the best of our knowledge, this is the first study that focuses on occlusion detection for automatic video editing.

Abstract:
Increasing the visibility of nighttime hazy images is challenging because of uneven illumination from active artificial light sources and haze absorbing/scattering. The absence of large-scale benchmark datasets hampers progress in this area. To address this issue, we propose a novel synthetic method called 3R to simulate nighttime hazy images from daytime clear images, which first reconstructs the scene geometry, then simulates the light rays and object reflectance, and finally renders the haze effects. Based on it, we generate realistic nighttime hazy images by sampling real-world light colors from a prior empirical distribution. Experiments on the synthetic benchmark show that the degrading factors jointly reduce the image quality. To address this issue, we propose an optimal-scale maximum reflectance prior to disentangle the color correction from haze removal and address them sequentially. Besides, we also devise a simple but effective learning-based baseline which has an encoder-decoder structure based on the MobileNet-v2 backbone. Experiment results demonstrate their superiority over state-of-the-art methods in terms of both image quality and runtime. Both the dataset and source code will be available at https://github.com/chaimi2013/3R.

Abstract:
Human free-hand sketches have been studied in various fields including sketch recognition, synthesis and sketch-based image retrieval. We propose a new challenging task sketch enhancement (SE) defined in an ill-posed space, i.e. enhancing a non-professional sketch (NPS) to a professional sketch (PS), which is a creative generation task different from sketch abstraction, sketch completion and sketch variation. For the first time we release a database of NPS with PS for anime characters. We cast sketch enhancement as an image-to-image translation problem by exploiting the relationship to corresponding intensive or sparse pixel domains for sketch domain. Specifically, we explore three different routines based on conditional generative adversarial network (cGAN), i.e. Sketch-Sketch (SS), Sketch-Colorization-Sketch (SCS) and Sketch-Abstraction-Sketch (SAS). SS is a one-stage model that directly maps NPS to PS, while SCS and SAS are two-stage models where auxiliary inputs, grayscale parsing and shape parsing, are involved. Multiple metrics are used to evaluate the performance of the models in both the sketch domain and other low-level feature domains. With quantitative and qualitative analysis of the experiments, we have established solid baselines, which, we hope, could encourage more research conducted on this task. Our dataset is publicly available via https://github.com/LCXCUC/SketchMan2020.

Abstract:
We present a self-supervised approach with pose perceptual loss for automatic dance video generation. Our method can produce a realistic dance video that conforms to the beats and rhymes of given music. To achieve this, we firstly generate a human skeleton sequence from music and then apply the learned pose-to-appearance mapping to generate the final video. In the stage of generating skeleton sequences, we utilize two discriminators to capture different aspects of the sequence and propose a novel pose perceptual loss to produce natural dances. Besides, we also provide a new cross-modal evaluation metric to evaluate the dance quality, which is able to estimate the similarity between two modalities (music and dance). Finally, our experimental qualitative and quantitative results demonstrate that our dance video synthesis approach produces realistic and diverse results. Our source code and data are available at https://github.com/xrenaa/Music-Dance-Video-Synthesis.

Abstract:
In this paper, we investigate a novel and challenging task, namely controllable video captioning with an exemplar sentence. Formally, given a video and a syntactically valid exemplar sentence, the task aims to generate one caption which not only describes the semantic contents of the video, but also follows the syntactic form of the given exemplar sentence. In order to tackle such an exemplar-based video captioning task, we propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture. The proposed SMCG takes video semantic representation as an input, and conditionally modulates the gates and cells of long short-term memory network with respect to the encoded syntactic information of the given exemplar sentence. Therefore, SMCG is able to control the states for word prediction and achieve the syntax customized caption generation. We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets. Extensive experimental results demonstrate the effectiveness of our approach on generating syntax controllable and semantic preserved video captions. By providing different exemplar sentences, our approach is capable of producing different captions with various syntactic structures, thus indicating a promising way to strengthen the diversity of video captioning. Code for this paper is available at https://github.com/yytzsy/SMCG.

Abstract:
3D point clouds are often perturbed by noise due to the inherent limitations of acquisition equipment, which obstructs downstream tasks such as surface reconstruction and rendering. Previous works mostly infer the displacement of noisy points from the underlying surface, but they are not designed to recover the surface explicitly and may lead to sub-optimal denoising results. To this end, we propose to learn the underlying manifold of a noisy point cloud from differentiably subsampled points with trivial noise perturbation and their embedded neighborhood feature, aiming to capture intrinsic structures in point clouds. Specifically, we present an autoencoder-like neural network. The encoder learns both local and non-local feature representations of each point, and then samples points with low noise via an adaptive differentiable pooling operation. Afterwards, the decoder infers the underlying manifold by transforming each sampled point along with the embedded feature of its neighborhood to a local surface centered around the point. By resampling on the reconstructed manifold, we obtain a denoised point cloud. Further, we design an unsupervised training loss, so that our network can be trained in either an unsupervised or supervised fashion. Experiments show that our method significantly outperforms state-of-the-art denoising methods under both synthetic noise and real-world noise. The code and data are available at https://github.com/luost26/DMRDenoise

Abstract:
Semantic segmentation requires a large amount of densely annotated data for training and may generalize poorly to novel categories. In real-world applications, we have an urgent need for few-shot semantic segmentation which aims to empower a model to handle unseen object categories with limited data. This task is non-trivial due to several challenges. First, it is difficult to extract the class-relevant information to handle the novel class as only a few samples are available. Second, since the image content can be very complex, the novel class information may be suppressed by the base categories due to limited data. Third, one may easily learn promising base classifiers based on a large amount of training data, but it is non-trivial to exploit the knowledge to train the novel classifiers. More critically, once a novel classifier is built, the output probability space will change. How to maintain the base classifiers and dynamically include the novel classifiers remains an open question. To address the above issues, we propose a Dynamic Extension Network (DENet) in which we dynamically construct and maintain a classifier for the novel class by leveraging the knowledge from the base classes and the information from novel data. More importantly, to overcome the information suppression issue, we design a Guided Attention Module (GAM), which can be plugged into any framework to help learn class-relevant features. Last, rather than directly train the model with limited data, we propose a dynamic extension training algorithm to predict the weights of novel classifiers, which is able to exploit the knowledge of base classifiers by dynamically extending classes during training. The extensive experiments show that our proposed method achieves state-of-the-art performance on the PASCAL-5i and COCO-20i datasets. The source code is available at https://github.com/lizhaoliu-Lec/DENet.

Abstract:
Although pedestrian detection has made significant progress with the help of deep convolutional neural networks, detecting occluded pedestrians remains a challenging problem, since occluded pedestrians cannot provide sufficient information for classification and regression. In this paper, we propose a novel Hierarchical Graph Pedestrian Detector (HGPD), which integrates semantic and spatial relation information to construct two graphs, named the intra-proposal graph and the inter-proposal graph, without relying on extra cues with respect to visible regions. In order to capture the occlusion patterns and enhance features from visible regions, the intra-proposal graph considers body parts as nodes and assigns corresponding edge weights based on semantic relations between body parts. On the other hand, the inter-proposal graph adopts spatial relations between neighbouring proposals to provide additional proposal-wise context information for each proposal, which alleviates the lack of information caused by occlusion. We conduct extensive experiments on the standard benchmarks CityPersons and Caltech to demonstrate the effectiveness of our method. On CityPersons, our approach outperforms the baseline method by a large margin of 5.24pp on the heavy occlusion set, and surpasses all previous methods; on Caltech, we establish a new state of the art of 3.78% MR. Code is available at https://github.com/ligang-cs/PedestrianDetection-HGPD.

Abstract:
Phrase grounding aims to localize the objects described by phrases in a natural language specification. Previous works model the interaction of inputs from the text modality and the visual modality only at the intra-modal global level and consequently lack the ability to capture precise and complete context information. In this paper, we propose a novel Cross-Modal Omni Interaction network (COI Net) composed of a neighboring interaction module, a global interaction module, a cross-modal interaction module and a multilevel alignment module. Our approach formulates the complex spatial and semantic relationship among image regions and phrases through multi-level multi-modal interaction. We capture the local relationship using the interaction among neighboring regions and then collect the global context through the interaction among all regions using a transformer encoder. We further use a co-attention module to apply the interaction between the two modalities to gather the cross-modal context for all image regions and phrases. In addition to the omni interaction modeling, we also leverage a straightforward yet effective multilevel alignment regularization to formulate the dependencies among all grounding decisions. We extensively validate the effectiveness of our model. Experiments show that our approach outperforms existing state-of-the-art methods by large margins on two popular datasets in terms of accuracy: 6.15% on Flickr30K Entities (71.36% increased to 77.51%) and 21.25% on ReferItGame (44.91% increased to 66.16%). The code of our implementation is available at https://github.com/yiranyyu/Phrase-Grounding.
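The margins quoted above are absolute differences in accuracy (percentage points), not relative improvements. The short check below reproduces them from the before/after figures given in the abstract; the helper name absolute_gain is hypothetical.

```python
def absolute_gain(before, after):
    """Absolute accuracy gain in percentage points, rounded to two decimals."""
    return round(after - before, 2)


# Accuracy figures (in percent) as quoted in the abstract.
flickr_gain = absolute_gain(71.36, 77.51)    # Flickr30K Entities
referit_gain = absolute_gain(44.91, 66.16)   # ReferItGame
```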

Abstract:
Medical visual question answering (Med-VQA) aims to accurately answer a clinical question presented with a medical image. Despite its enormous potential in healthcare industry and services, the technology is still in its infancy and is far from practical use. Med-VQA tasks are highly challenging due to the massive diversity of clinical questions and the disparity of required visual reasoning skills for different types of questions. In this paper, we propose a novel conditional reasoning framework for Med-VQA, aiming to automatically learn effective reasoning skills for various Med-VQA tasks. Particularly, we develop a question-conditioned reasoning module to guide the importance selection over multimodal fusion features. Considering the different nature of closed-ended and open-ended Med-VQA tasks, we further propose a type-conditioned reasoning module to learn a different set of reasoning skills for the two types of tasks separately. Our conditional reasoning framework can be easily applied to existing Med-VQA systems to bring performance gains. In the experiments, we build our system on top of a recent state-of-the-art Med-VQA model and evaluate it on the VQA-RAD benchmark [23]. Remarkably, our system achieves significantly increased accuracy in predicting answers to both closed-ended and open-ended questions, especially for open-ended questions, where a 10.8% increase in absolute accuracy is obtained. The source code can be downloaded from https://github.com/awenbocc/med-vqa.

Abstract:
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications. However, most previous works relied on heavy backbone networks and required prohibitive run-time consumption, which would seriously restrict their deployment scopes and cause poor scalability. To liberate these crowd counting models, we propose a novel Structured Knowledge Transfer (SKT) framework, which fully exploits the structured knowledge of a well-trained teacher network to generate a lightweight but still highly effective student network. Specifically, it is integrated with two complementary transfer modules, including an Intra-Layer Pattern Transfer which sequentially distills the knowledge embedded in layer-wise features of the teacher network to guide feature learning of the student network and an Inter-Layer Relation Transfer which densely distills the cross-layer correlation knowledge of the teacher to regularize the student's feature evolution. Consequently, our student network can derive the layer-wise and cross-layer knowledge from the teacher network to learn compact yet effective features. Extensive evaluations on three benchmarks well demonstrate the effectiveness of our SKT for extensive crowd counting models. In particular, only using around 6% of the parameters and computation cost of original models, our distilled VGG-based models obtain at least 6.5× speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance. Our code and models are available at https://github.com/HCPLab-SYSU/SKT.
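The two complementary transfer terms described above can be sketched roughly as follows. This is a simplified illustration, not the SKT implementation: it assumes that teacher and student features at each layer have already been brought to the same length (for example by pooling or a projection), and it uses cosine similarity between layers as a stand-in for the cross-layer correlation. All function names are hypothetical.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def intra_layer_loss(teacher_feats, student_feats):
    """Layer-wise term: the student mimics the teacher's features per layer."""
    per_layer = [mse(t, s) for t, s in zip(teacher_feats, student_feats)]
    return sum(per_layer) / len(per_layer)


def inter_layer_loss(teacher_feats, student_feats):
    """Cross-layer term: the student mimics how the teacher's features
    relate across every pair of layers (dense pairing)."""
    total, pairs = 0.0, 0
    n = len(teacher_feats)
    for i in range(n):
        for j in range(i + 1, n):
            t_rel = cosine(teacher_feats[i], teacher_feats[j])
            s_rel = cosine(student_feats[i], student_feats[j])
            total += (t_rel - s_rel) ** 2
            pairs += 1
    return total / pairs if pairs else 0.0
```

In a training loop, a weighted sum of these two terms would typically be added to the student's task loss; the weighting is left open here because the abstract does not specify it.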

Abstract:
This demonstration presents an instant and progressive cross-modality person search system called 'CMPS'. With this system, users can quickly find lost children or elderly persons simply by describing their appearance through speech. Unlike most existing person search applications, which require considerable time to find probe images, CMPS saves valuable time in the early stage after a person goes missing. The proposed CMPS is one of the first attempts towards instant and progressive person search leveraging the audio, text, and visual modalities together. In detail, the system first takes speech that describes the appearance of a person as input and obtains a textual description through speech-to-text conversion. Then the cross-modal search is performed by matching the textual embedding with the visual representations of images in the learned latent space. The retrieved images can be used as candidates for query expansion. If the candidates are not correct, the user can quickly adjust the description through speech. Once a correct image is found, the user can directly click it as a new query. Finally, the system provides the complete track of the lost person with a single click. On the built CUHK-PEDES-AUDIOS dataset, the system achieves 82.46% rank-1 accuracy at real-time speed. Our code of CMPS is available at https://github.com/SheldongChen/Search-People-With-Audio.

Abstract:
Recent domain adaptation work tends to obtain a uniform representation in an adversarial manner through joint learning of the domain discriminator and feature generator. However, this domain adversarial approach may yield sub-optimal performance for two potential reasons: First, it might fail to consider the task at hand when matching the distributions between the domains. Second, it generally treats the source and target domain data in the same way. In our opinion, the source domain data, which serves the feature adaptation purpose, should be supplementary, whereas the target domain data mainly needs to consider the task-specific classifier. Motivated by this, we propose a dual adversarial network for domain adaptation, where two adversarial learning processes are conducted iteratively, corresponding to the feature adaptation and the classification task respectively. The efficacy of the proposed method is first demonstrated on the Visual Domain Adaptation Challenge (VisDA) 2017, and then on two newly proposed Ground/Satellite-to-Aerial Scene adaptation tasks. For the proposed tasks, the data for the same scene is collected not only by a traditional camera on the ground, but also by satellite from outer space and by unmanned aerial vehicle (UAV) at high altitude. Since the semantic gap between the ground/satellite scene and the aerial scene is much larger than that between ground scenes, the newly proposed tasks are more challenging than traditional domain adaptation tasks. The datasets and code can be found at https://github.com/jianzhelin/DuAN.

Abstract:
Multimodal machine translation (MMT), which mainly focuses on enhancing text-only translation with visual features, has attracted considerable attention from both the computer vision and natural language processing communities. Most current MMT models resort to attention mechanisms, global context modeling or multimodal joint representation learning to utilize visual features. However, the attention mechanism lacks sufficient semantic interactions between modalities, while the other two provide fixed visual context, which is unsuitable for modeling the observed variability when generating translations. To address the above issues, in this paper, we propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT. Specifically, at each timestep of decoding, we first employ the conventional source-target attention to produce a timestep-specific source-side context vector. Next, DCCN takes this vector as input and uses it to guide the iterative extraction of related visual features via a context-guided dynamic routing mechanism. In particular, since we represent the input image with global and regional visual features, we introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities. Finally, we obtain two multimodal context vectors, which are fused and incorporated into the decoder for the prediction of the target word. Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN. Our code is available at https://github.com/DeepLearnXMU/MM-DCCN.

Abstract:
In recent years, discriminative trackers have shown great tracking performance, mainly due to online updating using samples collected during tracking. After updating, the model can adapt well to appearance changes of the objects and the background. However, these trackers have a serious disadvantage: wrong samples may cause severe model degradation. Most of the training samples in the tracking phase are obtained according to the tracking result of the current frame. Wrong training samples will be collected when the tracking result is inaccurate, seriously affecting the discrimination ability of the model. Besides, partial occlusion also leads to the same problem. In this paper, we propose an optimization module named MetricNet for filtering training samples online. It applies a matching network containing classification and distance branches, and uses multiple metric methods for different types of samples. MetricNet optimizes the training sample set by recognizing wrong and redundant samples, thereby improving the tracking performance. The proposed MetricNet can be regarded as an independent optimization module and integrated into any discriminative tracker that is updated online. Extensive experiments on three tracking datasets show its effectiveness and generalization ability. After applying MetricNet to MDNet, the tracking result is increased by 5.3% in terms of the success plot on the LaSOT dataset. Our project is available at https://github.com/zj5559/MetricNet.
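The idea of filtering wrong and redundant training samples can be sketched with a toy distance-based rule. This is not MetricNet itself, which uses a learned matching network; here a plain Euclidean distance on hypothetical feature vectors stands in for the learned metric, and the function name and both thresholds (max_dist, min_gap) are illustrative assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filter_samples(template, candidates, max_dist=1.0, min_gap=0.1):
    """Keep candidates that look like the target and add new information.

    A candidate is dropped as likely wrong if it is farther than max_dist
    from the target template, and dropped as redundant if it is closer
    than min_gap to a sample that was already kept.
    """
    kept = []
    for sample in candidates:
        if euclidean(sample, template) > max_dist:
            continue  # likely a wrong sample (drift or occlusion)
        if any(euclidean(sample, k) < min_gap for k in kept):
            continue  # nearly identical to a kept sample
        kept.append(sample)
    return kept
```

Only the samples returned by such a filter would then be passed to the tracker's online update step.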

Abstract:
Existing face restoration research typically relies on either an image degradation prior or explicit guidance labels for training, which often leads to limited generalization ability on real-world images with heterogeneous degradation and rich background contents. In this paper, we investigate a more challenging and practical "dual-blind" version of the problem by lifting the requirements on both types of prior, termed "Face Renovation" (FR). Specifically, we formulate FR as a semantic-guided generation problem and tackle it with a collaborative suppression and replenishment (CSR) approach. This leads to HiFaceGAN, a multi-stage framework containing several nested CSR units that progressively replenish facial details based on the hierarchical semantic guidance extracted from the front-end content-adaptive suppression modules. Extensive experiments on both synthetic and real face images have verified the superior performance of our HiFaceGAN over a wide range of challenging restoration subtasks, demonstrating its versatility, robustness and generalization ability towards real-world face processing applications. Code is available at https://github.com/Lotayou/Face-Renovation.

Abstract:
In recent years, the abuse of a face swap technique called deepfake has raised enormous public concerns. So far, a large number of deepfake videos (known as "deepfakes") have been crafted and uploaded to the internet, calling for effective countermeasures. One promising countermeasure against deepfakes is deepfake detection. Several deepfake datasets have been released to support the training and testing of deepfake detectors, such as DeepfakeDetection [1] and FaceForensics++ [23]. While this has greatly advanced deepfake detection, most of the real videos in these datasets are filmed with a few volunteer actors in limited scenes, and the fake videos are crafted by researchers using a few popular deepfake software tools. Detectors developed on these datasets may become less effective against real-world deepfakes on the internet. To better support detection against real-world deepfakes, in this paper, we introduce a new dataset, WildDeepfake, which consists of 7,314 face sequences extracted from 707 deepfake videos collected completely from the internet. WildDeepfake is a small dataset that can be used, in addition to existing datasets, to develop and test the effectiveness of deepfake detectors against real-world deepfakes. We conduct a systematic evaluation of a set of baseline detection networks on both existing datasets and our WildDeepfake dataset, and show that WildDeepfake is indeed a more challenging dataset, where the detection performance can decrease drastically. We also propose two (i.e., 2D and 3D) Attention-based Deepfake Detection Networks (ADDNets) to leverage the attention masks on real/fake faces for improved detection. We empirically verify the effectiveness of ADDNets on both existing datasets and WildDeepfake. The dataset is available at: https://github.com/deepfakeinthewild/deepfake-in-the-wild.

Abstract:
Traffic accident anticipation aims to predict accidents from dashcam videos as early as possible, which is critical to safety-guaranteed self-driving systems. With cluttered traffic scenes and limited visual cues, it is very challenging to predict how soon an accident will occur from early observed frames. Most existing approaches are developed to learn features of accident-relevant agents for accident anticipation, while ignoring the features of their spatial and temporal relations. Besides, current deterministic deep neural networks could be overconfident in false predictions, leading to a high risk of traffic accidents caused by self-driving systems. In this paper, we propose an uncertainty-based accident anticipation model with spatio-temporal relational learning. It sequentially predicts the probability of traffic accident occurrence from dashcam videos. Specifically, we propose to take advantage of graph convolution and recurrent networks for relational feature learning, and leverage Bayesian neural networks to address the intrinsic variability of latent relational representations. The derived uncertainty-based ranking loss is found to significantly boost model performance by improving the quality of relational features. In addition, we collect a new Car Crash Dataset (CCD) for traffic accident anticipation, which contains annotations of environmental attributes and accident reasons. Experimental results on both public and the newly compiled datasets show state-of-the-art performance of our model. Our code and CCD dataset are available at https://github.com/Cogito2012/UString.

Abstract:
Heterogeneous domain adaptation (HDA) transfers knowledge across source and target domains that present heterogeneities, e.g., distinct domain distributions and differences in feature type or dimension. Most previous HDA methods tackle this problem through learning a domain-invariant feature subspace to reduce the discrepancy between domains. However, the intrinsic semantic properties contained in data are under-explored in such alignment strategies, although they are also indispensable for achieving promising adaptability. In this paper, we propose a Simultaneous Semantic Alignment Network (SSAN) to simultaneously exploit correlations among categories and align the centroids for each category across domains. In particular, we propose an implicit semantic correlation loss to transfer the correlation knowledge of source categorical prediction distributions to the target domain. Meanwhile, by leveraging target pseudo-labels, a robust triplet-centroid alignment mechanism is explicitly applied to align feature representations for each category. Notably, a pseudo-label refinement procedure with geometric similarity involved is introduced to enhance the target pseudo-label assignment accuracy. Comprehensive experiments on various HDA tasks across text-to-image, image-to-image and text-to-text successfully validate the superiority of our SSAN against state-of-the-art HDA methods. The code is publicly available at https://github.com/BIT-DA/SSAN.

Abstract:
We study the problem of text-based visual question answering (T-VQA) in this paper. Unlike general visual question answering (VQA) which only builds connections between questions and visual contents, T-VQA requires reading and reasoning over both texts and visual concepts that appear in images. Challenges in T-VQA mainly lie in three aspects: 1) It is difficult to understand the complex logic in questions and extract specific useful information from rich image contents to answer them; 2) The text-related questions are also related to visual concepts, but it is difficult to capture cross-modal relationships between the texts and the visual concepts; 3) If the OCR (optical character recognition) system fails to detect the target text, the training will be very difficult. To address these issues, we propose a novel Cascade Reasoning Network (CRN) that consists of a progressive attention module (PAM) and a multimodal reasoning graph (MRG) module. Specifically, the PAM regards the multimodal information fusion operation as a stepwise encoding process and uses the previous attention results to guide the next fusion process. The MRG aims to explicitly model the connections and interactions between texts and visual concepts. To alleviate the dependence on the OCR system, we introduce an auxiliary task to train the model with accurate supervision signals, thereby enhancing the reasoning ability of the model in question answering. Extensive experiments on three popular T-VQA datasets demonstrate the effectiveness of our method compared with SOTA methods. The source code is available at https://github.com/guanghuixu/CRN_tvqa.

Abstract:
A crucial task in scene understanding is 3D object detection, which aims to detect and localize the 3D bounding boxes of objects belonging to specific classes. Existing 3D object detectors heavily rely on annotated 3D bounding boxes during training, while these annotations could be expensive to obtain and only accessible in limited scenarios. Weakly supervised learning is a promising approach to reducing the annotation requirement, but existing weakly supervised object detectors are mostly for 2D detection rather than 3D. In this work, we propose VS3D, a framework for weakly supervised 3D object detection from point clouds without using any ground truth 3D bounding box for training. First, we introduce an unsupervised 3D proposal module that generates object proposals by leveraging normalized point cloud densities. Second, we present a cross-modal knowledge distillation strategy, where a convolutional neural network learns to predict the final results from the 3D object proposals by querying a teacher network pretrained on image datasets. Comprehensive experiments on the challenging KITTI dataset demonstrate the superior performance of our VS3D in diverse evaluation settings. The source code and pretrained models are publicly available at https://github.com/Zengyi-Qin/Weakly-Supervised-3D-Object-Detection.

Abstract:
Video anomaly detection is an essential task in computer vision which attracts massive attention from academia and industry. The existing approaches are implemented in diverse deep learning frameworks and settings, making it difficult to reproduce the results published by the original authors. Undoubtedly, this phenomenon is detrimental to the development of video anomaly detection and to communication within the community. In this paper, we present a PyTorch-based video anomaly detection toolbox, namely PyAnomaly, that contains highly modular and extensible components, comprehensive and impartial evaluation platforms, a friendly and manageable system configuration, and abundant engineering deployment functions. To make it easy to use and easy to extend, we implement the architecture with hook and register functionality. Remarkably, we have reproduced experimental results for six representative methods that are comparable to those published by the original authors, and we will release these pre-trained models with richer configurations. To the best of our knowledge, PyAnomaly is the first open-source tool in video anomaly detection and is available at https://github.com/YuhaoCheng/PyAnomaly.

Abstract:
Arbitrary-shaped text detection is a challenging task due to the complex geometric layouts of texts such as large aspect ratios, various scales, random rotations and curved shapes. Most state-of-the-art methods solve this problem from bottom-up perspectives, seeking to model a text instance of complex geometric layouts with simple local units (e.g., local boxes or pixels) and generate detections with heuristic post-processing. In this work, we propose an arbitrary-shaped text detection method, namely TextRay, which conducts top-down contour-based geometric modeling and geometric parameter learning within a single-shot anchor-free framework. The geometric modeling is carried out in a polar system with a bidirectional mapping scheme between shape space and parameter space, encoding complex geometric layouts into unified representations. For effective learning of the representations, we design a central-weighted training strategy and a content loss which builds propagation paths between geometric encodings and visual content. TextRay outputs simple polygon detections in one pass with only one NMS post-processing step. Experiments on several benchmark datasets demonstrate the effectiveness of the proposed approach. The code is available at https://github.com/LianaWang/TextRay.
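
To make the idea of a polar contour encoding concrete, here is a minimal sketch that decodes a center point and a list of ray lengths into polygon vertices. It is a simplified illustration only: the function names, the evenly spaced angles, and the use of raw ray lengths as the parameters are assumptions, and the abstract does not specify TextRay's actual parameterization.

```python
import math

def rays_to_polygon(center, radii):
    """Decode a contour given as ray lengths at evenly spaced angles
    (hypothetical parameterization) into a list of (x, y) vertices."""
    cx, cy = center
    k = len(radii)
    return [
        (cx + r * math.cos(2 * math.pi * i / k),
         cy + r * math.sin(2 * math.pi * i / k))
        for i, r in enumerate(radii)
    ]

def polygon_area(points):
    """Shoelace formula; useful as a quick sanity check on a decoded contour."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

# Four unit-length rays around (10, 5) decode to a diamond of area 2.
polygon = rays_to_polygon((10.0, 5.0), [1.0, 1.0, 1.0, 1.0])
```

With more rays and varying lengths, the same decoding yields elongated or curved outlines from a single fixed-length vector, which is what makes a one-pass polygon output possible.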

Abstract:
This paper explores the problem of reconstructing high-resolution light field (LF) images from hybrid lenses, including a high-resolution camera surrounded by multiple low-resolution cameras. To tackle this challenge, we propose a novel end-to-end learning-based approach, which can comprehensively utilize the specific characteristics of the input from two complementary and parallel perspectives. Specifically, one module regresses a spatially consistent intermediate estimation by learning a deep multidimensional and cross-domain feature representation; the other one constructs another intermediate estimation, which maintains the high-frequency textures, by propagating the information of the high-resolution view. We finally leverage the advantages of the two intermediate estimations via the learned attention maps, leading to the final high-resolution LF image. Extensive experiments demonstrate the significant superiority of our approach over state-of-the-art ones. That is, our method not only improves the PSNR by more than 2 dB, but also preserves the LF structure much better. To the best of our knowledge, this is the first end-to-end deep learning method for reconstructing a high-resolution LF image with a hybrid input. We believe our framework could potentially decrease the cost of high-resolution LF data acquisition and also be beneficial to LF data storage and transmission. The code is available at https://github.com/jingjin25/LFhybridSR-Fusion.

Abstract:
Image enhancement from degradation of rainy artifacts plays a critical role in outdoor visual computing systems. In this paper, we tackle the notion of scale that deals with visual changes in the appearance of rain streaks with respect to the camera. Specifically, we revisit multi-scale representation by scale-space theory, and propose to represent the multi-scale correlation in the convolutional feature domain, which is more compact and robust than that in the pixel domain. Moreover, to improve the modeling ability of the network, we do not treat the extracted multi-scale features equally, but design a novel scale-space invariant attention mechanism to help the network focus on parts of the features. In this way, we summarize the most activated presence of feature maps as the salient features. Extensive experimental results on synthetic and real rainy scenes demonstrate the superior performance of our scheme over state-of-the-art methods. The source code of our method can be found at: https://github.com/pangbo1997/RainRemoval.

Abstract:
Humans can perceive subtle emotions from various cues and contexts, even without hearing or seeing others. However, existing video datasets mainly focus on recognizing the emotions of the speakers from complete modalities. In this work, we present the task of multimodal emotion reasoning in videos. Beyond directly recognizing emotions from multimodal signals of target persons, this task requires a machine capable of reasoning about human emotions from the contexts and surrounding world. To facilitate the study of this task, we introduce a new dataset, MEmoR, that provides fine-grained emotion annotations for both speakers and non-speakers. The videos in MEmoR are collected from TV shows that closely resemble real-life scenarios. In these videos, while speakers may be non-visually described, non-speakers always deliver no audio-textual signals and are often visually inconspicuous. This modality-missing characteristic makes MEmoR a more practical yet challenging testbed for multimodal emotion reasoning. In support of various reasoning behaviors, the proposed MEmoR dataset provides both short-term contexts and external knowledge. We further propose an attention-based reasoning approach to model the intra-personal emotion contexts, inter-personal emotion propagation, and the personalities of different individuals. Experimental results demonstrate that our proposed approach outperforms related baselines significantly. We isolate and analyze the validity of different reasoning modules across various emotions of speakers and non-speakers. Finally, we draw forth several future research directions for multimodal emotion reasoning with MEmoR, aiming to empower high Emotional Quotient (EQ) in modern artificial intelligence systems. The code and dataset are released at https://github.com/sunlightsgy/MEmoR.

Abstract:
Multi-person pose estimation has achieved great progress in recent years; even so, the precise prediction of occluded and invisible hard keypoints remains challenging. Most human pose estimation networks are equipped with an image classification-based pose encoder for feature extraction and a handcrafted pose decoder for high-resolution representations. However, the pose encoder might be sub-optimal because of the gap between image classification and pose estimation. The widely used multi-scale feature fusion in the pose decoder is still coarse and cannot provide sufficient high-resolution details for hard keypoints. Neural Architecture Search (NAS) has shown great potential in many visual tasks to automatically search efficient networks. In this work, we present the Pose-native Network Architecture Search (PoseNAS) to simultaneously design a better pose encoder and pose decoder for pose estimation. Specifically, we directly search a data-oriented pose encoder with stacked searchable cells, which can provide an optimum feature extractor for the pose-specific task. In the pose decoder, we exploit scale-adaptive fusion cells to promote rich information exchange across the multi-scale feature maps. Meanwhile, the pose decoder adopts a Fusion-and-Enhancement manner to progressively boost the high-resolution representations that are non-trivial for the precise prediction of hard keypoints. With the exquisitely designed search space and search strategy, PoseNAS can simultaneously search all modules in an end-to-end manner. PoseNAS achieves state-of-the-art performance on three public datasets, MPII, COCO, and PoseTrack, with small-scale parameters compared with the existing methods. Our best model obtains 76.7% mAP and 75.9% mAP on the COCO validation set and test set with only 33.6M parameters. Code and implementation are available at https://github.com/for-code0216/PoseNAS.

Abstract:
Despite the previous success of object analysis, detecting and segmenting a large number of object categories with a long-tailed data distribution remains a challenging problem and is less investigated. For a large-vocabulary classifier, the chance of obtaining noisy logits is much higher, which can easily lead to a wrong recognition. In this paper, we exploit prior knowledge of the relations among object categories to cluster fine-grained classes into coarser parent classes, and construct a classification tree that is responsible for parsing an object instance into a fine-grained category via its parent class. In the classification tree, as the number of parent class nodes is significantly smaller, their logits are less noisy and can be utilized to suppress the wrong/noisy logits existing in the fine-grained class nodes. As the way to construct the parent class is not unique, we further build multiple trees to form a classification forest where each tree contributes its vote to the fine-grained classification. To alleviate the imbalanced learning caused by the long-tail phenomena, we propose a simple yet effective resampling method, NMS Resampling, to re-balance the data distribution. Our method, termed Forest R-CNN, can serve as a plug-and-play module applied to most object recognition models for recognizing more than 1000 categories. Extensive experiments are performed on the large vocabulary dataset LVIS. Compared with the Mask R-CNN baseline, the Forest R-CNN significantly boosts the performance with 11.5% and 3.9% AP improvements on the rare categories and overall categories, respectively. Moreover, we achieve state-of-the-art results on the LVIS dataset. Code is available at https://github.com/JialianW/Forest_RCNN.

Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging and essential task in night-time intelligent surveillance systems. Besides the intra-modality variance that RGB-RGB person re-identification mainly overcomes, VI-ReID suffers from additional inter-modality variance caused by the inherent heterogeneous gap. To solve the problem, we present a carefully designed dual Gaussian-based variational auto-encoder (DG-VAE), which disentangles an identity-discriminable and an identity-ambiguous cross-modality feature subspace, following a mixture-of-Gaussians (MoG) prior and a standard Gaussian distribution prior, respectively. Disentangling cross-modality identity-discriminable features leads to more robust retrieval for VI-ReID. To achieve efficient optimization like a conventional VAE, we theoretically derive two variational inference terms for the MoG prior under the supervised setting, which not only restrict the identity-discriminable subspace so that the model explicitly handles the cross-modality intra-identity variance, but also enable the MoG distribution to avoid posterior collapse. Furthermore, we propose a triplet swap reconstruction (TSR) strategy to promote the above disentangling process. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on two VI-ReID datasets. Codes will be available at https://github.com/TPCD/DG-VAE.

Abstract:
In this paper, we address self-supervised representation learning from human skeletons for action recognition. Previous methods, which usually learn feature representations from a single reconstruction task, may suffer from overfitting, and the resulting features do not generalize well to action recognition. Instead, we propose to integrate multiple tasks to learn more general representations in a self-supervised manner. To realize this goal, we integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects. Skeleton dynamics can be modeled through motion prediction by predicting the future sequence. Temporal patterns, which are critical for action recognition, are learned through solving jigsaw puzzles. We further regularize the feature space by contrastive learning. Besides, we explore different training strategies to utilize the knowledge from self-supervised tasks for action recognition. We evaluate our multi-task self-supervised learning approach with action classifiers trained under different configurations, including unsupervised, semi-supervised and fully-supervised settings. Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show remarkable performance for action recognition, demonstrating the superiority of our method in learning more discriminative and general features. Our project website is available at https://langlandslin.github.io/projects/MSL/.

Abstract:
3D point cloud data is an important data source for autonomous vehicles to perceive the surroundings. Achieving accurate object tracking of 3D point clouds has become a challenging task. In this paper, we propose a 3D object two-stage re-track framework directly utilizing point clouds as the input, without using the ground truth as the reference box. The framework consists of a coarse stage and a fine stage. By tracking back the previous T frames and expanding the search space for each frame, we add the fine stage to re-track the lost objects of the coarse stage. Moreover, we design a dense AutoEncoder to enhance the discrimination in the latent space and improve shape completion performance, thus improving tracking performance. A Sample Update Strategy is also proposed to aggregate similar model shape samples in different frames, which improves the quality of the model shape. In terms of motion models for the proposed re-track framework, we further compare Kalman Filter with PointLSTM and do an extensive analysis. Finally, we test the re-track framework on the KITTI tracking dataset and outperform the public benchmark by 17.1%/15.5% in Success and Precision, respectively. Our code and model are available at https://github.com/FengZicai/Re-Track.

Abstract:
Lab2Pix refers to the task of generating photo-realistic images from labels, e.g., semantic labels or sketch labels. Despite inheriting from image-to-image translation, Lab2Pix develops its own characteristics due to the differences between labels and general images. This prevents Lab2Pix task from simply applying general image-to-image translation models. Therefore, we propose an unsupervised framework named Lab2Pix to adaptively synthesize images from labels by elegantly considering the particular properties of label to image synthesis task. Specifically, since the labels contain much less information than the images, we design our generator in a cumulative style which gradually renders synthesized images by fusing features in different levels. Accordingly, the verification process feeds the generated images to a segmentation component and compares the results to the original input label. Furthermore, we propose a sharp enhancement loss, an image consistency loss and a foreground enhancement mask to encourage the network to synthesize photo-realistic images. Experiments conducted on Cityscapes, Facades, Edge2shoes and Edge2handbags datasets demonstrate that our Lab2Pix significantly outperforms existing state-of-the-art unsupervised methods and is even comparable to supervised methods. The source code is available at https://github.com/RoseRollZhu/Lab2Pix.

Abstract:
In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTB-TIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. We annotate the bounding box of objects in every frame of all sequences and generate over 730K bounding boxes in total. To the best of our knowledge, LSOTB-TIR is the largest and most diverse TIR object tracking benchmark to date. To evaluate a tracker on different attributes, we define 4 scenario attributes and 12 challenge attributes in the evaluation dataset. By releasing LSOTB-TIR, we encourage the community to develop deep learning based TIR trackers and evaluate them fairly and comprehensively. We evaluate and analyze more than 30 trackers on LSOTB-TIR to provide a series of baselines, and the results show that deep trackers achieve promising performance. Furthermore, we re-train several representative deep trackers on LSOTB-TIR, and their results demonstrate that the proposed training dataset significantly improves the performance of deep TIR trackers. Codes and dataset are available at https://github.com/QiaoLiuHit/LSOTB-TIR.

Abstract:
There is a huge market demand for searching for products by images on e-commerce sites. Visual features play the most important role in solving this content-based image retrieval task. Most existing methods leverage models pre-trained on other large-scale datasets with well-annotated labels, e.g., the ImageNet dataset, to extract visual features. However, due to the large difference between the product images and the images in ImageNet, the feature extractor trained on ImageNet is not efficient in extracting the visual features of product images. Moreover, retraining the feature extractor on the product images faces the dilemma of lacking annotated labels. In this paper, we utilize the easily accessible text information, that is, the product title, as a supervised signal to learn the features of the product image. Specifically, we use the n-grams extracted from the product title as the label of the product image to construct a dataset for image classification. This dataset is then used to fine-tune a pre-trained model. Finally, the basic max-pooling activation of convolutions (MAC) feature is extracted from the fine-tuned model. As a result, we achieve the fourth position in the Grand Challenge of AI Meets Beauty in 2020 ACM Multimedia by using only a single ResNet-50 model without any human annotations and pre-processing or post-processing tricks. Our code is available at: https://github.com/FangxiangFeng/AI-Meets-Beauty-2020.
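
The labeling step described above, turning product titles into n-gram classification labels, can be sketched as follows. This is only an illustration under stated assumptions: the whitespace tokenization, the maximum n-gram length `n_max`, the frequency threshold `min_count`, and all function names are hypothetical choices, not details given in the abstract.

```python
from collections import Counter

def title_ngrams(title, n_max=2):
    """Return all word n-grams (n = 1..n_max) of a lower-cased title."""
    words = title.lower().split()
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams

def build_label_vocab(titles, n_max=2, min_count=2):
    """Keep n-grams appearing in at least min_count titles (assumed filter)
    and assign each a class index in alphabetical order."""
    counts = Counter()
    for title in titles:
        counts.update(set(title_ngrams(title, n_max)))
    kept = sorted(g for g, c in counts.items() if c >= min_count)
    return {gram: idx for idx, gram in enumerate(kept)}

def labels_for(title, vocab, n_max=2):
    """Multi-label target for one image: indices of its title's kept n-grams."""
    return sorted({vocab[g] for g in title_ngrams(title, n_max) if g in vocab})

titles = ["Red Matte Lipstick", "Matte Lipstick Set", "Blue Eyeliner"]
vocab = build_label_vocab(titles)
targets = [labels_for(t, vocab) for t in titles]
```

Each image then carries the label set of its title, so a standard classifier can be fine-tuned without manual annotation; titles whose n-grams are all rare end up with no labels and would presumably be dropped.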

Abstract:
Virtual try-on has attracted lots of research attention due to its potential applications in e-commerce, virtual reality and fashion design. However, existing methods can hardly preserve the fine-grained details (e.g., clothing texture, facial identity, hair style, skin tone) during generation, due to the non-rigid body deformation and multi-scale details. In this work, we propose a multi-stage framework to synthesize person images, where fine-grained details can be well preserved. To address the long-range translation and rich-details generation, we propose a Tree-Block (tree dilated fusion block) to replace standard ResNet-block where applicable. Notably, multi-scale feature maps can be smoothly fused for fine-grained detail generation, by incorporating larger spatial context at multiple scales. With a delicate end-to-end training scheme, our whole framework can be jointly optimized for results with significantly better visual fidelity and richer details. Moreover, we also explore the potential application in video-based virtual try-on. By harnessing the well-trained image generator and an extra video-level adaptor, a model photo can be well animated with a driving pose sequence. Extensive evaluations on standard datasets and user study demonstrate that our proposed framework achieves the state-of-the-art results, especially in preserving visual details in clothing texture and facial identity. Our implementation is publicly available via https://github.com/JDAI-CV/Down-to-the-Last-Detail-Virtual-Try-on-with-Detail-Carving.

Abstract:
Fashion manipulation has attracted growing interest due to its great application value, which has inspired much research on fashion images. However, little attention has been paid to fashion design drafts. In this paper, we study a new unaligned translation problem between design drafts and real fashion items, whose main challenge lies in the huge misalignment between the two modalities. We first collect paired design drafts and real fashion item images without pixel-wise alignment. To solve the misalignment problem, our main idea is to train a sampling network to adaptively adjust the input to an intermediate state with structure alignment to the output. Moreover, built upon the sampling network, we present a design draft to real fashion item translation network (D2RNet), where two separate translation streams that focus on texture and shape, respectively, are combined tactfully to get both benefits. D2RNet is able to generate realistic garments with both texture and shape consistency to their design drafts. We show that this idea can be effectively applied to the reverse translation problem and present R2DNet accordingly. Extensive experiments on unaligned fashion design translation demonstrate the superiority of our method over state-of-the-art methods. Our project website is available at: https://victoriahy.github.io/MM2020/.

Abstract:
Due to the significant development of deep learning (DL) techniques, recent advances in the super-resolution (SR) field have achieved great performance. In the pursuit of better performance, later networks tend to be deeper and heavier, which limits the application of SR algorithms on resource-constrained devices. Some advances rely on recurrent/recursive learning to reduce the number of network parameters; however, they ignore the resulting long inference time, since the more recurrences/recursions are involved, the longer the inference time the network needs. To address this trade-off between reconstruction performance, the number of network parameters, and inference time, we propose a lightweight and fast network (WSR) to learn wavelet coefficients of the target image progressively for single image super-resolution. More specifically, the network comprises two main branches. One is used for predicting the second-level low-frequency wavelet coefficients, and the other is designed in a recurrent way for predicting the remaining wavelet coefficients at the first and second levels. Finally, an inverse wavelet transformation is adopted to reconstruct the SR images from these coefficients. In addition, we propose a deformable convolution kernel (side window) to construct the side-information multi-distillation block (S-IMDB), which is the basic unit of the recurrent blocks (RBs). We train the WSR with loss constraints in the wavelet and spatial domains. Comprehensive experiments demonstrate that our WSR achieves a better trade-off than most of the state-of-the-art approaches. Code is available at https://github.com/FVL2020/WSR.
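
The reconstruction step relies on the fact that an image is exactly recoverable from its wavelet coefficients. The sketch below shows this for a single-level 2D Haar transform; the choice of the Haar wavelet, the sub-band naming, and the function names are assumptions for illustration, since the abstract does not state which wavelet WSR uses.

```python
def haar_forward(img):
    """Single-level 2D Haar transform of an image with even height and width.
    Returns four half-resolution sub-bands (LL, LH, HL, HH)."""
    h, w = len(img), len(img[0])
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            row_ll.append((a + b + c + d) / 2)  # local average (low frequency)
            row_lh.append((a - b + c - d) / 2)  # left-right difference
            row_hl.append((a + b - c - d) / 2)  # top-bottom difference
            row_hh.append((a - b - c + d) / 2)  # diagonal difference
        ll.append(row_ll); lh.append(row_lh); hl.append(row_hl); hh.append(row_hh)
    return ll, lh, hl, hh

def haar_inverse(ll, lh, hl, hh):
    """Rebuild the full-resolution image from the four sub-bands."""
    h, w = len(ll), len(ll[0])
    img = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, x, y, z = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            img[2 * i][2 * j] = (s + x + y + z) / 2
            img[2 * i][2 * j + 1] = (s - x + y - z) / 2
            img[2 * i + 1][2 * j] = (s + x - y - z) / 2
            img[2 * i + 1][2 * j + 1] = (s - x - y + z) / 2
    return img

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_forward(image)
restored = haar_inverse(*bands)  # equals the original image
```

Applying `haar_forward` again to the LL band gives second-level coefficients, so a network that predicts the second-level low-frequency band plus the remaining first- and second-level bands supplies everything two inverse steps need.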

Abstract:
Recent studies have shown that deep neural networks (DNNs) are susceptible to adversarial attacks even in black-box settings. However, previous studies create black-box adversarial examples by merely solving the traditional continuous problem, and thus suffer from query efficiency issues. To address the query efficiency of black-box attacks, we propose a novel attack, called MGAAttack, which is a query-efficient and gradient-free black-box attack that requires no knowledge of the target model. In our approach, we leverage the advantages of both transfer-based and score-based methods, two typical techniques in black-box attacks, and solve a discretized problem by using a simple yet effective microbial genetic algorithm (MGA). Experimental results show that our approach dramatically reduces the number of queries on CIFAR-10 and ImageNet and significantly outperforms previous work. In the untargeted attack, we can attack a VGG19 classifier with only 16 queries and achieve an attack success rate of more than 99.90% on ImageNet. Our code is available at https://github.com/kangyangWHU/MGAAttack.

Abstract:
Monitoring the population and movements of endangered species is an important task in wildlife conservation. Traditional tagging methods do not scale to large populations, while applying computer vision methods to camera sensor data requires re-identification (re-ID) algorithms to obtain accurate counts and movement trajectories of wildlife. However, existing re-ID methods are largely targeted at persons and cars, which have limited pose variations and constrained capture environments. This paper tries to fill the gap by introducing a novel large-scale dataset, the Amur Tiger Re-identification in the Wild (ATRW) dataset. ATRW contains over 8,000 video clips from 92 Amur tigers, with bounding box, pose keypoint, and tiger identity annotations. In contrast to typical re-ID datasets, the tigers are captured in a diverse set of unconstrained poses and lighting conditions. We demonstrate with a set of baseline algorithms that ATRW is a challenging dataset for re-ID. Lastly, we propose a novel method for tiger re-identification, which introduces precise pose part modeling in deep neural networks to handle the large pose variation of tigers, and achieves a notable performance improvement over existing re-ID methods. The ATRW dataset is publicly available at https://cvwc2019.github.io/challenge.html

Abstract:
Event analysis in untrimmed videos has attracted increasing attention due to the application of cutting-edge techniques such as CNN. As a well-studied property of CNN-based models, the receptive field measures the spatial range covered by a single feature response, which is crucial in improving image categorization accuracy. In the video domain, video event semantics are actually described by complex interactions among different concepts, while their behaviors vary drastically from one video to another, leading to difficulty in concept-based analytics for accurate event categorization. To model the concept behavior, we study the temporal concept receptive field of concept-based event representation, which encodes the temporal occurrence pattern of different mid-level concepts. Accordingly, we introduce temporal dynamic convolution (TDC) to give stronger flexibility to concept-based event analytics. TDC can adjust the temporal concept receptive field size dynamically according to different inputs. Notably, a set of coefficients is learned to fuse the results of multiple convolutions with different kernel widths that provide various temporal concept receptive field sizes. Different coefficients can generate appropriate and accurate temporal concept receptive field sizes according to input videos and highlight crucial concepts. Based on TDC, we propose the temporal dynamic concept modeling network (TDCMN) to learn an accurate and complete concept representation for efficient untrimmed video analysis. Experimental results on FCVID and ActivityNet show that TDCMN demonstrates adaptive event recognition ability conditioned on different inputs, and improves the event recognition performance of concept-based methods by a large margin. Code is available at https://github.com/qzhb/TDCMN.
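
The fusion of convolutions with different kernel widths by input-dependent coefficients can be sketched for a single concept score sequence as below. This is a minimal illustration, not the TDC layer itself: the gating function (a per-branch linear function of the sequence mean followed by softmax), the fixed averaging kernels standing in for learned ones, and all names are assumptions.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_conv(seq, kernel):
    """Zero-padded 'same' 1-D correlation of a sequence with an odd-width kernel."""
    half = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(seq):
                acc += w * seq[idx]
        out.append(acc)
    return out

def dynamic_temporal_conv(seq, kernels, gate_weights, gate_biases):
    """Fuse branches with different kernel widths. The fusion coefficients
    depend on the input through a hypothetical gate on the sequence mean."""
    mean = sum(seq) / len(seq)
    coeffs = softmax([w * mean + b for w, b in zip(gate_weights, gate_biases)])
    branches = [temporal_conv(seq, k) for k in kernels]
    fused = [sum(a * br[t] for a, br in zip(coeffs, branches))
             for t in range(len(seq))]
    return fused, coeffs

# One concept's per-segment scores, with branches of width 1, 3 and 5.
scores = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
kernels = [[1.0], [1 / 3] * 3, [0.2] * 5]
fused, coeffs = dynamic_temporal_conv(scores, kernels, [0.0, 1.0, 2.0], [0.0, 0.0, 0.0])
```

Because the coefficients are computed from the input, a video whose concept stays active for long stretches can lean on the wide branch, while a video with brief concept occurrences can lean on the narrow one, which is the sense in which the effective receptive field size adapts.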

Abstract:
Event recognition of untrimmed video is a challenging task due to the big gap between low level visual features and event semantics. Beyond feature learning via deep neural networks, some recent works focus on analyzing event videos using concept-based representation. However, these methods simply aggregate the concept representation vectors of frames or segments, which inevitably introduces information loss on video-level concept knowledge. Moreover, the diversified relation between different concept domains (e.g., scene, object and action) has not been fully explored. To address the above issues, we propose a concept knowledge mining network (CKMN) for event recognition. CKMN is composed of an intra-domain concept knowledge mining subnetwork (IaCKM) and an inter-domain concept knowledge mining subnetwork (IrCKM). IaCKM aims to obtain a complete concept representation by mining the existing pattern of each concept at different time granularities with dilated temporal pyramid convolution and temporal self-attention, while IrCKM explores the interaction between different types of concepts with co-attention style learning. We evaluate our method on FCVID and ActivityNet datasets. Experimental results show the effectiveness and better interpretability of our model on event analytics. Code is available at https://github.com/qzhb/CKMN.

Abstract:
In this companion paper, we first briefly summarize the contributions of our main manuscript: Selective Deep Convolutional Features for Image Retrieval, published in ACM MultiMedia 2017. In addition, we provide detailed instructions together with pre-configured MATLAB scripts that allow the experiments to be executed and the results reported in our main manuscript to be reproduced effortlessly. The source code is available at https://github.com/hnanhtuan/selectiveConvFeatures_ACMMM_reproducibility.

Abstract:
Beauty product retrieval has drawn more and more attention for its wide application outlook and enormous economic benefits. However, this task is always challenging due to the variation of products, especially the disturbance of cluttered backgrounds. In this paper, we first introduce an attention mechanism into a global image descriptor, i.e., Maximum Activation of Convolutions (MAC), and propose Attention-based MAC (AMAC). With this enhancement, we can suppress the negative effect of the background and highlight the foreground in an unsupervised manner. Then, AMAC and local descriptors are ensembled to complementarily increase the performance. Furthermore, we try to finetune multiple retrieval methods on the different datasets and adopt a query expansion strategy to obtain more improvements. Extensive experiments conducted on a dataset containing more than half a million beauty products (Perfect-500K) demonstrate the effectiveness of the proposed method. Finally, our team (USTC-NELSLIP) wins the first place on the leaderboard of the 'AI Meets Beauty' Grand Challenge of ACM Multimedia 2020. The code is available at: https://github.com/gniknoil/Perfect500K-Beauty-Product-Retrieval-Challenge.
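
To make the descriptor in the abstract above concrete, the following sketch computes MAC as a per-channel maximum over spatial positions and then an attention-weighted variant. The abstract does not specify how the attention in AMAC is computed, so the attention map used here (channel-summed activations scaled to [0, 1]) is purely an assumption, as are the function names and the nested-list feature map layout; activations are assumed non-negative, as after a ReLU.

```python
def mac(feature_map):
    """MAC descriptor: the maximum activation of each channel over all positions.

    feature_map is a list of C channels, each an H x W nested list of floats.
    """
    return [max(max(row) for row in channel) for channel in feature_map]

def attention_map(feature_map):
    """Hypothetical spatial attention: channel-summed activations scaled to [0, 1]."""
    height, width = len(feature_map[0]), len(feature_map[0][0])
    summed = [[sum(ch[i][j] for ch in feature_map) for j in range(width)]
              for i in range(height)]
    peak = max(max(row) for row in summed)
    if peak <= 0:
        # No positive response anywhere: fall back to uniform attention.
        return [[1.0] * width for _ in range(height)]
    return [[v / peak for v in row] for row in summed]

def attention_mac(feature_map):
    """Weight every position by the attention map, then apply MAC pooling."""
    att = attention_map(feature_map)
    weighted = [[[v * att[i][j] for j, v in enumerate(row)]
                 for i, row in enumerate(channel)]
                for channel in feature_map]
    return mac(weighted)

# Usage: a toy 2-channel, 2 x 2 feature map.
toy_map = [[[1.0, 0.0], [0.0, 4.0]],
           [[2.0, 0.0], [0.0, 0.0]]]
plain = mac(toy_map)
attended = attention_mac(toy_map)
```

In this toy case the second channel's response sits at a weakly attended position, so its pooled value shrinks while the first channel is unchanged.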

Abstract:
Conventional referring expression comprehension (REF) assumes people to query something from an image by describing its visual appearance and spatial location, but in practice, we often ask for an object by describing its affordance or other non-visual attributes, especially when we do not have a precise target. For example, sometimes we say 'Give me something to eat'. In this case, we need to use commonsense knowledge to identify the objects in the image. Unfortunately, there is no existing referring expression dataset reflecting this requirement, not to mention a model to tackle this challenge. In this paper, we collect a new referring expression dataset, called KB-Ref, containing 43k expressions on 16k images. In KB-Ref, to answer each expression (detect the target object referred by the expression), at least one piece of commonsense knowledge must be required. We then test state-of-the-art (SoTA) REF models on KB-Ref, finding that all of them present a large drop compared to their outstanding performance on general REF datasets. We also present an expression conditioned image and fact attention (ECIFA) network that extracts information from correlated image regions and commonsense knowledge facts. Our method leads to a significant improvement over SoTA REF models, although there is still a gap between this strong baseline and human performance. The dataset and baseline models are available at: https://github.com/wangpengnorman/KB-Ref_dataset.

Abstract:
Existing works address the problem of generating high frame-rate sharp videos by separately learning the frame deblurring and frame interpolation modules. Most of these approaches have a strong prior assumption that all the input frames are blurry, whereas in a real-world setting the quality of frames varies. Moreover, such approaches are trained to perform either of the two tasks, deblurring or interpolation, in isolation, while many practical situations call for both. Different from these works, we address a more realistic problem of high frame-rate sharp video synthesis with no prior assumption that the input is always blurry. We introduce a novel architecture, Adaptive Latent Attention Network (ALANET), which synthesizes sharp high frame-rate videos with no prior knowledge of whether the input frames are blurry, thereby performing both deblurring and interpolation. We hypothesize that information from the latent representation of the consecutive frames can be utilized to generate optimized representations for both frame deblurring and frame interpolation. Specifically, we employ a combination of self-attention and cross-attention modules between consecutive frames in the latent space to generate an optimized representation for each frame. The optimized representation learnt using these attention modules helps the model to generate and interpolate sharp frames. Extensive experiments on standard datasets demonstrate that our method performs favorably against various state-of-the-art approaches, even though we tackle a much more difficult problem. The project page is available at https://agupt013.github.io/ALANET.html.

Abstract:
Existing person re-identification methods rely on the visual sensor to capture the pedestrians. The image or video data from visual sensor inevitably suffers the occlusion and dramatic variations of pedestrian postures, which degrades the re-identification performance and further limits its application to the open environment. On the other hand, for most people, one of the most important carry-on items is the mobile phone, which can be sensed by WiFi and cellular networks in the form of a wireless positioning signal. Such signal is robust to the pedestrian occlusion and visual appearance change, but suffers some positioning error. In this work, we approach person re-identification with the sensing data from both vision and wireless positioning. To take advantage of such cross-modality cues, we propose a novel recurrent context propagation module that enables information to propagate between visual data and wireless positioning data and finally improves the matching accuracy. To evaluate our approach, we contribute a new Wireless Positioning Person Re-identification (WP-ReID) dataset. Extensive experiments are conducted and demonstrate the effectiveness of the proposed algorithm. Code will be released at https://github.com/yolomax/WP-ReID.

Abstract:
As a very important research issue in digital media art, neural learning based video style transfer has attracted more and more attention. Many recent works import optical flow methods into the original image style transfer framework to preserve frame coherency and prevent flicker. However, these methods rely heavily on paired video datasets of content video and stylized video, which are often difficult to obtain. Another limitation of existing methods is that while maintaining inter-frame coherency, they introduce strong ghosting artifacts. In order to address these problems, this paper makes the following contributions: (1) it presents a novel training framework for video style transfer without dependency on a video dataset of the target style; (2) it is the first to focus on the ghosting problem existing in most previous works, and uses a partial convolution-based strategy to utilize inter-frame context and correlation, together with an additional depth loss as a constraint on the generated frames, to suppress ghosting artifacts and preserve stability at the same time. Extensive experiments demonstrate that our method can produce natural and stable video frames with the target style. Qualitative and quantitative comparisons also show that the proposed approach outperforms previous works in terms of overall image quality and inter-frame stability. To facilitate future research, we publish our experiment code at https://github.com/Huage001/Artistic-Video-Partial-Conv-Depth-Loss.

Abstract:
Estimating the 3D hand pose from a monocular RGB image is important but challenging. A solution is training on large-scale RGB hand images with accurate 3D hand keypoint annotations. However, it is too expensive in practice. Instead, we develop a learning-based approach to synthesize realistic, diverse, and 3D pose-preserving hand images under the guidance of 3D pose information. We propose a 3D-aware multi-modal guided hand generative network (MM-Hand), together with a novel geometry-based curriculum learning strategy. Our extensive experimental results demonstrate that the 3D-annotated images generated by MM-Hand qualitatively and quantitatively outperform existing options. Moreover, the augmented data can consistently improve the quantitative performance of the state-of-the-art 3D hand pose estimators on two benchmark datasets. The code will be available at https://github.com/ScottHoang/mm-hand.

Abstract:
Person re-identification (Re-ID) aims at retrieving an input person image from a set of images captured by multiple cameras. Although recent Re-ID methods have achieved great success, most of them extract features in terms of the attributes of clothing (e.g., color, texture). However, it is common for people to wear black clothes or be captured by surveillance systems under low-light illumination, in which cases the attributes of the clothing are severely missing. We call this problem the Black Re-ID problem. To solve this problem, rather than relying on the clothing information, we propose to exploit head-shoulder features to assist person Re-ID. The head-shoulder adaptive attention network (HAA) is proposed to learn the head-shoulder feature, and an innovative ensemble method is designed to enhance the generalization of our model. Given the input person image, the ensemble method focuses on the head-shoulder feature by assigning it a larger weight if the individual inside the image is in black clothing. Due to the lack of a suitable benchmark dataset for studying the Black Re-ID problem, we also contribute the first Black-reID dataset, which contains 1274 identities in the training set. Extensive evaluations on the Black-reID, Market1501 and DukeMTMC-reID datasets show that our model achieves the best result compared with the state-of-the-art Re-ID methods on both Black and conventional Re-ID problems. Furthermore, our method also proves effective in dealing with person Re-ID in similar clothing. Our code and dataset are available at https://github.com/xbq1994/.

Abstract:
Currently, most image quality assessment (IQA) models are supervised by the MAE or MSE loss with empirically slow convergence. It is well-known that normalization can facilitate fast convergence. Therefore, we explore normalization in the design of loss functions for IQA. Specifically, we first normalize the predicted quality scores and the corresponding subjective quality scores. Then, the loss is defined based on the norm of the differences between these normalized values. The resulting "Norm-in-Norm" loss encourages the IQA model to make linear predictions with respect to subjective quality scores. After training, the least squares regression is applied to determine the linear mapping from the predicted quality to the subjective quality. It is shown that the new loss is closely connected with two common IQA performance criteria (PLCC and RMSE). Through theoretical analysis, it is proved that the embedded normalization makes the gradients of the loss function more stable and more predictable, which is conducive to the faster convergence of the IQA model. Furthermore, to experimentally verify the effectiveness of the proposed loss, it is applied to solve a challenging problem: quality assessment of in-the-wild images. Experiments on two relevant datasets (KonIQ-10k and CLIVE) show that, compared to MAE or MSE loss, the new loss enables the IQA model to converge about 10 times faster and the final model achieves better performance. The proposed model also achieves state-of-the-art prediction performance on this challenging problem. For reproducible scientific research, our code is publicly available at https://github.com/lidq92/LinearityIQA.
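
The two steps named in the abstract above (normalize both score vectors, then take a norm of their difference) can be sketched as follows. This is a simplified illustration, not the released code: it assumes mean-centering followed by L2 normalization and a squared L2 difference, and the names `normalize` and `norm_in_norm_loss` are hypothetical. Under these particular choices the loss equals 2 - 2·PLCC up to the small `eps` term, which is one way to see the connection to PLCC that the abstract mentions.

```python
import math

def normalize(scores, eps=1e-8):
    """Subtract the mean, then divide by the L2 norm of the centered scores."""
    mean = sum(scores) / len(scores)
    centered = [s - mean for s in scores]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / (norm + eps) for c in centered]

def norm_in_norm_loss(predicted, subjective):
    """Squared L2 distance between the normalized score vectors (assumed p = q = 2)."""
    p = normalize(predicted)
    s = normalize(subjective)
    return sum((a - b) ** 2 for a, b in zip(p, s))

# Usage: predictions that are a linear function of the subjective scores,
# and predictions that are perfectly anti-correlated with them.
linear_loss = norm_in_norm_loss([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
reversed_loss = norm_in_norm_loss([3.0, 2.0, 1.0], [10.0, 20.0, 30.0])
```

Because any positive linear rescaling of the predictions leaves the loss unchanged, a separate least squares fit is still needed afterwards to map predictions onto the subjective scale, as the abstract notes.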

Abstract:
We address the problem of unsupervised gaze correction in the wild, presenting a solution that works without the need of precise annotations of the gaze angle and the head pose. We created a new dataset called CelebAGaze consisting of two domains X, Y, where the eyes are either staring at the camera or somewhere else. Our method consists of three novel modules: the Gaze Correction module (GCM), the Gaze Animation module (GAM), and the Pretrained Autoencoder module (PAM). Specifically, GCM and GAM separately train a dual in-painting network using data from the domain X for gaze correction and data from the domain Y for gaze animation. Additionally, a Synthesis-As-Training method is proposed when training GAM to encourage the features encoded from the eye region to be correlated with the angle information, resulting in gaze animation achieved by interpolation in the latent space. To further preserve the identity information, e.g., eye shape and iris color, we propose the PAM with an Autoencoder, which is based on Self-Supervised mirror learning where the bottleneck features are angle-invariant and which works as an extra input to the dual in-painting models. Extensive experiments validate the effectiveness of the proposed method for gaze correction and gaze animation in the wild and demonstrate the superiority of our approach in producing more compelling results than state-of-the-art baselines. Our code, the pretrained models and supplementary results are available at: https://github.com/zhangqianhui/GazeAnimation.

Abstract:
Recently, facial expression recognition (FER) in the wild has gained a lot of researchers' attention because it is a valuable topic to enable the FER techniques to move from the laboratory to the real applications. In this paper, we focus on this challenging but interesting topic and make contributions from three aspects. First, we present a new large-scale 'in-the-wild' dynamic facial expression database, DFEW (Dynamic Facial Expression in the Wild), consisting of over 16,000 video clips from thousands of movies. These video clips contain various challenging interferences in practical scenarios such as extreme illumination, occlusions, and capricious pose changes. Second, we propose a novel method called Expression-Clustered Spatiotemporal Feature Learning (EC-STFL) framework to deal with dynamic FER in the wild. Third, we conduct extensive benchmark experiments on DFEW using a lot of spatiotemporal deep feature learning methods as well as our proposed EC-STFL. Experimental results show that DFEW is a well-designed and challenging database, and the proposed EC-STFL can promisingly improve the performance of existing spatiotemporal deep neural networks in coping with the problem of dynamic FER in the wild. Our DFEW database is publicly available and can be freely downloaded from https://dfew-dataset.github.io/.

Abstract:
Person re-identification has seen significant advancement in recent years. However, the ability of learned models to generalize to unknown target domains still remains limited. One possible reason for this is the lack of large-scale and diverse source training data, since manually labeling such a dataset is very expensive and privacy sensitive. To address this, we propose to automatically synthesize a large-scale person re-identification dataset following a set-up similar to real surveillance but with virtual environments, and then use the synthesized person images to train a generalizable person re-identification model. Specifically, we design a method to generate a large number of random UV texture maps and use them to create different 3D clothing models. Then, an automatic code is developed to randomly generate various different 3D characters with diverse clothes, races and attributes. Next, we simulate a number of different virtual environments using Unity3D, with customized camera networks similar to real surveillance systems, and import multiple 3D characters at the same time, with various movements and interactions along different paths through the camera networks. As a result, we obtain a virtual dataset, called RandPerson, with 1,801,816 person images of 8,000 identities. By training person re-identification models on these synthesized person images, we demonstrate, for the first time, that models trained on virtual data can generalize well to unseen target images, surpassing the models trained on various real-world datasets, including CUHK03, Market-1501, DukeMTMC-reID, and almost MSMT17. The RandPerson dataset is available at https://github.com/VideoObjectSearch/RandPerson.

Abstract:
Beauty and personal care product retrieval (BPCR) aims to match a query image of an item to examples of the same item in a large database. The task is extremely challenging because a small number of ground-truth examples have to be found in a large search space. Previous works mostly search only with visual representations and have not made full use of the product descriptions. Many noisy examples have only subtle visual differences compared to the ground-truth examples (e.g. similar packaging but different brands), and those differences (e.g. product brands) are especially hard to capture with visual features alone, so methods based merely on visual feature similarities can easily regard those noisy examples as examples of the same item in the query image. We notice that the product descriptions are good sources for capturing those subtle visual differences. Therefore, we propose a search method utilizing both images and product descriptions in this work. Before searching, we prepare not only attention-based visual features for each database image but also a textual index (TI) that matches each database example to other examples with similar product descriptions. During searching, the visual feature of the query image is first searched in the whole database and then searched in a subset obtained by looking up the TI. Finally, the second result is used to refine the initial result. Since the subset examples usually have similar properties (e.g. brands and type), the noisy examples in the initial result can be effectively replaced. We have experimentally demonstrated the effectiveness of the proposed method on the validation set of the Perfect-500K dataset. Our team (NTU-Beauty) achieved the 3rd place on the leaderboard of the Grand Challenge of AI Meets Beauty in ACM Multimedia 2020. Our code is available at: https://github.com/jingwenh/2020-ai-meets-beauty_ntubeauty.git.

Abstract:
Security inspection often deals with a piece of baggage or suitcase where objects are heavily overlapped with each other, resulting in an unsatisfactory performance for prohibited items detection in X-ray images. In the literature, there have been rare studies and datasets touching this important topic. In this work, we contribute the first high-quality object detection dataset for security inspection, named Occluded Prohibited Items X-ray (OPIXray) image benchmark. OPIXray focused on the widely-occurred prohibited item "cutter", annotated manually by professional inspectors from the international airport. The test set is further divided into three occlusion levels to better understand the performance of detectors. Furthermore, to deal with the occlusion in X-ray images detection, we propose the De-occlusion Attention Module (DOAM), a plug-and-play module that can be easily inserted into and thus promote most popular detectors. Despite the heavy occlusion in X-ray imaging, shape appearance of objects can be preserved well, and meanwhile different materials visually appear with different colors and textures. Motivated by these observations, our DOAM simultaneously leverages the different appearance information of the prohibited item to generate the attention map, which helps refine feature maps for the general detectors. We comprehensively evaluate our module on the OPIXray dataset, and demonstrate that our module can consistently improve the performance of the state-of-the-art detection methods such as SSD, FCOS, etc, and significantly outperforms several widely-used attention mechanisms. In particular, the advantages of DOAM are more significant in the scenarios with higher levels of occlusion, which demonstrates its potential application in real-world inspections. The OPIXray benchmark and our model are released at https://github.com/OPIXray-author/OPIXray.

Abstract:
Structured information extraction from document images usually consists of three steps: text detection, text recognition, and text field labeling. While text detection and text recognition have been heavily studied and improved a lot in literature, text field labeling is less explored and still faces many challenges. Existing learning based methods for text labeling task usually require a large amount of labeled examples to train a specific model for each type of document. However, collecting large amounts of document images and labeling them is difficult and sometimes impossible due to privacy issues. Deploying separate models for each type of document also consumes a lot of resources. Facing these challenges, we explore one-shot learning for the text field labeling task. Existing one-shot learning methods for the task are mostly rule-based and have difficulty in labeling fields in crowded regions with few landmarks and fields consisting of multiple separate text regions. To alleviate these problems, we proposed a novel deep end-to-end trainable approach for one-shot text field labeling, which makes use of attention mechanism to transfer the layout information between document images. We further applied conditional random field on the transferred layout information for the refinement of field labeling. We collected and annotated a real-world one-shot field labeling dataset with a large variety of document types and conducted extensive experiments to examine the effectiveness of the proposed model. To stimulate research in this direction, the collected dataset and the one-shot model will be released (https://github.com/AlibabaPAI/one_shot_text_labeling).

Abstract:
Analyzing the interactions between humans and objects from a video includes identification of the relationships between humans and the objects present in the video. It can be thought of as a specialized version of Visual Relationship Detection, wherein one of the objects must be a human. While traditional methods formulate the problem as inference on a sequence of video segments, we present a hierarchical approach, LIGHTEN, to learn visual features to effectively capture spatio-temporal cues at multiple granularities in a video. Unlike current approaches, LIGHTEN avoids using ground truth data like depth maps or 3D human pose, thus increasing generalization across non-RGBD datasets as well. Furthermore, we achieve the same using only the visual features, instead of the commonly used hand-crafted spatial features. We achieve state-of-the-art results in human-object interaction detection (88.9% and 92.6%) and anticipation tasks of CAD-120 and competitive results on image based HOI detection in V-COCO dataset, setting a new benchmark for visual features based approaches. Code for LIGHTEN is available at https://github.com/praneeth11009/LIGHTEN-Learning-Interactions-with-Graphs-and-Hierarchical-TEmporal-Networks-for-HOI

Abstract:
Face alignment is an important task in the field of multi-media. Together with the impressive progress of algorithms, various benchmark datasets have been released in recent years. Intuitively, it is meaningful to integrate multiple labeled datasets with different annotations to achieve higher performance on a target landmark detector. Although numerous efforts have been made in joint usage, three shortcomings remain in recent works: additional computation, limitation of the markup scheme, and limited support for the regression method. To address the above problems, we propose a novel Alternating Training Framework (ATF), which leverages similarity and diversity across multi-media sources for a more robust detector. Our framework mainly contains two sub-modules: Alternating Training with Decreasing Proportions (ATDP) and Mixed Branch Loss ($\mathcal{L}_{MB}$). In particular, ATDP trains multiple datasets simultaneously to take advantage of the diversity between them, while $\mathcal{L}_{MB}$ utilizes similar landmark pairs to constrain different branches of corresponding datasets. Extensive experiments on various benchmarks show the effectiveness of our framework, and ATF is feasible for both heatmap-based networks and direct coordinate regression. Specifically, the mean error reaches 3.17 in the experiment on 300W leveraging WFLW, which significantly outperforms state-of-the-art methods. Both in an ordinary convolutional network (OCN) and HRNET, ATF achieves up to 9.96% relative improvement. Our source codes are made publicly available at https://github.com/starhiking/ATF.

Abstract:
Despite significant progress of applying deep learning methods to the field of content-based image retrieval, there has not been a software library that covers these methods in a unified manner. In order to fill this gap, we introduce PyRetri, an open source library for deep learning based unsupervised image retrieval. The library encapsulates the retrieval process in several stages and provides functionality that covers various prominent methods for each stage. The idea underlying its design is to provide a unified platform for deep learning based image retrieval research, with high usability and extensibility. The project source code, with usage examples, sample data and pre-trained models are available at https://github.com/PyRetri/.

Abstract:
Deep neural networks with multi-scale feature fusion have achieved great success in human pose estimation. However, drawbacks still exist in these methods: 1) they consider multi-scale features equally, which may over-emphasize redundant features; 2) preferring deeper structures, they can learn features with strong semantic representation, but tend to lose natural discriminative information; 3) to attain good performance, they rely heavily on pretraining, which is time-consuming, or even unavailable in practice. To mitigate these problems, we propose a novel comprehensive recalibration model called Pyramid GAting Network (PGA-Net) that is capable of distilling, selecting, and fusing the discriminative and attention-aware features at different scales and different levels (i.e., both semantic and natural levels). Meanwhile, by focusing on fusing features both selectively and comprehensively, PGA-Net can demonstrate remarkable stability and encouraging performance even without pre-training, allowing the model to be trained truly from scratch. We demonstrate the effectiveness of PGA-Net through validation on the COCO and MPII benchmarks, attaining new state-of-the-art performance. https://github.com/ssr0512/PGA-Net

Abstract:
The semantic SLAM (simultaneous localization and mapping) system is an indispensable module for autonomous indoor parking. Monocular and binocular visual cameras constitute the basic configuration to build such a system. Features used in existing SLAM systems are often dynamically movable, blurred and repetitively textured. By contrast, semantic features on the ground are more stable and consistent in the indoor parking environment. Due to their inabilities to perceive salient features on the ground, existing SLAM systems are prone to tracking loss during navigation. Therefore, a surround-view camera system capturing images from a top-down viewpoint is necessarily called for. To this end, this paper proposes a novel tightly-coupled semantic SLAM system by integrating Visual, Inertial, and Surround-view sensors, VIS SLAM for short, for autonomous indoor parking. In VIS SLAM, apart from low-level visual features and IMU (inertial measurement unit) motion data, parking-slots in surround-view images are also detected and geometrically associated, forming semantic constraints. Specifically, each parking-slot can impose a surround-view constraint that can be split into an adjacency term and a registration term. The former pre-defines the position of each individual parking-slot subject to whether it has an adjacent neighbor. The latter further constrains by registering between each observed parking-slot and its position in the world coordinate system. To validate the effectiveness and efficiency of VIS SLAM, a large-scale dataset composed of synchronous multi-sensor data collected from typical indoor parking sites is established, which is the first of its kind. The collected dataset has been made publicly available at https://cslinzhang.github.io/VISSLAM/.

Abstract:
Binary and quantized neural networks are a promising technique to run convolutional neural networks on mobile or embedded devices. BMXNet 2 is an open-source framework that provides a broad basis for academia and industry. It provides a modern implementation of binary and quantized layers with a wide array of implemented state-of-the-art models. Our implementation fosters reproducibility of other works and our own work through publishing model code, hyperparameters, detailed model graphs, and training logs. Furthermore, we implement several applications for BNNs, including demo applications, which can run on a smartphone or a Raspberry Pi. The code can be found online: https://github.com/hpi-xnor/BMXNet-v2

Abstract:
In this work, we investigate an Active Object Search (AOS) task that is not explicitly addressed in the literature. It aims to actively perform as few action steps as possible to search and locate the target object in a 3D indoor scene. Different from classic object detection that passively receives visual information, this task encourages an intelligent agent to perform active search via reasonable action planning; thus it can better recall the target objects, especially for the challenging situations that the target is far from the agent, blocked by an obstacle and out of view. To handle this AOS task, we formulate a reinforcement learning framework that consists of a 3D object detector, a state controller and a cross-modal action planner to work cooperatively to find out the target object with minimal action steps. During training, we design a novel cost-sensitive active search reward that penalizes inaccurate object search and redundant action steps. To evaluate this novel task, we construct an Active Object Search (AOS) benchmark that contains 5,845 samples from 30 diverse indoor scenes. We conduct extensive qualitative and quantitative evaluations on this benchmark to demonstrate the effectiveness of the proposed approach and analyze the key factors that contribute more to address this task.

Abstract:
While widely adopted in practical applications, face recognition has been critically discussed regarding the malicious use of face images and the potential privacy problems, e.g., deceiving payment systems and causing personal sabotage. Online photo sharing services unintentionally act as the main repository for malicious crawlers and face recognition applications. This work aims to develop a privacy-preserving solution, called Adversarial Privacy-preserving Filter (APF), to protect online shared face images from being maliciously used. We propose an end-cloud collaborated adversarial attack solution to satisfy the requirements of privacy, utility and non-accessibility. Specifically, the solution consists of three modules: (1) image-specific gradient generation, to extract the image-specific gradient on the user end with a compressed probe model; (2) adversarial gradient transfer, to fine-tune the image-specific gradient in the server cloud; and (3) universal adversarial perturbation enhancement, to append an image-independent perturbation to derive the final adversarial noise. Extensive experiments on three datasets validate the effectiveness and efficiency of the proposed solution. A prototype application is also released for further evaluation. We hope the end-cloud collaborated attack framework can shed light on addressing the privacy-preserving issues of online multimedia sharing from the user side.

Abstract:
Image compression, as one of the fundamental low-level image processing tasks, is essential for computer vision. Tremendous computing and storage resources can be saved with a trivial amount of visual information. Conventional image compression methods tend to obtain compressed images by minimizing their appearance discrepancy with the corresponding original images, but pay little attention to their efficacy in downstream perception tasks, e.g., image recognition and object detection. Thus, some of the compressed images could be recognized with bias. In contrast, this paper aims to produce compressed images by pursuing both appearance and perceptual consistency. Based on the encoder-decoder framework, we propose using a pre-trained CNN to extract features of the original and compressed images and making them similar. Thus the compressed images are discernible to subsequent tasks, and we name our method Discernible Image Compression (DIC). In addition, the maximum mean discrepancy (MMD) is employed to minimize the difference between feature distributions. The resulting compression network can generate images with high image quality and preserve consistent perception in the feature domain, so that these images can be well recognized by pre-trained machine learning models. Experiments on benchmarks demonstrate that images compressed using the proposed method can also be well recognized by subsequent visual recognition and detection models. For instance, the mAP value of images compressed by DIC is about 0.6% higher than that of images compressed by conventional methods.
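
To make the MMD term concrete, the sketch below computes the standard biased estimate of the squared maximum mean discrepancy between two sets of feature vectors. The Gaussian kernel and its bandwidth are assumptions chosen for illustration; the abstract does not say which kernel or estimator DIC uses.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys.

    xs, ys: non-empty lists of feature vectors, e.g. CNN features of
    original and compressed images (an assumed use, for illustration).
    """
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    # E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

The estimate is zero when the two samples are identical and grows as the two feature distributions move apart, which is why it can serve as a loss that pulls the distributions together.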

Abstract:
We present deep shapely portraits, a novel method based on deep learning, to automatically reshape an input portrait to be better proportioned and more shapely while keeping personal facial characteristics. Different from existing methods that may suffer from irrational face artifacts when dealing with portraits with large pose variations or reshaping adjustments, we utilize dense 3D face information and constraints instead of sparse facial landmarks based on 3D morphable models, resulting in better reshaped faces lying in rational face space. To this end, we first estimate the best shapely degree for the input portrait using a convolutional neural network (CNN) trained on our newly developed ShapeFaceNet dataset. Then the best shapely degree is used as the control parameter to reshape the 3D face reconstructed from the input portrait image. After that, we render the reshaped 3D face back to 2D and generate a seamless portrait image using a fast image warping optimization. Our work can deal with pose and expression free (PE-Free) portrait images and generate plausible shapely faces without noticeable artifacts, which cannot be achieved by prior work. We validate the effectiveness, efficiency, and robustness of the proposed method by extensive experiments and user studies.

Abstract:
Image retrieval is a long-standing topic in the multimedia community due to its various applications, e.g., product search and artwork retrieval in museums. The regions in images contain a wealth of information. Users may be interested in the objects presented in the image regions or the relationships between them. However, previous retrieval methods are either limited to a single object in an image or focus on the entire visual scene. In this paper, we introduce a new task called expressional region retrieval, in which the query is formulated as an image region with an associated description. The goal is to find images containing content similar to the query and localize the regions within them. As far as we know, this task has not been explored yet. We propose a framework to address this issue. The region proposals are first generated based on region detectors, and language features are extracted. Then the Gated Residual Network (GRN) takes language information as a gate to control the transformation of visual features. In this way, the combined visual and language representation is more specific and discriminative for expressional region retrieval. We evaluate our method on a newly established benchmark constructed based on the Visual Genome dataset. Experimental results demonstrate that our model effectively utilizes both visual and language information, outperforming the baseline methods.

Abstract:
In this paper, we aim to understand the functionality of 2D sketches by predicting how humans would interact with the objects depicted by sketches in real life. Given a 2D sketch, we learn to predict a tactile saliency map for it, which represents where humans would grasp, press, or touch the object depicted by the sketch. We hypothesize that understanding 3D structure and category of the sketched object would help such tactile saliency reasoning. We thus propose to jointly predict the tactile saliency, depth map and semantic category of a sketch in an end-to-end learning-based framework. To train our model, we propose to synthesize training data by leveraging a collection of 3D shapes with 3D tactile saliency information. Experiments show that our model can predict accurate and plausible tactile saliency maps for both synthetic and real sketches. In addition, we also demonstrate that our predicted tactile saliency is beneficial to sketch recognition and sketch-based 3D shape retrieval, and enables us to establish part-based functional correspondences among sketches.

Abstract:
We present the problem of Visually Precise Query (VPQ) generation, which enables a more intuitive match between a user's information need and an e-commerce site's product description. Given an image of a fashion item, what is the optimal search query that will retrieve the exact same or closely related product(s) with high probability? In this paper we introduce the task of VPQ generation, which takes a product image and its title as input and provides a word-level extractive summary of the title, containing a list of salient attributes, which can then be used as a query to search for similar products. We collect a large dataset of fashion images and their titles and merge it with an existing research dataset that was created for a different task. Given the image and title pair, the VPQ problem is posed as identifying a non-contiguous collection of spans within the title. We provide a dataset of around 400K image, title and corresponding VPQ entries and release it to the research community. We provide a detailed description of the data collection process and discuss future research directions for the problem introduced in this work. We provide standard text and visual domain baseline comparisons, as well as multi-modal baseline models, to analyze the task introduced in this work. Finally, we propose a hybrid fusion model which promises to be the direction of research in the multi-modal community.

Abstract:
Recovering a 3D shape representation from one single image input has been attempted in recent years. Most of the works obtain 3D models from multiple images at different perspectives or ground truth CAD models. However, multiple images from different perspectives or 3D CAD models are not always available in real applications. In this work, we present a novel shape-from-silhouette method based on just a single image, which is an end-to-end learning framework relying on view synthesis and shape-from-silhouette methodology to reconstruct a 3D shape. The reconstructed 3D mesh can approach the real shape of target objects by constraining the silhouettes from both horizontal and vertical directions, especially for those objects with occlusions. Our proposed method achieves state-of-the-art performance on the ShapeNet dataset compared with other recent approaches targeting 3D reconstruction from a single image. Without requiring labor-intensive and time-consuming human annotations, the work has a broad potential to be applied in real-world applications.

Abstract:
Given a partially masked image, image inpainting aims to complete the missing region and output a plausible image. Most existing image inpainting methods complete the missing region by expanding or borrowing information from the surrounding source region, which works well when the original content in the missing region is similar to the surrounding source region. Unsatisfactory results will be generated if there is not sufficient contextual information that can be referenced from the source region. Besides, the inpainting results should be diverse, and this kind of diversity should be controllable. Based on these observations, we propose a new inpainting problem that introduces text as a kind of guidance to direct and control the inpainting process. The main difference between this problem and previous works is that we need to ensure the result is consistent with not only the source region but also the textual guidance during inpainting. In this way, we aim to avoid unreasonable completion while making the process controllable. We propose a progressively coarse-to-fine cross-modal generative network and adopt the text-image-text training schema to generate visually consistent and semantically coherent images. Extensive quantitative and qualitative experiments on two public datasets with captions demonstrate the effectiveness of our method.

Abstract:
In this work, we introduce an important but still unexplored research task -- image sentiment transfer. Compared with other related tasks that have been well-studied, such as image-to-image translation and image style transfer, transferring the sentiment of an image is more challenging. Given an input image, the rule to transfer the sentiment of each contained object can be completely different, making existing approaches that perform global image transfer by a single reference image inadequate to achieve satisfactory performance. In this paper, we propose an effective and flexible framework that performs image sentiment transfer at the object level. It first detects the objects and extracts their pixel-level masks, and then performs object-level sentiment transfer guided by multiple reference images for the corresponding objects. For the core object-level sentiment transfer, we propose a novel Sentiment-aware GAN (SentiGAN). Both global image-level and local object-level supervisions are imposed to train SentiGAN. More importantly, an effective content disentanglement loss cooperating with a content alignment step is applied to better disentangle the residual sentiment-related information of the input image. Extensive quantitative and qualitative experiments are performed on the object-oriented VSO dataset we create, demonstrating the effectiveness of the proposed framework.

Abstract:
Food is central to life. Food provides us with energy and foundational building blocks for our body and is also a major source of joy and new experiences. A significant part of the overall economy is related to food. Food science, distribution, processing, and consumption have been addressed by different communities using silos of computational approaches. In this paper, we adopt a person-centric multimedia and multimodal perspective on food computing and show how multimedia and food computing are synergistic and complementary. Enjoying food is a truly multimedia experience involving sight, taste, smell, and even sound, that can be captured using a multimedia food logger. The biological response to food can be captured using multimodal data streams using available wearable devices. Central to this approach is the Personal Food Model. Personal Food Model is the digitized representation of the food-related characteristics of an individual. It is designed to be used in food recommendation systems to provide eating-related recommendations that improve the user's quality of life. To model the food-related characteristics of each person, it is essential to capture their food-related enjoyment using a Preferential Personal Food Model and their biological response to food using their Biological Personal Food Model. Inspired by the power of 3-dimensional color models for visual processing, we introduce a 6-dimensional taste-space for capturing culinary characteristics as well as personal preferences. We use event mining approaches to relate food with other life and biological events to build a predictive model that could also be used effectively in emerging food recommendation systems.

Abstract:
Logging what we eat is important for individuals, and the aggregated information in these logs is important for businesses as well as public health. Food logging has received very little attention and has been mostly limited to the recognition of food items, ignoring context, situation, and health variables completely. In this demo, we let the audience interact with our multimedia food logger system. We also describe how this system captures the major food-related information that could be used by all stakeholders in the food ecosystem. We will demonstrate the complete functionality of such a system in this demo.

Abstract:
Human keypoint detection is a challenging task, especially under blurry and crowded conditions. However, existing networks for human keypoint detection have become increasingly deep. During backpropagation, the final supervision information of the network often cannot effectively guide the training of the entire network. Therefore, how to guide a deep network to train effectively is a subject worth discussing. In this paper, the knowledge distillation method is used to make the network's prediction results act as supervision information, and a multi-stage supervision training framework is designed from shallow to deep layers. Besides, to further improve the feature expression ability and enlarge the receptive field of the network, we also design a new convolution module, which can model the channel and spatial features separately. Finally, our method improves the result from AP49 to AP55 on the HiEve human keypoint detection dataset [1], which demonstrates the superior performance and effectiveness of our method.

Abstract:
Digital humans find their applications in areas such as virtual companion, virtual reporter, and virtual narrator. As the global trend of digitalization continues, the value of digital humans continues to increase. For example, a virtual teacher may mimic human teachers to deliver personalized education to students spread all over the world at a lower cost. There are many technical difficulties yet to be solved to make digital humans truly valuable. In this talk, I report our recent progress on addressing two of these difficulties: multi-modal text-to-speech synthesis and multi-modal voice separation and recognition. To address the multi-modal text-to-speech synthesis problem, we developed the duration informed attention network (DurIAN) [1]. DurIAN enhances the attention-based alignment in state-of-the-art (SOTA) end-to-end speech synthesis systems such as Tacotron2 [2] with duration information estimated from the rich text input. This technology, while generating high-quality natural speech, avoids common pitfalls of pure end-to-end systems such as repeated and missing words. More importantly, the system can easily align the facial representation and synthesized speech through the duration model. To more robustly drive the facial expression and mouth movement, we developed a 3D-model guided framework for multi-modal synthesis. To solve the multi-modal voice separation and recognition problem, which is needed in many scenarios such as virtual receptionist, we developed an all deep learning beamformer [3] which integrates the conventional minimum variance distortionless response (MVDR) beamformer, the recurrent neural network-based statistics estimator, and the visual cue guided speaker tracing and diarization system [4]. Our novel approach significantly improved the quality of the separated speech.

Abstract:
This paper presents a neural network model to generate a virtual violinist's 3-D skeleton movements from music audio. Improving on the conventional recurrent neural network models for generating 2-D skeleton data in previous works, the proposed model incorporates an encoder-decoder architecture, as well as the self-attention mechanism, to model the complicated dynamics in body movement sequences. To facilitate the optimization of the self-attention model, beat tracking is applied to determine effective sizes and boundaries of the training examples. The decoder is accompanied by a refining network and a bowing attack inference mechanism to emphasize the right-hand behavior and bowing attack timing. Both objective and subjective evaluations reveal that the proposed model outperforms the state-of-the-art methods. To the best of our knowledge, this work represents the first attempt to generate 3-D violinists' body movements considering key features in musical body movement.

Abstract:
Object contour detection is a fundamental preprocessing step for multimedia applications such as icon generation, object segmentation, and tracking. The quality of contour prediction is of great importance in these applications since it affects the subsequent process. In this work, we aim to develop a high-performance contour detection system. We first propose a novel yet very effective loss function for contour detection. The proposed loss function is capable of penalizing the distance of contour-structure similarity between each pair of prediction and ground-truth. Moreover, to better distinguish object contours from background textures, we introduce a novel convolutional encoder-decoder network. Within the network, we present a hyper module that captures dense connections among high-level features and produces effective semantic information. Then the information is progressively propagated and fused with low-level features. We conduct extensive experiments on the BSDS500 and Multi-Cue datasets, and the results show significant improvement over the state-of-the-art competitors. We further demonstrate the benefit of our DSCD method for crowd counting.

Abstract:
In many intelligent systems, a network of agents collaboratively perceives the environment for better and more efficient situation awareness. As these agents often have limited resources, it could be greatly beneficial to identify the content overlapping among camera views from different agents and leverage it for reducing the processing, transmission and storage of redundant/unimportant video frames. This paper presents a consensus-based distributed multi-agent video fast-forwarding framework, named DMVF, that fast-forwards multi-view video streams collaboratively and adaptively. In our framework, each camera view is addressed by a reinforcement learning based fast-forwarding agent, which periodically chooses from multiple strategies to selectively process video frames and transmits the selected frames at adjustable paces. During every adaptation period, each agent communicates with a number of neighboring agents, evaluates the importance of the selected frames from itself and those from its neighbors, refines such evaluation together with other agents via a system-wide consensus algorithm, and uses such evaluation to decide their strategy for the next period. Compared with approaches in the literature on a real-world surveillance video dataset VideoWeb, our method significantly improves the coverage of important frames and also reduces the number of frames processed in the system.
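
The abstract does not specify which consensus algorithm DMVF uses. As an illustration of the general idea of agents refining their evaluations through neighbor communication, the sketch below implements textbook average consensus on an undirected communication graph; the function names, the step size, and the example graph are assumptions, not details from the paper.

```python
def consensus_step(scores, neighbors, step=0.3):
    """One synchronous round of average consensus.

    scores    : list with one importance score per agent
    neighbors : neighbors[i] lists the agents that agent i talks to
                (assumed symmetric, i.e. an undirected graph)
    step      : step size; should stay below 1 / (maximum node degree)
    """
    return [
        s + step * sum(scores[j] - s for j in neighbors[i])
        for i, s in enumerate(scores)
    ]


def run_consensus(scores, neighbors, step=0.3, rounds=50):
    """Repeat the update; on a connected graph all scores approach the mean."""
    for _ in range(rounds):
        scores = consensus_step(scores, neighbors, step)
    return scores
```

For three agents on a line with `neighbors = [[1], [0, 2], [1]]`, each agent exchanges values only with its direct neighbors, yet all of them converge toward the network-wide average of the initial scores.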

Abstract:
In this paper, we introduce an interactive background music synthesis algorithm guided by visual content. We leverage a cascading strategy to synthesize background music in two stages: Scene Visual Analysis and Background Music Synthesis. First, seeking a deep learning-based solution, we leverage neural networks to analyze the sentiment of the input scene. Second, real-time background music is synthesized by optimizing a cost function that guides the selection and transition of music clips to maximize the emotion consistency between visual and auditory criteria, and music continuity. In our experiments, we demonstrate the proposed approach can synthesize dynamic background music for different types of scenarios. We also conducted quantitative and qualitative analysis on the synthesized results of multiple example scenes to validate the efficacy of our approach.

Abstract:
Recently, deep Siamese matching networks have attracted increasing attention for visual tracking. Despite the demonstrated successes, Siamese trackers do not take full advantage of the structural information of target objects. They tend to drift in the presence of non-rigid deformation or partial occlusion. In this paper, we propose to advance Siamese trackers with graph convolutional networks, which pay more attention to the structural layout of target objects, to learn features robust to large appearance changes over time. Specifically, we divide the target object into several sub-parts and design an attentive graph convolutional network to model the relationship between parts. We incrementally update the attention coefficients of the graph with the attention scheme at each frame in an end-to-end manner. To further improve localization accuracy, we propose a learnable cascade regression algorithm based on deep reinforcement learning to refine the predicted bounding boxes. Extensive experiments on seven challenging benchmark datasets, i.e., OTB-100, TC-128, VOT2018, VOT2019, TrackingNet, GOT-10k and LaSOT, demonstrate that the proposed tracking method performs favorably against state-of-the-art approaches.

Abstract:
In this work, we present interpGaze, a novel framework for controllable gaze redirection that achieves both precise redirection and continuous interpolation. Given two gaze images with different attributes, our goal is to redirect the eye gaze of one person into any gaze direction depicted in the reference image or to generate continuous intermediate results. To accomplish this, we design a model including three cooperative components: an encoder, a controller and a decoder. The encoder maps images into a well-disentangled and hierarchically-organized latent space. The controller adjusts the magnitudes of latent vectors to the desired strength of corresponding attributes by altering a control vector. The decoder converts the desired representations from the attribute space to the image space. To facilitate covering the full space of gaze directions, we introduce a high-quality gaze image dataset with a large range of directions, which also benefits researchers in related areas. Extensive experimental validation and comparisons to several baseline methods show that the proposed interpGaze outperforms state-of-the-art methods in terms of image quality and redirection precision.

Abstract:
Crowd counting has attracted increasing attention due to its wide application prospects. One of the most essential challenges in this domain is large scale variation, which impacts the accuracy of density estimation. To this end, we propose a scale-aware progressive optimization network (SPO-Net) for crowd counting, which trains a scale-adaptive network to achieve high-quality density map estimation and overcome the variable scale dilemma in highly congested scenes. Concretely, the first phase of SPO-Net, the band-pass stage, mainly concentrates on preprocessing the input image and fusing both high-level semantic information and low-level spatial information from separated multi-layer features. The second phase of SPO-Net, the rolling guidance stage, aims to learn a scale-adapted network from multi-scale features in a rolling training manner. To better learn the local correlation of multi-size regions and reduce redundant calculations, we introduce a progressive optimization strategy. Extensive experiments on three challenging crowd counting datasets not only demonstrate the efficacy of each part in SPO-Net, but also suggest the superiority of our proposed method compared with the state-of-the-art approaches.

Abstract:
Domain Adaptation (DA) aims at transferring knowledge from a labeled source domain to an unlabeled target domain. While remarkable advances have been witnessed recently, the power of DA methods still heavily depends on the network depth, especially when the domain discrepancy is large, posing an unprecedented challenge to DA in low-resource scenarios where fast and adaptive inference is required. How to bridge transferability and resource-efficient inference in DA becomes an important problem. In this paper, we propose Resource Efficient Domain Adaptation (REDA), a general framework that can adaptively adjust computation resources across 'easier' and 'harder' inputs. Based on existing multi-exit architectures, REDA has two novel designs: 1) Transferable distillation to distill the transferability of top classifier into the early exits; 2) Consistency weighting to control the distillation degree via prediction consistency. As a general method, REDA can be easily applied with a variety of DA methods. Empirical results and analyses justify that REDA can substantially improve the accuracy and accelerate the inference under domain shift and low resource.
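
The multi-exit idea that REDA builds on can be sketched as confidence-thresholded early exiting: an 'easier' input leaves at a shallow classifier, a 'harder' one continues to deeper exits. The code below is a generic illustration only. It does not implement REDA's transferable distillation or consistency weighting, and the threshold value and function names are assumptions. For simplicity the logits of every exit are passed in up front, whereas a real network would compute deeper exits only when needed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(exit_logits, threshold=0.9):
    """Return (exit_index, predicted_class) for one input.

    exit_logits : non-empty list of logit vectors, ordered from the
                  shallowest exit to the deepest one
    threshold   : minimum softmax confidence required to stop early
    """
    last = len(exit_logits) - 1
    for idx, logits in enumerate(exit_logits):
        probs = softmax(logits)
        confidence = max(probs)
        # Stop at the first confident exit; the deepest exit always answers.
        if confidence >= threshold or idx == last:
            return idx, probs.index(confidence)
```

With this rule, the average inference cost falls as more inputs are resolved at shallow exits, which is why the accuracy of the early exits matters under domain shift.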

Abstract:
Fast and accurate identification of co-interest persons, who draw the joint interest of the surrounding people, plays an important role in social scene understanding and surveillance. Previous studies mainly focus on detecting co-interest persons from a single-view video. In this paper, we study a much more realistic and challenging problem, namely co-interest person (CIP) detection from multiple temporally-synchronized videos taken from complementary and time-varying views. Specifically, we use a top-view camera, mounted on a flying drone at a high altitude, to obtain a global view of the whole scene and all subjects on the ground, and multiple horizontal-view cameras, worn by selected subjects, to obtain a local view of their nearby persons and environment details. We present an efficient top- and horizontal-view data fusion strategy to map multiple horizontal views into the global top view. We then propose a spatial-temporal CIP potential energy function that jointly considers both intra-frame confidence and inter-frame consistency, thus leading to an effective Conditional Random Field (CRF) formulation. We also construct a complementary-view video dataset, which provides a benchmark for the study of multi-view co-interest person detection. Extensive experiments validate the effectiveness and superiority of the proposed method.

Abstract:
As a structured representation of the image content, the visual scene graph (visual relationship) acts as a bridge between computer vision and natural language processing. Existing models on the scene graph generation task notoriously require tens or hundreds of labeled samples. By contrast, human beings can learn visual relationships from a few or even one example. Inspired by this, we design a task named One-Shot Scene Graph Generation, where each relationship triplet (e.g., "dog-has-head") comes from only one labeled example. The key insight is that rather than learning from scratch, one can utilize rich prior knowledge. In this paper, we propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task. Specifically, the Relational Knowledge represents the prior knowledge of relationships between entities extracted from the visual content, e.g., the visual relationships "standing in", "sitting in", and "lying in" may exist between "dog" and "yard", while the Commonsense Knowledge encodes "sense-making" knowledge like "dog can guard yard". By organizing these two kinds of knowledge in a graph structure, Graph Convolution Networks (GCNs) are used to extract knowledge-embedded semantic features of the entities. Besides, instead of extracting isolated visual features from each entity generated by Faster R-CNN, we utilize an Instance Relation Transformer encoder to fully explore their context information. Based on a constructed one-shot dataset, the experimental results show that our method significantly outperforms existing state-of-the-art methods by a large margin. Ablation studies also verify the effectiveness of the Instance Relation Transformer encoder and the Multiple Structured Knowledge.

Abstract:
Style transfer aims to synthesize an image which inherits the content of one image while preserving a similar style of the other one. The "style" of an image usually refers to its unique feeling conveyed from visual features, which is highly related to the aesthetic effect of the image. Aesthetic effect can be mainly decomposed into two factors: colour and texture. Previous methods like Neural Style Transfer and Colour Transfer have shown strong abilities in transferring colour and texture features. However, such approaches neglect to further disentangle colour and texture, which makes some unique aesthetic effects designed by human artists hard to express. In this paper, we propose a novel problem called the Aesthetic-Aware Image Style Transfer task, which aims to transfer colour and texture separately and independently to manipulate the aesthetic effect of an image. We propose a novel Aesthetic-Aware Model-Optimisation-Based Style Transfer (AAMOBST) model to solve this problem. Specifically, AAMOBST is a multi-reference, two-path model. It uses different reference images to decide desired colour and texture features. It can segregate colour and texture into two distinct paths and transfer them independently. Qualitative and quantitative experiments show that our model can decide colour and texture features separately and is able to keep one of them fixed while changing the other one, which is not possible with previous methods. Furthermore, on tasks that previous methods can handle (such as style transfer, colour-preserved transfer and colour-only transfer), our model shows comparable abilities with other baseline methods.

Abstract:
Recent works in cross-modal image-to-recipe retrieval pave a new way to scale up food recognition. By learning the joint space between food images and recipes, food recognition is reduced to a retrieval problem that evaluates the similarity of embedded features. The major drawback, nevertheless, is the difficulty in applying an already-trained model to recognize dishes from cuisines unknown to the model. In general, model updating with new training examples, in the form of image-recipe pairs, is required to adapt a model to new cooking styles in a cuisine. In practice, however, acquiring a sufficient number of image-recipe pairs for model transfer can be time-consuming. This paper addresses the challenge of resource scarcity in the scenario where only partial data, instead of a complete view of the data, is accessible for model transfer. Partial data refers to missing information, such as the absence of the image modality or cooking instructions from an image-recipe pair. To cope with partial data, a novel generic model, equipped with various loss functions including cross-modal metric learning, recipe residual loss, semantic regularization and adversarial learning, is proposed for cross-domain transfer learning. Experiments are conducted on three different cuisines (Chuan, Yue and Washoku) to provide insights on scaling up food recognition across domains with limited training resources.

Abstract:
Understanding and reasoning over partially observed visual clues are often regarded as a challenging real-world problem even for human beings. In this paper, we present a new visual question answering (VQA) task -- Photo Stream QA, which aims to answer the open-ended questions about a narrative photo stream. Photo Stream QA is more challenging and interesting than the existing VQA tasks, since the temporal and visual variance among photos in the stream is huge and hard to observe. Therefore, instead of learning simple vision-text mappings, the AI algorithms must fill these variance gaps with more recollection, reasoning, even the knowledge from our daily experiences. To tackle the problems in Photo Stream QA, we propose an end-to-end baseline (E-TAA) with a novel Experienced Unit (E-unit) and Three-stage Alternating Attention (TAA). E-unit yields a better visual representation which captures the temporal semantic relation among visual clues in the photo stream, while TAA creates three levels of attention that gradually refines visual features by using the textual representation from the question as the guidance. Experimental results on our developed dataset demonstrate that, as the first attempt at the Photo Stream QA task, E-TAA provides promising results outperforming all the other baseline methods.

Abstract:
Customizable makeup transfer, which aims to transfer the makeup from an arbitrary reference face to a source face, is widely demanded in many applications such as short video platforms and online meeting applications. However, existing methods are neither user-friendly nor sufficiently fast. In this demo, we present the first fast makeup transfer system named as Fast Pose and expression robust Spatial-Aware GAN (FPSGAN). With a novel Attentive Makeup Morphing (AMM) module, FPSGAN is robust to face pose and expression. Moreover, it can achieve shade-controllable and partial makeup, improving the system's user-friendliness. In addition, FPSGAN is light-weighted and fast. To sum up, FPSGAN is the first fast customizable makeup transfer system to enable users to beautify themselves as they like.

Abstract:
Food computing applies computational approaches for acquiring and analyzing heterogeneous food data from disparate sources for perception, recognition, retrieval, recommendation, prediction and monitoring of food to address food-related issues in multimedia and beyond. It has received more attention from both academia and industry as one emerging interdiscipline for its various applications, such as improving human health and understanding the culinary culture. Recently, there are more studies on food computing in the multimedia, such as food recognition and multimodal recipe analysis. This tutorial will provide a basic understanding of food computing, and discuss its use in various multimedia tasks, ranging from food recognition, retrieval, recommendation, recipe analysis to cooking behavior understanding. Specifically, we will first introduce food computing, including its method, task and applications. Then we will discuss several typical tasks of food computing in the multimedia including food image recognition, food retrieval and recommendation, multimodal recipe analysis and cooking action anticipation. Finally, we will point out future research directions on food computing in the multimedia.

Abstract:
Deep learning has been successfully developed as a complicated learning process from source inputs to target outputs in presence of multimedia environments. The inference or optimization is performed over an assumed deterministic model with deep structure. A wide range of temporal and spatial data in language and vision are treated as the inputs or outputs to build such a domain mapping for multimedia applications. A systematic and elaborate transfer is required to meet the mapping between source and target domains. Also, the semantic structure in natural language and computer vision may not be well represented or trained in mathematical logic or computer programs. The distribution function in discrete or continuous latent variable model for words, sentences, images or videos may not be properly decomposed or estimated. The system robustness to heterogeneous environments may not be assured. This tutorial addresses the fundamentals and advances in statistical models and neural networks for domain mapping, and presents a series of deep Bayesian solutions including variational Bayes, sampling method, Bayesian neural network, variational auto-encoder (VAE), stochastic recurrent neural network, sequence-to-sequence model, attention mechanism, end-to-end network, stochastic temporal convolutional network, temporal difference VAE, normalizing flow and neural ordinary differential equation. Enhancing the prior/posterior representation is addressed in different latent variable models. We illustrate how these models are connected and why they work for a variety of applications on complex patterns in language and vision. The word, sentence and image embeddings are merged with semantic constraint or structural information. Bayesian learning is formulated in the optimization procedure where the posterior collapse is tackled. An informative latent space is trained to incorporate deep Bayesian learning in various information systems.

Abstract:
Stock price movement and volatility prediction aim to predict stocks' future trends to help investors make sound investment decisions and model financial risk. Companies' earnings calls are a rich, underexplored source of multimodal information for financial forecasting. However, existing fintech solutions are not optimized towards harnessing the interplay between the multimodal verbal and vocal cues in earnings calls. In this work, we present a multi-task solution that utilizes domain specialized textual features and audio attentive alignment for predictive financial risk and price modeling. Our method advances existing solutions in two aspects: 1) tailoring a deep multimodal text-audio attention model, 2) optimizing volatility, and price movement prediction in a multi-task ensemble formulation. Through quantitative and qualitative analyses, we show the effectiveness of our deep multimodal approach.

Abstract:
In pop music, accompaniments are usually played by multiple instruments (tracks) such as drum, bass, string and guitar, and can make a song more expressive and contagious by arranging together with its melody. Previous works usually generate multiple tracks separately, and the music notes from different tracks do not explicitly depend on each other, which hurts the harmony modeling. To improve harmony, in this paper, we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of the notes from different tracks. While this greatly improves harmony, unfortunately, it enlarges the sequence length and brings the new challenge of long-term music modeling. We further introduce two new techniques to address this challenge: 1) We model multiple note attributes (e.g., pitch, duration, velocity) of a musical note in one step instead of multiple steps, which can shorten the length of a MuMIDI sequence. 2) We introduce extra long-context as memory to capture long-term dependency in music. We call our system for pop music accompaniment generation PopMAG. We evaluate PopMAG on multiple datasets (LMD, FreeMidi and CPMD, a private dataset of Chinese pop songs) with both subjective and objective metrics. The results demonstrate the effectiveness of PopMAG for multi-track harmony modeling and long-term context modeling. Specifically, PopMAG wins 42%/38%/40% votes when compared with ground truth musical pieces on LMD, FreeMidi and CPMD datasets respectively and largely outperforms other state-of-the-art music accompaniment generation models and multi-track MIDI representations in terms of subjective and objective metrics.

Abstract:
Nowadays, digital facial content manipulation has become ubiquitous and realistic with the success of generative adversarial networks (GANs), making face recognition (FR) systems suffer from unprecedented security concerns. In this paper, we investigate and introduce a new type of adversarial attack to evade FR systems by manipulating facial content, called adversarial morphing attack (a.k.a. Amora). In contrast to adversarial noise attack that perturbs pixel intensity values by adding human-imperceptible noise, our proposed adversarial morphing attack works at the semantic level that perturbs pixels spatially in a coherent manner. To tackle the black-box attack problem, we devise a simple yet effective joint dictionary learning pipeline to obtain a proprietary optical flow field for each attack. Our extensive evaluation on two popular FR systems demonstrates the effectiveness of our adversarial morphing attack at various levels of morphing intensity with smiling facial expression manipulations. Both open-set and closed-set experimental results indicate that a novel black-box adversarial attack based on local deformation is possible, and is vastly different from additive noise attacks. The findings of this work potentially pave a new research direction towards a more thorough understanding and investigation of image-based adversarial attacks and defenses.

Abstract:
In this paper, we propose a novel Visual Relation of Interest Detection (VROID) task, which aims to detect visual relations that are important for conveying the main content of an image, motivated from the intuition that not all correctly detected relations are really "interesting" in semantics and only a fraction of them really make sense for representing the image main content. Such relations are named Visual Relations of Interest (VROIs). VROID can be deemed as an evolution over the traditional Visual Relation Detection (VRD) task that tries to discover all visual relations in an image. We construct a new dataset to facilitate research on this new task, named ViROI, which contains 30,120 images each with VROIs annotated. Furthermore, we develop an Interest Propagation Network (IPNet) to solve VROID. IPNet contains a Panoptic Object Detection (POD) module, a Pair Interest Prediction (PaIP) module and a Predicate Interest Prediction (PrIP) module. The POD module extracts instances from the input image and also generates corresponding instance features and union features. The PaIP module then predicts the interest score of each instance pair while the PrIP module predicts that of each predicate for each instance pair. Then the interest scores of instance pairs are combined with those of the corresponding predicates as the final interest scores. All VROI candidates are sorted by final interest scores and the highest ones are taken as final results. We conduct extensive experiments to test effectiveness of our method, and the results show that IPNet achieves the best performance compared with the baselines on visual relation detection, scene graph generation and image captioning.
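
The final scoring and ranking step described above can be illustrated with a minimal Python sketch. The abstract only says that the pair and predicate interest scores are "combined"; the multiplication used here, the function name `rank_relations`, and the toy scores are all assumptions made for illustration, not the authors' implementation.

```python
def rank_relations(pair_scores, predicate_scores, top_k):
    """Combine pair and predicate interest scores and keep the top_k candidates.

    pair_scores: {(subject, object): interest score of the instance pair}
    predicate_scores: {(subject, object): {predicate: interest score}}
    Combining by multiplication is an assumption; the abstract does not
    specify the combination rule.
    """
    candidates = []
    for pair, pair_score in pair_scores.items():
        for predicate, pred_score in predicate_scores.get(pair, {}).items():
            final_score = pair_score * pred_score
            candidates.append((pair[0], predicate, pair[1], final_score))
    # Sort all candidates by final interest score, highest first.
    candidates.sort(key=lambda c: c[3], reverse=True)
    return candidates[:top_k]


# Hypothetical scores for two instance pairs.
pair_scores = {("person", "horse"): 0.9, ("person", "tree"): 0.2}
predicate_scores = {
    ("person", "horse"): {"ride": 0.8, "near": 0.3},
    ("person", "tree"): {"near": 0.9},
}
top = rank_relations(pair_scores, predicate_scores, top_k=2)
```

With these toy inputs, a strongly interesting pair can outrank a weakly interesting pair even when the latter has a more confident predicate.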

Abstract:
In contrast to traditional dehazing methods, deep learning based single image dehazing (SID) algorithms have achieved better performances by creating a mapping function from hazy to haze-free images. Usually, the images taken from natural scenes have different haze levels, but deep SID algorithms only process the hazy images as one group. This makes it difficult for deep SID algorithms to deal with an image set in which some images have a specific haze density. In this paper, a Discrete Haze Level Dehazing network (DHL-Dehaze), a very effective method to dehaze multiple different haze level images, is proposed. The proposed approach considers a single image dehazing problem as a multi-domain image-to-image translation, instead of grouping all hazy images into the same domain. DHL-Dehaze provides computational derivation to describe the role of different haze levels for image translation. To verify the proposed approach, we synthesize two large-scale datasets with multiple haze level images based on the NYU-Depth and DIML/CVL datasets. The experiments show that DHL-Dehaze can obtain excellent quantitative and qualitative dehazing results, especially when the haze concentration is high.

Abstract:
Hashing has become increasingly important for large-scale image retrieval. Recently, deep supervised hashing has shown promising performance, yet little work has been done under the more realistic unsupervised setting. The most challenging problem in unsupervised hashing methods is the lack of supervised information. Besides, existing methods fail to distinguish image pairs with different similarity degrees, which leads to a suboptimal construction of similarity matrix. In this paper, we propose a simple yet effective unsupervised hashing method, dubbed Deep Unsupervised Hybrid-similarity Hadamard Hashing (DU3H), which tackles these issues in an end-to-end deep hashing framework. DU3H employs orthogonal Hadamard codes to provide auxiliary supervised information in unsupervised setting, which can maximally satisfy the independence and balance properties of hash codes. Moreover, DU3H utilizes both highly and normally confident image pairs to jointly construct a hybrid-similarity matrix, which can magnify the impacts of different pairs to better preserve the semantic relations between images. Extensive experiments conducted on three widely used benchmarks validate the superiority of DU3H.
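
To make the "independence and balance" properties of Hadamard codes concrete, the following Python sketch builds a Hadamard matrix and checks both properties. The Sylvester construction and the function name `sylvester_hadamard` are assumptions chosen for illustration; the abstract does not say how DU3H generates or assigns its codes.

```python
import numpy as np


def sylvester_hadamard(n: int) -> np.ndarray:
    """Build an n x n Hadamard matrix with entries +1/-1 (n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1]], dtype=np.int64)
    while h.shape[0] < n:
        # Sylvester doubling step: [[H, H], [H, -H]]
        h = np.block([[h, h], [h, -h]])
    return h


H = sylvester_hadamard(16)

# Independence: distinct rows are mutually orthogonal, so H @ H.T = n * I.
gram = H @ H.T

# Balance: every row except the first (all ones) has as many +1 as -1 bits.
row_sums = H[1:].sum(axis=1)
```

Each row can serve as a candidate 16-bit target code; orthogonality means any two distinct rows differ in exactly half of their bits.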

Abstract:
Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually, which are highly task-specific and hard to generalize to different tasks. In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone, where each encoder or decoder block corresponds to an operation searched from a predefined operation pool. On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks. By using a gradient-based NAS algorithm, the optimal architectures for different tasks are learned efficiently. Extensive ablation studies, comprehensive analysis, and comparative experimental results show that the obtained MMnasNet significantly outperforms existing state-of-the-art approaches across three multimodal learning tasks (over five datasets), including visual question answering, image-text matching, and visual grounding.

Abstract:
In this paper, we propose to investigate the problem of out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of downstream data on which the pretrained model will be fine-tuned. Existing methods for this problem are purely likelihood-based, which leads to spurious correlations and hurts the generalization ability when transferred to out-of-domain downstream tasks. By spurious correlation, we mean that the conditional probability of one token (object or word) given another one can be high (due to the dataset biases) without robust (causal) relationships between them. To mitigate such dataset biases, we propose a Deconfounded Visio-Linguistic Bert framework, abbreviated as DeVLBert, to perform intervention-based learning. We borrow the idea of the backdoor adjustment from the research field of causality and propose several neural-network based architectures for Bert-style out-of-domain pretraining. The quantitative results on three downstream tasks, Image Retrieval (IR), Zero-shot IR, and Visual Question Answering, show the effectiveness of DeVLBert by boosting generalization ability.

Abstract:
When you first encounter a person, a mental image of that person is formed. First impression, an interactive art, is proposed to let AI understand human personality at first glance. The mental image is demonstrated by Beijing opera facial makeups, which shows the character personality with a combination of realism and symbolism. We build Beijing opera facial makeup dataset and semantic dataset of facial features to establish relationships among real faces, personalities and facial makeups. First impression detects faces, recognizes personality from facial appearance and finds the matching Beijing opera facial makeup. Finally, the morphing process from real face to facial makeup is shown to let users enjoy the process of AI understanding personality.

Abstract:
"AI mirror", an interactive art, tends to visualize the self-knowledge mechanism from the AI's perspective, and arouses people's reflection on artificial intelligence. In the first stage of the unconscious imitation, the visual neurons perceive environmental information and mirror neurons imitate human behavior. Then, the language and consciousness are generated from the long term of imitation, denoted as poet and coordinates in an affective space. In the final stage of conscious behavior, an affinity analysis is generated, and the mirror neurons will behave more harmoniously with the user or have the autonomous movements on its own, which evokes the user's reflection on its undiscovered traits.

Abstract:
In this work, we introduce a practical system which synthesizes an appealing image from natural language descriptions such that the generated image should maintain the aesthetic level of photographs. Our proposed method takes the text from the end-users via a user-friendly interface and produces a set of different label maps via the primary generator PG. Then, choosing a subset from the label maps set is performed through the primary aesthetic appreciation PAA. Next, our subset of label maps is fed into the accessory generator AG, which is the state-of-the-art image-to-image translation. Last but not least, our subset of generated images is ranked via the accessory aesthetic appreciation AAA, and the most appealing image is produced.

Abstract:
Event cameras are biologically-inspired sensors that upend the framed, synchronous nature of traditional cameras. Singh et al. proposed a novel sensor design wherein incident light values may be measured directly through continuous integration, with individual pixels' light sensitivity being adjustable in real time, allowing for extremely high frame rate and high dynamic range video capture. Arguing the potential usefulness of this sensor, this paper introduces a system for simulating the sensor's event outputs and pixel firing rate control from 3D-rendered input images.

Abstract:
This paper develops a deep learning model for the beauty product image retrieval problem. The proposed model has two main components: an encoder and a memory. The encoder extracts and aggregates features from a deep convolutional neural network at multiple scales to get feature embeddings. With the use of an attention mechanism and a data augmentation method, it learns to focus on foreground objects and neglect the background in images, so it can extract more relevant features. The memory consists of representative states of all database images as its stacks, and it can be updated during the training process. Based on the memory, we introduce a distance loss to regularize embedding vectors from the encoder to be more discriminative. Our model is fully end-to-end and requires no manual feature aggregation or post-processing. Experimental results on the Perfect-500K dataset demonstrate the effectiveness of the proposed model with a significant retrieval accuracy.

Abstract:
Successive image compression refers to the process of repeated encoding and decoding of an image. It frequently occurs during sharing, manipulation, and re-distribution of images. While deep learning-based methods have made significant progress for single-step compression, thorough analysis of their performance under successive compression has not been conducted. In this paper, we conduct comprehensive analysis of successive deep image compression. First, we introduce a new observation, instability of successive deep image compression, which is not observed in JPEG, and discuss causes of the instability. Then, we conduct a successive image compression benchmark for the state-of-the-art deep learning-based methods, and analyze the factors that affect the instability in a comparative manner. Finally, we propose a new loss function for training deep compression models, called feature identity loss, to mitigate the instability of successive deep image compression.
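
The notion of successive compression can be sketched as a small Python harness that re-encodes the previous output and tracks the error against the original. The toy codec below is plain uniform quantization and stands in for a real codec purely as an assumption; it is not a learned compression model, and the names `toy_codec` and `successive_compress` are hypothetical.

```python
import numpy as np


def toy_codec(image, step=8.0):
    """Stand-in codec: uniform quantization (NOT a learned compression model)."""
    return np.round(image / step) * step


def successive_compress(codec, image, rounds):
    """Apply the codec repeatedly to its own output.

    Returns the mean squared error against the original image after each round.
    """
    errors = []
    current = image
    for _ in range(rounds):
        current = codec(current)
        errors.append(float(np.mean((current - image) ** 2)))
    return errors


rng = np.random.default_rng(0)
image = rng.uniform(0.0, 255.0, size=(32, 32))
errors = successive_compress(toy_codec, image, rounds=5)
```

Because this quantizer maps its own output to itself, the error stays flat after the first round. A codec that is unstable in the sense described above would instead show the error growing from round to round, which is what such a harness is meant to expose.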

Abstract:
The usage of surveillance cameras for video understanding has recently raised concerns about privacy intrusion. This motivates the research community to seek potential alternatives to cameras for emerging multimedia applications. Towards this goal, a few researchers have explored the usage of Wi-Fi or Bluetooth sensors to handle action recognition. However, the practical ability of these sensors is limited by their frequency band and deployment inconvenience because of the separate transmitter/receiver architecture. Motivated by the same purpose of reducing privacy issues, we introduce a recent microwave sensor for multi-person action recognition in this paper. The microwave sensor works in the 77GHz ~ 80GHz band, and integrates both transmitter and receiver, and thus can be easily deployed for action recognition. Despite these advantages, two main challenges remain. One is the difficulty of labelling the invisible signal data with embedded actions. The other is the difficulty of cancelling the environment noise for highly accurate action recognition. To address the challenges, we propose a novel learning framework with originally designed loss functions that take weakly-supervised multi-label learning and attention mechanisms into consideration to improve the accuracy of action recognition. We build a new microwave sensor data set, and conduct comprehensive experiments to evaluate the recognition accuracy of our proposed framework, and the effectiveness of parameters in each component. The experimental results show that our framework outperforms the state-of-the-art methods by up to 14% in terms of mAP.

Abstract:
In this paper, we propose a novel space-time video super-resolution method, which aims to recover a high-frame-rate and high-resolution video from its low-frame-rate and low-resolution observation. Existing solutions seldom consider the spatial-temporal correlation and the long-term temporal context simultaneously and thus are limited in the restoration performance. Inspired by the epipolar-plane image used in multi-view computer vision tasks, we first propose the concept of temporal-profile super-resolution to directly exploit the spatial-temporal correlation in the long-term temporal context. Then, we specifically design a feature shuffling module for spatial retargeting and spatial-temporal information fusion, which is followed by a refining module for artifacts alleviation and detail enhancement. Different from existing solutions, our method does not require any explicit or implicit motion estimation, making it lightweight and flexible to handle any number of input frames. Comprehensive experimental results demonstrate that our method not only generates superior space-time video super-resolution results but also retains competitive implementation efficiency.

Abstract:
In this paper, we propose a novel optimization framework to synthesize an aesthetic pose for the virtual character with respect to the presented user's pose. Our approach applies aesthetic evaluation that exploits fully connected neural networks trained on example images. The aesthetic pose of the virtual character is obtained by optimizing a cost function that guides the rotation of each body joint angles. In our experiments, we demonstrate the proposed approach can synthesize poses for virtual characters according to user pose inputs. We also conducted objective and subjective experiments of the synthesized results to validate the efficacy of our approach.

Abstract:
Recently, to alleviate the data sparsity and cold start problem, many research efforts have been devoted to the usage of knowledge graph (KG) in recommender systems. It is common for most existing KG based models to represent users and items using real-valued embeddings. However, compared with complex or hypercomplex numbers, these real-valued vectors are of less representation capacity and no intrinsic asymmetrical properties, thus may limit the modeling of interactions between entities and relations in KG. In this paper, we propose Quaternion-based Knowledge Graph Network (QKGN) for recommendation, which represents users and items with quaternion embeddings in hypercomplex space, so that the latent inter-dependencies between entities and relations could be captured effectively. In the core of our model, a semantic matching principle based on Hamilton product is applied to learn expressive quaternion representations from the unified user-item KG. On top of this, those embeddings are attentively updated by a customized preference propagation mechanism with structure information concerned. Finally, we apply the proposed QKGN to three real-world datasets of music, movie and book, and experimental results show the validity of our method.
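
For readers unfamiliar with the Hamilton product mentioned above, the following Python sketch implements the standard quaternion multiplication rule on 4-component arrays `(a, b, c, d)` representing `a + bi + cj + dk`. It illustrates only the algebraic operation; how QKGN uses it for semantic matching is not specified in the abstract, and the function name is an assumption.

```python
import numpy as np


def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (a, b, c, d) arrays."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    ], dtype=float)


i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
k = np.array([0.0, 0.0, 0.0, 1.0])

ij = hamilton_product(i, j)  # equals k
ji = hamilton_product(j, i)  # equals -k, so the product is not commutative
```

The non-commutativity shown by `ij` and `ji` is the kind of intrinsic asymmetry that real-valued dot products lack.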

Abstract:
Infrared-visible cross-modality person re-identification (IV-ReID) has attracted much attention with the popularity of dual-mode video surveillance systems, where the RGB mode works in the daytime and automatically switches to the infrared mode at night. Despite its significant application value, IV-ReID remains a difficult problem mainly due to two great challenges. First, it is difficult to identify persons in the infrared image, which lacks color and texture clues. Second, there is a significant gap between the infrared and visible modalities where appearances of the same person vary considerably. This paper proposes a novel attention-based approach to handle the two difficulties in a unified framework. 1) We propose an attention lifting mechanism to learn discriminative features in each modality. 2) We propose a co-attentive learning mechanism to bridge the gap between the two modalities. Our method only makes slight modifications of a given backbone network and requires small computation overhead while improving the performance significantly. We conduct extensive experiments to demonstrate the superiority of our proposed method.

Abstract:
Metric-based few-shot learning methods concentrate on learning transferable feature embedding which generalizes well from seen categories to unseen categories under limited supervision. However, most of the methods treat each individual instance separately without considering its relationships with the others in the working context. We investigate a new metric-learning method to explicitly exploit these relationships. In particular, for an instance, we choose the samples that are visually similar from the working context, and perform weighted information propagation to attentively aggregate helpful information from the chosen samples to enhance its representation. We further formulate the distance metric as a learnable relation module which learns to compare for similarity measurement, and equip the working context with memory slots, both contributing to generality. We empirically demonstrate that the proposed method yields significant improvement over its ancestor and achieves competitive or even better performance when compared with other few-shot learning approaches on the two major benchmark datasets, i.e., miniImageNet and tieredImageNet.

Abstract:
In e-commerce, a growing number of user-generated videos are used for product promotion. How to generate video descriptions that narrate the user-preferred product characteristics depicted in the video is vital for successful promoting. Traditional video captioning methods, which focus on routinely describing what exists and happens in a video, are not amenable for product-oriented video captioning. To address this problem, we propose a product-oriented video captioner framework, abbreviated as Poet. Poet firstly represents the videos as product-oriented spatial-temporal graphs. Then, based on the aspects of the video-associated product, we perform knowledge-enhanced spatial-temporal inference on those graphs for capturing the dynamic change of fine-grained product-part characteristics. The knowledge leveraging module in Poet differs from the traditional design by performing knowledge filtering and dynamic memory modeling. We show that Poet achieves consistent performance improvement over previous methods concerning generation quality, product aspects capturing, and lexical diversity. Experiments are performed on two product-oriented video captioning datasets, buyer-generated fashion video dataset (BFVD) and fan-generated fashion video dataset (FFVD), collected from Mobile Taobao. We will release the desensitized datasets to promote further investigations on both video captioning and general video analysis problems.

Abstract:
The mobile-cloud based visual recognition (MCVR) system, in which the low-end mobile sensors are deployed to persistently collect and transmit visual data to the cloud for analysis and recognition, is important for visual monitoring applications such as wildfire detection, wildlife monitoring, etc. However, the current MCVR systems are mostly human-perception-oriented, which consume many computational resources and much energy for data sensing as well as much bandwidth for data transmission, limiting their large-scale deployment. In this work, we present a machine-perception-oriented MCVR system, called BS-MCVR, where the mobile end is designed to efficiently sense highly compact and discriminative features directly from the scene, and the sensed features are analyzed on the cloud for recognition. Particularly, the mobile end is designed to operate with completely binary operations and generate fixed-point feature maps. Experiments on benchmark datasets show that our system only needs to transmit 1/200 of the amount of original image data without much degradation in recognition accuracy, while it consumes minimal computational cost in the data sensing process. BS-MCVR provides a highly cost-effective solution for deploying MCVR systems at a large scale.

Abstract:
Given base classes with sufficient labeled samples, the target of few-shot classification is to recognize unlabeled samples of novel classes with only a few labeled samples. Most existing methods only pay attention to the relationship between labeled and unlabeled samples of novel classes, which do not make full use of information within base classes. In this paper, we make two contributions to investigate the few-shot classification problem. First, we report a simple and effective baseline trained on base classes in the way of traditional supervised learning, which can achieve comparable results to the state of the art. Second, based on the baseline, we propose a cooperative bi-path metric for classification, which leverages the correlations between base classes and novel classes to further improve the accuracy. Experiments on two widely used benchmarks show that our method is a simple and effective framework, and a new state of the art is established in the few-shot classification field.

Abstract:
Occlusions, scale variation and numerous false positives still represent fundamental challenges in pedestrian detection. Intuitively, different sizes of receptive fields and more attention to the visible parts are required for detecting pedestrians with various scales and occlusion levels, respectively. However, these challenges have not been addressed well by existing pedestrian detectors. This paper presents a novel convolutional network, denoted as box guided convolution network (BGCNet), to tackle these challenges simultaneously in a unified framework. In particular, we propose a box guided convolution (BGC) that can dynamically adjust the sizes of convolution kernels guided by the predicted bounding boxes. In this way, BGCNet provides position-aware receptive fields to address the challenge of large variations of scales. In addition, for the issue of heavy occlusion, the kernel parameters of BGC are spatially localized around the salient and mostly visible key points of a pedestrian, such as the head and foot, to effectively capture high-level semantic features to help detection. Furthermore, a local maximum (LM) loss is introduced to depress false positives and highlight true positives by forcing positives, rather than negatives, to be local maximums, without any additional inference burden. We evaluate BGCNet on popular pedestrian detection benchmarks, and achieve state-of-the-art results, with significant performance improvement on heavily occluded and small-scale pedestrians.

Abstract:
Detecting small-scale pedestrians is one of the most challenging problems in pedestrian detection. Due to the lack of visual details, the representations of small-scale pedestrians tend to be too weak to be distinguished from background clutter. In this paper, we conduct an in-depth analysis of the small-scale pedestrian detection problem, which reveals that weak representations of small-scale pedestrians are the main cause for a classifier to miss them. To address this issue, we propose a novel Self-Mimic Learning (SML) method to improve the detection performance on small-scale pedestrians. We enhance the representations of small-scale pedestrians by mimicking the rich representations from large-scale pedestrians. Specifically, we design a mimic loss to force the feature representations of small-scale pedestrians to approach those of large-scale pedestrians. The proposed SML is a general component that can be readily incorporated into both one-stage and two-stage detectors, with no additional network layers and no extra computational cost during inference. Extensive experiments on both the CityPersons and Caltech datasets show that the detector trained with the mimic loss is significantly more effective for small-scale pedestrian detection and achieves state-of-the-art results on both datasets.
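
The mimic loss described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the distance measure or how the large-scale target is built, so the mean squared error and the averaged target feature below are assumptions, and the function names and feature values are hypothetical.

```python
def mean_feature(features):
    """Average a non-empty list of equal-length feature vectors."""
    dim = len(features[0])
    return [sum(f[d] for f in features) / len(features) for d in range(dim)]


def mimic_loss(small_feature, large_target):
    """Mean squared error between a small-scale feature and a large-scale target.

    Assumption: the abstract does not name the distance; MSE is used here.
    """
    if len(small_feature) != len(large_target):
        raise ValueError("feature vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(small_feature, large_target)]
    return sum(squared) / len(squared)


# Hypothetical 2-D features, for illustration only.
large_target = mean_feature([[1.0, 3.0], [3.0, 5.0]])  # [2.0, 4.0]
loss = mimic_loss([1.0, 4.0], large_target)             # 0.5
```

Because such a term only affects training, it is consistent with the abstract's claim that SML adds no layers and no inference cost.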

Abstract:
Recent years have witnessed increasing attention to cartoon media, powered by the strong demands of industrial applications. As the first step to understanding this medium, cartoon face recognition is a crucial but less-explored task with few datasets proposed. In this work, we first present a new challenging benchmark dataset, consisting of 389,678 images of 5,013 cartoon characters annotated with identity, bounding box, pose, and other auxiliary attributes. The dataset, named iCartoonFace, is currently the largest-scale, high-quality, richly annotated dataset in the field of image recognition, and it spans multiple occurrences, including near-duplications, occlusions, and appearance changes. In addition, we provide two types of annotations for cartoon media, i.e., face recognition and face detection, with the help of a semi-automatic labeling algorithm. To further investigate this challenging dataset, we propose a multi-task domain adaptation approach that jointly utilizes the human and cartoon domain knowledge with three discriminative regularizations. We then perform a benchmark analysis of the proposed dataset and verify the superiority of the proposed approach in the cartoon face recognition task. The dataset is available at https://iqiyi.cn/icartoonface.

Abstract:
Human novel view synthesis aims to synthesize target views of a human subject given input images taken from one or more reference viewpoints. Despite significant advances in model-free novel view synthesis, existing methods present two major limitations when applied to complex shapes like humans. First, these methods mainly focus on simple and symmetric objects, e.g., cars and chairs, which limits their performance on fine-grained and asymmetric shapes. Second, existing methods cannot guarantee visual consistency across different adjacent views of the same object. To solve these problems, we present in this paper a learning framework for the novel view synthesis of human subjects, which explicitly enforces consistency across different generated views of the subject. Specifically, we introduce a novel multi-view supervision and an explicit rotational loss during the learning process, enabling the model to preserve detailed body parts and to achieve consistency between adjacent synthesized views. To show the superior performance of our approach, we present qualitative and quantitative results on the Multi-View Human Action (MVHA) dataset we collected (consisting of 3D human models animated with different Mocap sequences and captured from 54 different viewpoints), the Pose-Varying Human Model (PVHM) dataset, and ShapeNet. The qualitative and quantitative results demonstrate that our approach outperforms the state-of-the-art baselines both in per-view synthesis quality and in preserving rotational consistency and complex shapes (e.g., fine-grained details, challenging poses) across multiple adjacent views in a variety of scenarios, for both humans and rigid objects.

Abstract:
Constructing fine-grained image datasets typically requires domain-specific expert knowledge, which is not always available for crowd-sourcing platform annotators. Accordingly, learning directly from web images becomes an alternative method for fine-grained visual recognition. However, label noise in the web training set can severely degrade the model performance. To this end, we propose a data-driven meta-set based approach to deal with noisy web images for fine-grained recognition. Specifically, guided by a small amount of clean meta-set, we train a selection net in a meta-learning manner to distinguish in- and out-of-distribution noisy images. To further boost the robustness of the model, we also learn a labeling net to correct the labels of in-distribution noisy data. In this way, our proposed method can alleviate the harmful effects caused by out-of-distribution noise and properly exploit the in-distribution noisy samples for training. Extensive experiments on three commonly used fine-grained datasets demonstrate that our approach is much superior to state-of-the-art noise-robust methods.

Abstract:
Recently, deep convolutional neural networks (CNNs) have been widely used in image restoration and have obtained great success. However, most existing methods are limited by local receptive fields and by equal treatment of different types of information. Besides, existing methods always use a multi-supervised method to aggregate different feature maps, which cannot effectively aggregate hierarchical feature information. To address these issues, we propose an attention cube network (A-CubeNet) for image restoration for more powerful feature expression and feature correlation learning. Specifically, we design a novel attention mechanism from three dimensions, namely the spatial dimension, the channel-wise dimension and the hierarchical dimension. The adaptive spatial attention branch (ASAB) and the adaptive channel attention branch (ACAB) constitute the adaptive dual attention module (ADAM), which can capture the long-range spatial and channel-wise contextual information to expand the receptive field and distinguish different types of information for more effective feature representations. Furthermore, the adaptive hierarchical attention module (AHAM) can capture the long-range hierarchical contextual information to flexibly aggregate different feature maps by weights depending on the global context. The ADAM and AHAM cooperate to form an 'attention in attention' structure, which means AHAM's inputs are enhanced by ASAB and ACAB. Experiments demonstrate the superiority of our method over state-of-the-art image restoration methods in both quantitative comparison and visual analysis.

Abstract:
Arbitrary style transfer is a significant topic with research value and application prospects. A desired style transfer, given a content image and a reference style painting, would render the content image with the color tone and vivid stroke patterns of the style painting while simultaneously maintaining the detailed content structure information. Style transfer approaches initially learn content and style representations of the content and style references and then generate the stylized images guided by these representations. In this paper, we propose the multi-adaptation network, which involves two self-adaptation (SA) modules and one co-adaptation (CA) module: the SA modules adaptively disentangle the content and style representations, i.e., the content SA module uses position-wise self-attention to enhance the content representation and the style SA module uses channel-wise self-attention to enhance the style representation; the CA module rearranges the distribution of the style representation based on the content representation distribution by calculating the local similarity between the disentangled content and style features in a non-local fashion. Moreover, a new disentanglement loss function enables our network to extract main style patterns and exact content structures to adapt to various input images, respectively. Various qualitative and quantitative experiments demonstrate that the proposed multi-adaptation network leads to better results than the state-of-the-art style transfer methods.

Abstract:
Image translation across diverse domains has attracted more and more attention. Existing multi-domain image-to-image translation algorithms only learn the features of the complete image without considering specific features of local instances. To ensure the important instance to be more realistically translated, we propose a cross-granularity learning model for multi-domain image-to-image translation. We provide detailed procedures to capture the features of instances during the learning process, and specifically learn the relationship between style of the global image and the style of an instance on the image through the enforcing of the cross-granularity consistency. In our design, we only need one generator to perform the instance-aware multi-domain image translation. Our extensive experiments on several multi-domain image-to-image translation datasets show that our proposed method can achieve superior performance compared with the state-of-the-art approaches.

Abstract:
Answering queries with semantic concepts has long been the mainstream approach for video search. Only recently has its performance been surpassed by the concept-free approach, which embeds queries in a joint space with videos. Nevertheless, the embedded features as well as the search results are not interpretable, hindering subsequent steps in video browsing and query reformulation. This paper integrates feature embedding and concept interpretation into a neural network for unified dual-task learning. In this way, an embedding is associated with a list of semantic concepts as an interpretation of video content. This paper empirically demonstrates that, by using either the embedding features or concepts, considerable search improvement is attainable on TRECVid benchmarked datasets. Concepts are not only effective in pruning false positive videos, but also highly complementary to concept-free search, leading to a large margin of improvement compared to state-of-the-art approaches.

Affiliations: School of Computer Science and Technology, University of Chinese Academy of Sciences, China ; Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, China ; State Key Lab. of Information Security, Institute of Information Engineering, CAS; School of Cyber Security, China ; Key Lab. of IIP, Inst. of Comput. Tech., CAS; Sch. of Computer Sci. and Tech., UCAS; Key Lab. of BDKM, CAS; Peng Cheng Lab.

Abstract:
Nowadays, click-through rate (CTR) prediction has achieved great success in online advertising. However, making desirable predictions for unseen ads is still challenging, which is known as the cold-start problem. To address such a problem in CTR prediction, meta-learning methods have recently emerged as a popular direction. In these approaches, the predictions for each user/item are regarded as individual tasks, and a meta-learner is trained on them to implement zero-shot/few-shot learning for unknown tasks. Though these approaches have effectively alleviated the cold-start problem, two facts have not received enough attention: 1) the diversity of the task difficulty and 2) the perturbation of the task distribution. In this paper, we propose an adaptive loss that ensures the consistency between the task weight and difficulty. Interestingly, the loss function can also be viewed as a description of the worst-case performance under distribution perturbation. Moreover, we develop an algorithm, under the framework of gradient descent with max-oracle (GDmax), to minimize such an adaptive loss. Then we prove that the algorithm returns a stationary point of the adaptive loss. Finally, we implement our method on top of the meta-embedding framework and conduct experiments on three real-world datasets. The experiments show that our proposed method significantly improves the predictions in the cold-start scenario.

Abstract:
Light-field (LF) cameras hold great promise for passive/general depth estimation thanks to their high angular resolution, yet they suffer from a small baseline for distant regions. While a stereo solution with a large baseline is better at handling distant scenarios, its limited angular resolution becomes a problem for near objects. Aiming for an all-in-depth solution, we propose a cross-baseline LF camera using a commercial LF camera and a monocular camera, which naturally form a 'stereo camera' that provides a compensated baseline for the LF camera. The idea is simple yet non-trivial, due to the significant angular resolution gap and baseline gap between LF and stereo cameras.

Abstract:
In this paper, a new deep incomplete multi-view clustering network, called DIMC-net, is proposed to address the challenge of multi-view clustering on missing views. In particular, DIMC-net designs several view-specific encoders to extract the high-level information of multiple views and introduces a fusion graph based constraint to explore the local geometric information of data. To reduce the negative influence of missing views, a weighted fusion layer is introduced to obtain the consensus representation shared by all views. Moreover, a clustering layer is introduced to guarantee that the obtained consensus representation is the best one for the clustering task. Compared with the existing deep learning based approaches, DIMC-net is more flexible and efficient since it can handle all kinds of incomplete cases and directly produce the clustering results. Experimental results show that DIMC-net achieves significant improvement over state-of-the-art incomplete multi-view clustering methods.
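
One way to picture a weighted fusion layer that tolerates missing views is a weighted average taken over the available views only. The sketch below is an assumption-laden illustration: the abstract does not say how DIMC-net defines its weights, and `fuse_views`, its arguments, and the sample values are hypothetical.

```python
def fuse_views(view_features, weights):
    """Fuse per-view feature vectors into one consensus vector.

    view_features: list with one feature vector per view, or None if the
                   view is missing for this sample.
    weights:       one non-negative weight per view (assumed form; the
                   abstract does not specify how weights are obtained).
    """
    available = [(f, w) for f, w in zip(view_features, weights) if f is not None]
    total = sum(w for _, w in available)
    if not available or total <= 0:
        raise ValueError("at least one available view with positive weight is required")
    dim = len(available[0][0])
    # Missing views contribute nothing; the rest are renormalized by `total`.
    return [sum(w * f[d] for f, w in available) / total for d in range(dim)]


# Hypothetical sample whose second view is missing.
consensus = fuse_views([[1.0, 2.0], None, [3.0, 4.0]], [1.0, 5.0, 3.0])
# consensus == [2.5, 3.5]
```

Renormalizing by the weights of the available views keeps the consensus vector on the same scale regardless of how many views are missing.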

Abstract:
Visual Grounding is the task of associating entities in a natural language sentence with objects in an image. In this paper, we formulate visual grounding as a graph matching problem to find node correspondences between a visual scene graph and a language scene graph. These two graphs are heterogeneous, representing structure layouts of the sentence and image, respectively. We learn unified contextual node representations of the two graphs by using a cross-modal graph convolutional network to reduce their discrepancy. The graph matching is thus relaxed as a linear assignment problem because the learned node representations characterize both node information and structure information. A permutation loss and a semantic cycle-consistency loss are further introduced to solve the linear assignment problem with or without ground-truth correspondences. Experimental results on two visual grounding tasks, i.e., referring expression comprehension and phrase localization, demonstrate the effectiveness of our method.
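
Once node representations are available, the linear assignment step amounts to choosing the one-to-one matching that maximizes total node similarity. The following sketch assumes dot-product similarity and equally sized graphs, and it searches all permutations by brute force, which is only practical for tiny graphs; none of this is taken from the paper, and the function names and embeddings are hypothetical.

```python
from itertools import permutations


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def best_assignment(language_nodes, visual_nodes):
    """Return match[i] = index of the visual node assigned to language node i.

    Brute-force linear assignment: tries every permutation and keeps the one
    with the highest summed similarity. Illustration only (O(n!) time).
    """
    n = len(language_nodes)
    if n != len(visual_nodes):
        raise ValueError("this sketch assumes graphs with the same number of nodes")
    score = [[dot(l, v) for v in visual_nodes] for l in language_nodes]
    best = max(permutations(range(n)),
               key=lambda p: sum(score[i][p[i]] for i in range(n)))
    return list(best)


# Hypothetical 2-D node embeddings.
match = best_assignment([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
# match == [1, 0]
```

A practical system would replace the permutation search with a polynomial-time solver such as the Hungarian algorithm.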

Abstract:
Most current image captioning systems focus on describing general image content, and lack background knowledge to deeply understand the image, such as exact named entities or concrete events. In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image. However, due to the length of news articles, previous works only employ news articles at the coarse article or sentence level, which is not fine-grained enough to refine relevant events and choose named entities accurately. To overcome these limitations, we propose an Information Concentrated Entity-aware news image CAPtioning (ICECAP) model, which progressively concentrates on relevant textual information within the corresponding news article from the sentence level to the word level. Our model first creates coarse concentration on relevant sentences using a cross-modality retrieval model and then generates captions by further concentrating on relevant words within the sentences. Extensive experiments on both BreakingNews and GoodNews datasets demonstrate the effectiveness of our proposed method, which outperforms other state-of-the-art methods.

Abstract:
Although significant progress has been made in generating images from text by using generative adversarial networks (GANs), it is still challenging to deal with long text that contains complex semantic information, such as recipes. This paper focuses on generating images with high visual realism and semantic consistency from the complex text of recipes. To achieve this, we propose a GAN-based method termed ChefGAN. The critical concept of ChefGAN is that a joint image-recipe embedding model is used before the generation task to provide high-quality representations of recipes, and it acts as an extra regularization during the generation to improve semantic consistency. Two modules are designed for this: an image-text embedding module (ITEM) and a cascaded image generation module (CIGM). The generation process is carried out in 3 steps: (1) Two encoders in ITEM are trained simultaneously to generate similar representations for each image-recipe pair. (2) CIGM generates images according to the representations from ITEM's text encoder. (3) The generated image is fed into ITEM's image encoder to calculate the similarity with the given recipe. This process can provide an additional regularization effect beyond the impact of a discriminator. To facilitate convergence, we apply a two-stage training strategy, which generates an image with low resolution and then one with high resolution in the CIGM module. Compared with other representative state-of-the-art methods, ChefGAN demonstrates better performance in both visual realism and semantic consistency.

Abstract:
Lipreading is an impressive technique and there has been a definite improvement in accuracy in recent years. However, existing methods for lipreading mainly build on autoregressive (AR) models, which generate target tokens one by one and suffer from high inference latency. To break through this constraint, we propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously. NAR lipreading is a challenging task that has many difficulties: 1) the discrepancy of sequence lengths between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks the correlation across time, which leads to a poor approximation of the target distribution; 3) the feature representation ability of the encoder can be weak due to the lack of an effective alignment mechanism; and 4) the removal of the AR language model exacerbates the inherent ambiguity problem of lipreading. Thus, in this paper, we introduce three methods to reduce the gap between FastLR and the AR model: 1) to address challenges 1 and 2, we leverage an integrate-and-fire (I&F) module to model the correspondence between source video frames and the output text sequence. 2) To tackle challenge 3, we add an auxiliary connectionist temporal classification (CTC) decoder to the top of the encoder and optimize it with an extra CTC loss. We also add an auxiliary autoregressive decoder to help the feature extraction of the encoder. 3) To overcome challenge 4, we propose a novel Noisy Parallel Decoding (NPD) for I&F and bring Byte-Pair Encoding (BPE) into lipreading. Our experiments show that FastLR achieves a speedup of up to 10.97× compared with the state-of-the-art lipreading model, with slight absolute WER increases of 1.5% and 5.5% on the GRID and LRS2 lipreading datasets respectively, which demonstrates the effectiveness of our proposed method.
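
The general integrate-and-fire idea can be sketched as follows: each source frame carries a weight, weights are accumulated frame by frame, and an output vector is emitted (fired) whenever the accumulated weight reaches a threshold, so the number of outputs follows from the weights rather than from a separate length predictor. This is a simplified illustration under assumed details (threshold of 1.0, weighted-sum aggregation, leftover weight carried to the next output); the abstract does not describe the internals of FastLR's I&F module, and all names and values are hypothetical.

```python
def integrate_and_fire(frames, weights, threshold=1.0):
    """Aggregate frame features into output vectors by accumulating weights.

    frames:  list of equal-length feature vectors, one per source frame.
    weights: one non-negative weight per frame.
    Returns the list of fired output vectors.
    """
    if threshold <= 0 or any(w < 0 for w in weights):
        raise ValueError("threshold must be positive and weights non-negative")
    if not frames:
        return []
    dim = len(frames[0])
    outputs = []
    acc_weight = 0.0
    acc_vector = [0.0] * dim
    for feature, weight in zip(frames, weights):
        # Fire as long as this frame's weight pushes the total to the threshold.
        while acc_weight + weight >= threshold:
            used = threshold - acc_weight
            outputs.append([a + used * f for a, f in zip(acc_vector, feature)])
            weight -= used
            acc_weight = 0.0
            acc_vector = [0.0] * dim
        # Carry the remaining weight of this frame into the next output.
        acc_weight += weight
        acc_vector = [a + weight * f for a, f in zip(acc_vector, feature)]
    return outputs


# Hypothetical 1-D frame features; the weights sum to 2.0, so two outputs fire.
tokens = integrate_and_fire([[1.0], [3.0], [2.0], [4.0]], [0.5, 0.5, 0.25, 0.75])
# tokens == [[2.0], [3.5]]
```

In this example the weights sum to 2.0 and exactly two output vectors are produced, which is how such a module ties the output length to the source frames.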

Abstract:
We present a novel real-time line segment detection scheme called Line Graph Neural Network (LGNN). Existing approaches require a computationally expensive verification or postprocessing step. Our LGNN employs a deep convolutional neural network (DCNN) for proposing line segments directly, with a graph neural network (GNN) module for reasoning about their connectivity. Specifically, LGNN exploits a new quadruplet representation for each segment, where the GNN module takes the predicted candidates as vertices and constructs a sparse graph to enforce structural context. Compared with the state of the art, LGNN achieves near real-time performance without compromising accuracy. LGNN further enables time-sensitive 3D applications. When a 3D point cloud is accessible, we present a multi-modal line segment classification technique for extracting a 3D wireframe of the environment robustly and efficiently.

Abstract:
We have built a scene-segmented video information annotation system and upgraded it to version 2.0. The system imports a user-selected video and splits it into scene units. Each scene clip is annotated by integrating visual features derived by state-of-the-art deep learning techniques. The proposed system uses a multiview deep convolutional neural network for video segmentation and a supervised movie caption model for video annotation. The two functionalities are installed in two different sub-systems and connected through the web interface. The web interface allows connecting to external content providers in order to expand the capability of the system.

Abstract:
Real-time in-match soccer statistics provide continuous tracking of soccer ball and player positions and speeds, enabling advanced analytics. Currently, only elite soccer leagues have the luxury of tracking in-match soccer statistics operated with a large number of trained personnel. In this work, we present an Automated In-match Soccer Analysis System (AI-SAS), using a domain-knowledge-based multi-view global tracking. This system tracks player team, position, and speed automatically, providing real-time in-match team- and individual-level statistics and analyses. In comparison with the latest soccer analysis systems, AI-SAS is more scalable in streaming multiple video sources for real-time process and more flexible in hosting plug-and-play deep-learning-based tracking-by-detection algorithms. The global multi-view tracking also overcomes the single-view limitation and improves the tracking accuracy.

Abstract:
We present TindART, a comprehensive visual arts recommender system. TindART leverages real-time user input to build a user-centric preference model based on content and demographic features. Our system is coupled with visual analytics controls that allow users to gain a deeper understanding of their art taste and further refine their personal recommendation model. The content-based features in TindART are extracted using a multi-task learning deep neural network which accounts for a link between multiple descriptive attributes and the content they represent. Our demographic engine is powered by social media integrations such as the Google, Facebook and Twitter profiles that users can log in with. Both the content and demographics power a recommender system whose decision-making process is visualized through our web t-SNE implementation. TindART is live and available at: https://tindart.net/.

Abstract:
To provide more discriminative feedback for the second language (L2) learners to better identify their mispronunciation, we propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT). The speech exaggeration is realized by an emphatic speech generation neural network based on Tacotron, while the visual exaggeration is accomplished by ADC Viseme Blending, namely increasing Amplitude of movement, extending the phone's Duration and enhancing the color Contrast. User studies show that exaggerated feedback outperforms non-exaggerated version on helping learners with pronunciation identification and pronunciation improvement.

Abstract:
We consider the problem of building semantic relationships of unseen entities from free-form multi-modal sources. The proposed intelligent agent understands semantic properties by (1) creating logical segments from sources, (2) finding interacting objects, and (3) inferring their interaction actions using (4) extracted textual, auditory, visual, and tonal information. The conversational dialogue discourses are automatically mapped to interacting co-located objects, and fused with their Kinetic action embeddings at each scene of occurrence. This generates a combined probability distribution representation for interacting entities spanning every semantic relation class. Using these probabilities, we create knowledge graphs capable of answering semantic queries and inferring missing properties in a given context.

Abstract:
Video anomaly detection (VAD) is currently a challenging task due to the complexity of "anomaly" as well as the lack of labor-intensive temporal annotations. In this paper, we propose an end-to-end Global Information Guided (GIG) anomaly detection framework for anomaly detection using the video-level annotations (i.e., weak labels). We propose to first mine the global pattern cues by leveraging the weak labels in a GIG module. Then we build a spatial reasoning module to measure the relevance between vectors in spatial domain with the global cue vectors, and select the most related feature vectors for temporal anomaly detection. The experimental results on the CityScene challenge demonstrate the effectiveness of our model.

Abstract:
We discuss the design and evaluation of machine learning algorithms that provide users with more control on the multimedia information they share. We introduce privacy threats for multimedia data and key features of privacy protection. We cover privacy threats and mitigating actions for images, videos, and motion-sensor data from mobile and wearable devices, and their protection from unwanted, automatic inferences. The tutorial offers theoretical explanations followed by examples with software developed by the presenters and distributed as open source.

Abstract:
Single image de-noising is an important yet under-explored task that estimates the underlying clean image from its noisy observation. It poses great challenges in balancing over-de-noising (e.g., mistakenly removing texture details in noise-free regions) and under-de-noising (e.g., leaving noisy points). Existing works solely treat the removal of noise from images as a process of pixel-wise regression and fail to preserve image details. In this paper, we propose a Staged Memory Network (SMNet) consisting of a noise memory stage and an image memory stage for explicitly exploring the staged memories of our network in single image de-noising with different noise levels. Specifically, the noise memory stage reveals noise characteristics by using local-global spatial dependencies via an encoder-decoder sub-network composed of dense blocks and noise-aware blocks. Taking the residual between the noisy input image and the prediction of the noise memory stage as input, the image memory stage then produces a noise-free and well-reconstructed output image via a contextual fusion sub-network with contextual blocks and a fusion block. Comprehensive experiments on three tasks (i.e., synthetic data, real data, and blind de-noising) demonstrate that our SMNet achieves significantly better performance than state-of-the-art methods by cleaning noisy images with various densities, scales and intensities while keeping the image details of noise-free regions well-preserved. Moreover, an interpretability analysis is added to further demonstrate the ability of our composed memory stages.

Abstract:
Fine-grained visual recognition, which aims to identify subcategories of the same base-level category, is a challenging task because of its large intra-class variances and small inter-class variances. Human beings can perform object recognition tasks based on not only the visual appearance but also the knowledge from texts, as texts can point out the discriminative parts or characteristics which are always the key to distinguishing different subcategories. This is an involuntary transfer from human textual attention to visual attention, suggesting that texts are able to assist fine-grained recognition. In this paper, we propose a Text-Embedded Bilinear (TEB) model which incorporates texts as extra guidance for fine-grained recognition. Specifically, we first construct a text-embedded network to embed the text feature into the discriminative image feature learning to get an embedded feature. In addition, since the cross-layer part feature interaction and fine-grained feature learning are mutually correlated and can reinforce each other, we also extract a candidate feature from the text encoder and embed it into the inter-layer feature of the image encoder to get an embedded candidate feature. Finally, we utilize a cross-layer bilinear network to fuse the two embedded features. Compared with state-of-the-art methods on the widely used CUB-200-2011 and Oxford Flowers-102 datasets for fine-grained image recognition, our TEB model achieves the best performance in the experiments.

Abstract:
Person re-identification (ReID) aims to match detected pedestrian images from multiple non-overlapping cameras. Most existing methods employ a backbone CNN to extract a vectorized feature representation by performing some global pooling operations (such as global average pooling and global max pooling) on the 3D feature map (i.e., the output of the backbone CNN). Although simple and effective in some situations, the global pooling operation only focuses on the statistical properties and ignores the spatial distribution of the feature map. Hence, it cannot distinguish two feature maps when they have similar response values located in totally different positions. To handle this challenge, a novel method is proposed to learn discriminative spatial features. Firstly, a self-constrained spatial transformer network (SC-STN) is introduced to handle the misalignments caused by detection errors. Then, based on the prior knowledge that the spatial structure of a pedestrian often remains stable along the vertical orientation of images, a novel vertical convolution network (VCN) is proposed to extract spatial features in the vertical direction. Extensive experimental evaluations on several benchmarks demonstrate that the proposed method achieves state-of-the-art performance by introducing only a few parameters to the backbone.
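
The limitation of global pooling noted above is easy to reproduce on a toy example: two feature maps with the same values at different positions yield identical global average and global max pooling outputs. The sketch below uses hypothetical 2x2 single-channel maps; the row-wise profile at the end is only a hand-written stand-in that retains vertical position, not the proposed VCN.

```python
def global_avg_pool(feature_map):
    """Average of all values in a 2D feature map (list of rows)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def global_max_pool(feature_map):
    """Maximum of all values in a 2D feature map."""
    return max(v for row in feature_map for v in row)


def vertical_profile(feature_map):
    """Row-wise averages: one value per vertical position (illustrative only)."""
    return [sum(row) / len(row) for row in feature_map]


# Same responses, located at the top in one map and at the bottom in the other.
top_active = [[4.0, 4.0], [0.0, 0.0]]
bottom_active = [[0.0, 0.0], [4.0, 4.0]]

same_avg = global_avg_pool(top_active) == global_avg_pool(bottom_active)    # True
same_max = global_max_pool(top_active) == global_max_pool(bottom_active)    # True
differs = vertical_profile(top_active) != vertical_profile(bottom_active)   # True
```

Both pooled values coincide (2.0 and 4.0), while any descriptor that keeps the vertical layout separates the two maps.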

Abstract:
Multimodal sentiment analysis is an emerging research field that aims to enable machines to recognize, interpret, and express emotion. Through cross-modal interaction, we can get more comprehensive emotional characteristics of the speaker. Bidirectional Encoder Representations from Transformers (BERT) is an efficient pre-trained language representation model. Fine-tuning it has obtained new state-of-the-art results on eleven natural language processing tasks such as question answering and natural language inference. However, most previous works fine-tune BERT based only on text data, and how to learn a better representation by introducing multimodal information is still worth exploring. In this paper, we propose the Cross-Modal BERT (CM-BERT), which relies on the interaction of the text and audio modalities to fine-tune the pre-trained BERT model. As the core unit of CM-BERT, masked multimodal attention is designed to dynamically adjust the weight of words by combining the information of the text and audio modalities. We evaluate our method on the public multimodal sentiment analysis datasets CMU-MOSI and CMU-MOSEI. The experimental results show that it significantly improves the performance on all the metrics over previous baselines and text-only fine-tuning of BERT. Besides, we visualize the masked multimodal attention and show that it can reasonably adjust the weight of words by introducing audio modality information.

Abstract:
Domain adaptive person Re-Identification (ReID) is challenging owing to the domain gap and the shortage of annotations on target scenarios. To handle these two challenges, this paper proposes a coupling optimization method consisting of Domain-Invariant Mapping (DIM) and Global-Local distance Optimization (GLO), which address the two challenges respectively. Different from previous methods that transfer knowledge in two stages, DIM achieves a more efficient one-stage knowledge transfer by mapping images in labeled and unlabeled datasets to a shared feature space. GLO is designed to train the ReID model in an unsupervised setting on the target domain. Instead of relying on existing optimization strategies designed for supervised training, GLO involves more images in distance optimization, and achieves better robustness to noisy label prediction. GLO also integrates distance optimizations in both the global dataset and the local training batch, and thus exhibits better training efficiency. Extensive experiments on three large-scale datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, show that our coupling optimization outperforms state-of-the-art methods by a large margin. Our method also works well in unsupervised training, and even outperforms several recent domain adaptive methods.

Abstract:
UAV tracking is usually challenged by dual-dynamic disturbances that arise not only from diverse moving targets but also from a moving camera, leading to a more serious model drift issue than in traditional visual tracking. In this work, we propose to alleviate this issue with distance-injected overlap maximization. Our idea is to improve the accuracy of target localization by deriving a conceptually simple target localization loss and a global feature recalibration scheme in a mutually reinforcing way. In particular, the target localization loss is designed by simply incorporating the normalized distance of the target offset into the generic semantic IoU loss, resulting in the distance-injected semantic IoU loss, whose minimal solution can alleviate the drift problem caused by camera motion. Moreover, the deep feature extractor is reconstructed and alternated with a feature recalibration network, which can leverage global information to recalibrate significant features and suppress negligible features. Followed by multi-scale feature concatenation, the proposed tracker can improve the discriminative capability of the feature representation for UAV targets on the fly. Extensive experimental results on four benchmarks, i.e., UAV123, UAVDT, DTB70, and VisDrone, demonstrate the superiority of the proposed tracker over existing state-of-the-art methods for UAV tracking.
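
The abstract does not give the exact form of the distance-injected loss, so the following is only a minimal sketch under stated assumptions: it adds a normalized center-offset penalty to a plain geometric IoU loss, in the style of the widely used Distance-IoU formulation. The plain IoU stands in for the "semantic IoU" term, the normalizer (squared diagonal of the smallest enclosing box) is an assumption, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def distance_injected_iou_loss(pred, target):
    """Assumed form: (1 - IoU) + normalized squared center distance."""
    # Squared distance between the two box centers (the target offset).
    pcx, pcy = (pred[0] + pred[2]) / 2.0, (pred[1] + pred[3]) / 2.0
    tcx, tcy = (target[0] + target[2]) / 2.0, (target[1] + target[3]) / 2.0
    center_dist_sq = (pcx - tcx) ** 2 + (pcy - tcy) ** 2
    # Normalize by the squared diagonal of the smallest enclosing box.
    ex1, ey1 = min(pred[0], target[0]), min(pred[1], target[1])
    ex2, ey2 = max(pred[2], target[2]), max(pred[3], target[3])
    diag_sq = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    penalty = center_dist_sq / diag_sq if diag_sq > 0 else 0.0
    return 1.0 - iou(pred, target) + penalty


target_box = (0.0, 0.0, 2.0, 2.0)
near_miss = (3.0, 0.0, 5.0, 2.0)   # no overlap, small offset
far_miss = (10.0, 0.0, 12.0, 2.0)  # no overlap, large offset
loss_near = distance_injected_iou_loss(near_miss, target_box)
loss_far = distance_injected_iou_loss(far_miss, target_box)
```

With a plain IoU loss both non-overlapping predictions would score the same, whereas the distance term ranks the nearer prediction as better, which is one way a loss can keep providing a useful signal after the target has drifted out of the predicted box.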

Abstract:
Representing features at multiple scales is significant for person re-identification (Re-ID). Most existing methods learn multi-scale features by stacking streams and convolutions without considering the cooperation of multiple scales at a granular level. However, most scales are more discriminative only when they integrate other scales as contextual information. We term this contextual multi-scale. In this paper, we propose a novel architecture, namely the contextual multi-scale network (CMSNet), for learning common and contextual multi-scale representations simultaneously. The building block of CMSNet obtains contextual multi-scale representations through bidirectional hierarchical connection groups: the forward hierarchical connection group for stepwise inter-scale information fusion and the backward hierarchical connection group for leap-frogging inter-scale information fusion. Because overly rich scale features without selection can confuse discrimination, we additionally introduce a new channel-wise scale selection module to dynamically select scale features for the corresponding input image. To the best of our knowledge, CMSNet is the most lightweight model for person Re-ID, and it achieves state-of-the-art performance on four commonly used Re-ID datasets, surpassing most large-scale models.

Abstract:
Virtual musicians have become a remarkable phenomenon in the contemporary multimedia arts. However, most of the virtual musicians nowadays have not been endowed with abilities to create their own behaviors, or to perform music with human musicians. In this paper, we firstly create a virtual violinist, who can collaborate with a human pianist to perform chamber music automatically without any intervention. The system incorporates the techniques from various fields, including real-time music tracking, pose estimation, and body movement generation. In our system, the virtual musician's behavior is generated based on the given music audio alone, and such a system results in a low-cost, efficient and scalable way to produce human and virtual musicians' co-performance. The proposed system has been validated in public concerts. Objective quality assessment approaches and possible ways to systematically improve the system are also discussed.

Abstract:
We study the problem of image aesthetic assessment (IAA) and aim to automatically predict the image aesthetic quality in the form of discrete distribution, which is particularly important in IAA due to its nature of having possibly higher diversification of agreement for aesthetics. Previous works show the effectiveness of utilizing object-agnostic attention mechanisms to selectively concentrate on more contributive regions for IAA, e.g., attention is learned to weight pixels of input images when inferring aesthetic values. However, as suggested by some neuropsychology studies, the basic units of human attention are visual objects, i.e., the trace of human attention follows a series of objects. This inspires us to predict contributions of different regions at object level for better aesthetics evaluation. With our framework, region-of-interests (RoIs) are proposed by an object detector, and each RoI is associated with a regional feature vector. Then the contribution of each regional feature to the aesthetics prediction is adaptively determined. To the best of our knowledge, this is the first work modeling object-level attention for IAA and experimental results confirm the superiority of our framework over previous relevant methods.

Abstract:
Video quality assessment (VQA), which is capable of automatically predicting the perceptual quality of source videos especially when reference information is not available, has become a major concern for video service providers due to the growing demand for video quality of experience (QoE) by end users. While significant advances have been achieved from the recent deep learning techniques, they often lead to misleading results in VQA tasks given their limitations on describing 3D spatio-temporal regularities using only fixed temporal frequency. Partially inspired by psychophysical and vision science studies revealing the speed tuning property of neurons in visual cortex when performing motion perception (i.e., sensitive to different temporal frequencies), we propose a novel no-reference (NR) VQA framework named Recurrent-In-Recurrent Network (RIRNet) to incorporate this characteristic to prompt an accurate representation of motion perception in VQA task. By fusing motion information derived from different temporal frequencies in a more efficient way, the resulting temporal modeling scheme is formulated to quantify the temporal motion effect via a hierarchical distortion description. It is found that the proposed framework is in closer agreement with quality perception of the distorted videos since it integrates concepts from motion perception in human visual system (HVS), which is manifested in the designed network structure composed of low- and high- level processing. A holistic validation of our methods on four challenging video quality databases demonstrates the superior performances over the state-of-the-art methods.

Abstract:
Supervised cross-modal hashing has gained a lot of attention recently. However, most existing methods learn binary codes or hash functions in a batch-based scheme, which is inefficient in an online scenario, i.e., data points come in a streaming fashion. Online hashing is a promising solution; however, there still exist several challenges, e.g., how to effectively exploit semantic information, how to discretely solve the binary optimization problem, how to efficiently update hash codes and hash functions. To address these issues, in this paper, we propose a novel supervised online cross-modal hashing method, i.e., Label EMbedding ONline hashing, LEMON for short. It builds a label embedding framework including label similarity preserving and label reconstructing, which may generate discriminative binary codes and reduce the computational complexity. Furthermore, it not only preserves the pairwise similarity of incoming data, but also establishes a connection between newly coming data and existing data by the inner product minimization on a block similarity matrix. In the light of this, it can exploit more similarity information and make the optimization less sensitive to incoming data, leading to effective binary codes. In addition, we design a discrete optimization algorithm to solve the binary optimization problem without relaxation. Therefore, the quantization error can be reduced. Moreover, its computational complexity is only relevant to the size of incoming data, making it very efficient and scalable to large-scale datasets. Extensive experimental results on three benchmark datasets demonstrate that LEMON outperforms some state-of-the-art offline and online cross-modal hashing methods in terms of accuracy and efficiency.

Abstract:
We study an important, yet largely unexplored problem of large-scale cross-modal visual localization by matching ground RGB images to a geo-referenced aerial LIDAR 3D point cloud (rendered as depth images). Prior works were demonstrated on small datasets and did not lend themselves to scaling up for large-scale applications. To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km^2 area) of RGB and aerial LIDAR depth images. We propose a novel joint embedding based method that effectively combines the appearance and semantic cues from both modalities to handle drastic cross-modal variations. Experiments on the proposed dataset show that our model achieves a strong result of a median rank of 5 in matching across a large test set of 50K location pairs collected from a 14 km^2 area. This represents a significant advancement over prior works in performance and scale. We conclude with qualitative results to highlight the challenging nature of this task and the benefits of the proposed model. Our work provides a foundation for further research in cross-modal visual localization.

Abstract:
In this paper, we explore the task of generating photo-realistic face images from hand-drawn sketches. Existing image-to-image translation methods require a large-scale dataset of paired sketches and images for supervision. They typically utilize synthesized edge maps of face images as training data. However, these synthesized edge maps strictly align with the edges of the corresponding face images, which limits their generalization ability to real hand-drawn sketches with vast stroke diversity. To address this problem, we propose DeepFacePencil, an effective tool that is able to generate photo-realistic face images from hand-drawn sketches, based on a novel dual generator image translation network during training. A novel spatial attention pooling (SAP) is designed to adaptively handle spatially varying stroke distortions, in order to support various stroke styles and different levels of detail. We conduct extensive experiments, and the results demonstrate the superiority of our model over existing methods in both image quality and model generalization to hand-drawn sketches.

Abstract:
Image-text matching is a vital yet challenging task in the field of multimedia analysis. Over the past decades, great efforts have been made to bridge the semantic gap between the visual and textual modalities. Despite the significance and value, most prior work is still confronted with a multi-view description challenge, i.e., how to align an image to multiple textual descriptions with semantic diversity. Toward this end, we present a novel context-aware multi-view summarization network to summarize context-enhanced visual region information from multiple views. To be more specific, we design an adaptive gating self-attention module to extract representations of visual regions and words. By controlling the internal information flow, we are able to adaptively capture context information. Afterwards, we introduce a summarization module with a diversity regularization to aggregate region-level features into image-level ones from different perspectives. Ultimately, we devise a multi-view matching scheme to match multi-view image features with corresponding text ones. To justify our work, we have conducted extensive experiments on two benchmark datasets, i.e., Flickr30K and MS-COCO, which demonstrates the superiority of our model as compared to several state-of-the-art baselines.

Abstract:
Understanding food recipe requires anticipating the implicit causal effects of cooking actions, such that the recipe can be converted into a graph describing the temporal workflow of the recipe. This is a non-trivial task that involves common-sense reasoning. However, existing efforts rely on hand-crafted features to extract the workflow graph from recipes due to the lack of large-scale labeled datasets. Moreover, they fail to utilize the cooking images, which constitute an important part of food recipes. In this paper, we build MM-ReS, the first large-scale dataset for cooking workflow construction, consisting of 9,850 recipes with human-labeled workflow graphs. Cooking steps are multi-modal, featuring both text instructions and cooking images. We then propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow, which achieved over 20% performance gain over existing hand-crafted baselines.

Abstract:
There are many domains where the temporal dimension is critical to unveil how different modalities, such as images and texts, are correlated. Notably, in the social media domain, information is constantly evolving over time according to the events that take place in the real world. In this work, we seek highly expressive loss functions that allow the encoding of temporal traits of data into cross-modal embedding spaces. To achieve this goal, we propose to steer the learning procedure of such an embedding through a set of adaptively enforced temporal constraints. In particular, we propose a new formulation of the triplet loss function, where the traditional static margin is superseded by a novel temporally adaptive maximum margin function. This redesign of the static margin formulation allows the embedding to effectively capture not only the semantic correlations across data modalities, but also the fine-grained temporal correlations of the data. Our experiments confirm the effectiveness of our model in structuring different modalities, while organizing data according to temporal correlations. Moreover, we experimentally highlight how these embeddings can be used for multimedia understanding.
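
To make the idea of replacing a static margin with a temporally adaptive one concrete, here is a minimal sketch. The abstract does not specify the margin function, so the saturating exponential used below, its parameters `max_margin` and `tau`, and all function names are assumptions chosen only for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def temporal_margin(t_anchor, t_negative, max_margin=1.0, tau=30.0):
    """Hypothetical margin that grows with the time gap and saturates at max_margin."""
    gap = abs(t_anchor - t_negative)
    return max_margin * (1.0 - math.exp(-gap / tau))


def temporal_triplet_loss(anchor, positive, negative, t_anchor, t_negative):
    """Standard hinge-style triplet loss, with the static margin replaced
    by the time-dependent margin defined above."""
    margin = temporal_margin(t_anchor, t_negative)
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor, positive, negative = [0.0, 0.0], [0.1, 0.0], [0.5, 0.0]
# Same embeddings, but the negative is either contemporary or 300 time units away.
loss_same_time = temporal_triplet_loss(anchor, positive, negative, 0.0, 0.0)
loss_far_in_time = temporal_triplet_loss(anchor, positive, negative, 0.0, 300.0)
```

Under this assumed margin, a negative that is distant in time must be pushed farther from the anchor than a contemporary one before the loss reaches zero, so temporal distance becomes reflected in the geometry of the embedding space.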

Abstract:
In CNN-based object detectors, feature pyramids are widely exploited to alleviate the problem of scale variation across object instances. These object detectors, which strengthen features via a top-down pathway and lateral connections, mainly enrich the semantic information of low-level features, but ignore the enhancement of high-level features. This can lead to an imbalance between different levels of features, in particular a serious lack of detailed information in the high-level features, which makes it difficult to obtain accurate bounding boxes. In this paper, we introduce a novel two-pronged transductive idea to explore the relationship among different layers in both backward and forward directions, which can enrich the semantic information of low-level features and the detailed information of high-level features at the same time. Under the guidance of the two-pronged idea, we propose a Two-Pronged Network (TPNet) to achieve bidirectional transfer between high-level features and low-level features, which is useful for accurately detecting objects at different scales. Furthermore, due to the distribution imbalance between hard and easy samples in single-stage detectors, the gradient of the localization loss is always dominated by the hard examples that have poor localization accuracy. This biases the model toward the hard samples. Therefore, in our TPNet, an adaptive IoU-based localization loss, named Rectified IoU (RIoU) loss, is proposed to rectify the gradients of each kind of sample. The Rectified IoU loss increases the gradients of examples with high IoU while suppressing the gradients of examples with low IoU, which can improve the overall localization accuracy of the model. Extensive experiments demonstrate the superiority of our TPNet and RIoU loss.

Abstract:
Partial domain adaptation (PDA) has attracted considerable attention as it deals with a realistic and challenging problem in which the source domain label space subsumes that of the target domain. Most conventional domain adaptation (DA) efforts concentrate on learning domain-invariant features to mitigate the distribution disparity across domains. However, for PDA it is crucial to explicitly alleviate the negative influence caused by the irrelevant source domain categories. In this work, we propose an Adaptively-Accumulated Knowledge Transfer framework (A^2KT) to align the relevant categories across two domains for effective domain adaptation. Specifically, an adaptively-accumulated mechanism is explored to gradually filter out the most confident target samples and their corresponding source categories, promoting positive transfer with more knowledge across two domains. Moreover, a dual distinct classifier architecture consisting of a prototype classifier and a multilayer perceptron classifier is built to capture intrinsic data distribution knowledge across domains from various perspectives. By maximizing the inter-class center-wise discrepancy and minimizing the intra-class sample-wise compactness, the proposed model is able to obtain more domain-invariant and task-specific discriminative representations of the data from the shared categories. Comprehensive experiments on several partial domain adaptation benchmarks demonstrate the effectiveness of our proposed model, compared with the state-of-the-art PDA methods.

Abstract:
Despite significant advances in deep learning based object detection in recent years, training effective detectors in a small data regime remains an open challenge. This is very important since labelling training data for object detection is often very expensive and time-consuming. In this paper, we investigate the problem of few-shot object detection, where a detector has access to only limited amounts of annotated data. Based on the meta-learning principle, we propose a new meta-learning framework for object detection named "Meta-RCNN", which learns the ability to perform few-shot detection via meta-learning. Specifically, Meta-RCNN learns an object detector in an episodic learning paradigm on the (meta) training data. This learning scheme helps acquire a prior which enables Meta-RCNN to do few-shot detection on novel tasks. Built on top of the popular Faster RCNN detector, in Meta-RCNN, both the Region Proposal Network (RPN) and the object classification branch are meta-learned. The meta-trained RPN learns to provide class-specific proposals, while the object classifier learns to do few-shot classification. The novel loss objectives and learning strategy of Meta-RCNN can be trained in an end-to-end manner. We demonstrate the effectiveness of Meta-RCNN in few-shot detection on three datasets (Pascal-VOC, ImageNet-LOC and MSCOCO) with promising results.

Abstract:
In this study, we address image retargeting, which is a task that adjusts input images to arbitrary sizes. In one of the best-performing methods, called MULTIOP, multiple retargeting operators were combined and retargeted images at each stage were generated to find the optimal sequence of operators that minimized the distance between original and retargeted images. The limitation of this method is its tremendous processing time, which severely hinders its practical use. Therefore, the purpose of this study is to find the optimal combination of operators within a reasonable processing time; we propose a method of predicting the optimal operator for each step using a reinforcement learning agent. The technical contributions of this study are as follows. Firstly, we propose a reward based on self-play, which is insensitive to the large variance in the content-dependent distance measured in MULTIOP. Secondly, we propose to dynamically change the loss weight for each action to prevent the algorithm from falling into a local optimum and from choosing only the most frequently used operator in its training. Our experiments showed that we achieved multi-operator image retargeting with three orders of magnitude less processing time and the same quality as the original multi-operator-based method, which was the best-performing algorithm in retargeting tasks.

Abstract:
Due to the imaging limitation of depth sensors, high-resolution (HR) depth maps are often difficult to be acquired directly, thus effective depth super-resolution (DSR) algorithms are needed to generate HR output from its low-resolution (LR) counterpart. Previous methods treat all depth regions equally without considering different extents of degradation at region-level, and regard DSR under different scales as independent tasks without considering the modeling of different scales, which impede further performance improvement and practical use of DSR. To alleviate these problems, we propose a deep controllable slicing network from a novel perspective. Specifically, our model is to learn a set of slicing branches in a divide-and-conquer manner, parameterized by a distance-aware weighting scheme to adaptively aggregate different depths in an ensemble. Each branch that specifies a depth slice (e.g., the region in some depth range) tends to yield accurate depth recovery. Meanwhile, a scale-controllable module that extracts depth features under different scales is proposed and inserted into the front of slicing network, and enables finely-grained control of the depth restoration results of slicing network with a scale hyper-parameter. Extensive experiments on synthetic and real-world benchmark datasets demonstrate that our method achieves superior performance.

Abstract:
Scene text recognition (STR) has been extensively studied in the last few years. Many recently proposed methods are specially designed to accommodate the arbitrary shape, layout and orientation of scene texts, but ignore that various font (or writing) styles also pose severe challenges to STR. These methods, where font features and content features of characters are entangled, perform poorly in text recognition on scene images with texts in novel font styles. To address this problem, we explore font-independent features of scene texts via attentional generation of glyphs in a large number of font styles. Specifically, we introduce trainable font embeddings to shape the font styles of generated glyphs, with the image feature of scene text only representing its essential patterns. The generation process is directed by the spatial attention mechanism, which effectively copes with irregular texts and generates higher-quality glyphs than existing image-to-image translation methods. Experiments conducted on several STR benchmarks demonstrate the superiority of our method compared to the state of the art.

Abstract:
Existing semantic segmentation models heavily rely on dense pixel-wise annotations. To reduce the annotation pressure, we focus on a challenging task named zero-shot semantic segmentation, which aims to segment unseen objects with zero annotations. This task can be accomplished by transferring knowledge across categories via semantic word embeddings. In this paper, we propose a novel context-aware feature generation method for zero-shot segmentation named CaGNet. In particular, with the observation that a pixel-wise feature highly depends on its contextual information, we insert a contextual module in a segmentation network to capture the pixel-wise contextual information, which guides the process of generating more diverse and context-aware features from semantic word embeddings. Our method achieves state-of-the-art results on three benchmark datasets for zero-shot segmentation.

Abstract:
This paper presents a DNN bottleneck reinforcement scheme to alleviate the vulnerability of Deep Neural Networks (DNN) against adversarial attacks. Typical DNN classifiers encode the input image into a compressed latent representation more suitable for inference. This information bottleneck makes a trade-off between the image-specific structure and class-specific information in an image. By reinforcing the former while maintaining the latter, any redundant information, be it adversarial or not, should be removed from the latent representation. Hence, this paper proposes to jointly train an auto-encoder (AE) sharing the same encoding weights with the visual classifier. In order to reinforce the information bottleneck, we introduce the multi-scale low-pass objective and multi-scale high-frequency communication for better frequency steering in the network. Unlike existing approaches, our scheme is the first reforming defense per se which keeps the classifier structure untouched without appending any pre-processing head and is trained with clean images only. Extensive experiments on MNIST, CIFAR-10 and ImageNet demonstrate the strong defense of our method against various adversarial attacks.

Abstract:
In this paper, we focus on the semantic image synthesis task that aims at transferring semantic label maps to photo-realistic images. Existing methods lack effective semantic constraints to preserve the semantic information and ignore the structural correlations in both spatial and channel dimensions, leading to unsatisfactory blurry and artifact-prone results. To address these limitations, we propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images with fine details from the input layouts without imposing extra training overhead or modifying the network architectures of existing methods. We also propose two novel modules, i.e., position-wise Spatial Attention Module (SAM) and scale-wise Channel Attention Module (CAM), to capture semantic structure attention in spatial and channel dimensions, respectively. Specifically, SAM selectively correlates the pixels at each position by a spatial attention map, leading to pixels with the same semantic label being related to each other regardless of their spatial distances. Meanwhile, CAM selectively emphasizes the scale-wise features at each channel by a channel attention map, which integrates associated features among all channel maps regardless of their scales. We finally sum the outputs of SAM and CAM to further improve feature representation. Extensive experiments on four challenging datasets show that DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters.

Abstract:
Action recognition is a relatively established task, where given an input sequence of human motion, the goal is to predict its action category. This paper, on the other hand, considers a relatively new problem, which could be thought of as an inverse of action recognition: given a prescribed action type, we aim to generate plausible human motion sequences in 3D. Importantly, the set of generated motions is expected to remain diverse enough to explore the entire action-conditioned motion space; meanwhile, each sampled sequence should faithfully resemble natural human body articulation dynamics. Motivated by these objectives, we follow the physical laws of human kinematics by adopting Lie Algebra theory to represent natural human motions; we also propose a temporal Variational Auto-Encoder (VAE) that encourages a diverse sampling of the motion space. A new 3D human motion dataset, HumanAct12, is also constructed. Empirical experiments over three distinct human motion datasets (including ours) demonstrate the effectiveness of our approach.

Abstract:
Even though the multimedia data is ubiquitous on the web, the scarcity of the annotated data and variety of data modalities hinder their usage by multimedia applications. Heterogeneous domain adaptation (HDA) has therefore arisen to address such limitations by facilitating the knowledge transfer between heterogeneous domains. Existing HDA methods only focus on aligning the cross-domain feature distributions and ignore the importance of maximizing the margin among different classes, which may lead to a sub-optimal classification performance. To tackle this problem, in this paper, we propose the Prototype-Matching Graph Network (PMGN), which gradually explores the domain-invariant class prototype representations. Specifically, we build an end-to-end Graph Prototypical Network, which computes the class prototypes through multiple layers of edge learning, node aggregation, and discrepancy minimization. Our framework utilizes the Swap training strategy to provide adequate supervision for training the edge learning component. Moreover, the proposed PMGN can be equipped with the clustering module that utilises the KL-divergence as a distance metric to reduce the distribution difference between the source and target data. Extensive experiments on three HDA tasks (i.e. object recognition, text-to-image classification, and text categorization) demonstrate the superiority of our approach over the state-of-the-art HDA methods.

Abstract:
Human motion modelling is crucial in many areas such as computer graphics, vision and virtual reality. Acquiring high-quality skeletal motions is difficult due to the need for specialized equipment and laborious manual post-processing, which necessitates maximizing the use of existing data to synthesize new data. However, this is a challenge due to the intrinsic stochasticity of human motion dynamics, manifested in both the short and long terms. In the short term, there is strong randomness within a couple of frames, e.g., one frame followed by multiple possible frames leading to different motion styles; while in the long term, there are non-deterministic action transitions. In this paper, we present Dynamic Future Net, a new deep learning model where we explicitly focus on the aforementioned motion stochasticity by constructing a generative model with non-trivial modelling capacity in temporal stochasticity. Given limited amounts of data, our model can generate a large number of high-quality motions with arbitrary duration, and visually-convincing variations in both space and time. We evaluate our model on a wide range of motions and compare it with the state-of-the-art methods. Both qualitative and quantitative results show the superiority of our method, for its robustness, versatility and high quality.

Abstract:
Recently, extensive works based on convolutional neural network (CNN) have shown great success in single image super-resolution (SISR). In order to improve the SISR performance while reducing the number of model parameters, some methods adopt multiple recursive layers to enhance the intermediate features. However, in the recursive process, these methods only use the output features of current stage as the input of the next stage and neglect the output features of historical stages, which degrades the performance of the recursive blocks. The long-term dependencies can only be learned implicitly during the recursive processes. To address these issues, we propose the memory recursive network (MRNet) to make full use of the output features at each stage. The proposed MRNet utilizes a memory recursive module (MRM) to generate features for each recursive stage, and then these features are fused by our proposed ShuffleConv block. Specifically, MRM adopts a memory updater block to explicitly model the long-term dependencies between the output features of historical recursive stages. The output features from the memory updater will be used as the input of the next recursive stage and will be continuously updated during the recursions. To reduce the number of parameters and ease the training difficulty, we introduce a ShuffleConv module to fuse the features from different recursive stages, which is much more effective than using plain convolutional combinations. Comprehensive experiments demonstrate that the proposed MRNet achieves state-of-the-art SISR performance while using much fewer parameters.

Abstract:
This paper focuses on a novel task named masked faces recognition (MFR), which aims to match masked faces with common faces and is important especially during the global outbreak of COVID-19. It is challenging to identify masked faces for two main reasons. Firstly, there is no large-scale training data and test data with ground truth for MFR. Collecting and annotating millions of masked faces is labor-consuming. Secondly, since most facial cues are occluded by mask, it is necessary to learn representations which are both discriminative and robust to mask wearing. To handle the first challenge, this paper collects two datasets designed for MFR: MFV with 400 pairs of 200 identities for verification, and MFI which contains 4,916 images of 669 identities for identification. As is known, a robust face recognition model needs images of millions of identities to train, and hundreds of identities is far from enough. Hence, MFV and MFI are only considered as test datasets to evaluate algorithms. Besides, a data augmentation method for training data is introduced to automatically generate synthetic masked face images from existing common face datasets. In addition, a novel latent part detection (LPD) model is proposed to locate the latent facial part which is robust to mask wearing, and the latent part is further used to extract discriminative features. The proposed LPD model is trained in an end-to-end manner and only utilizes the original and synthetic training data. Experimental results on MFV, MFI and synthetic masked LFW demonstrate that LPD model generalizes well on both realistic and synthetic masked data and outperforms other methods by a large margin.

Abstract:
Domain adaptation aims at learning a predictive model that can generalize to a new target domain different from the source (training) domain. To mitigate the domain gap, adversarial training has been developed to learn domain-invariant representations. State-of-the-art methods further make use of pseudo labels generated by the source domain classifier to match conditional feature distributions between the source and target domains. However, if the target domain is more complex than the source domain, the pseudo labels cannot reliably characterize the class-conditional structure of the target domain data, undermining prediction performance. To resolve this issue, we propose a Pairwise Similarity Regularization (PSR) approach that exploits cluster structures of the target domain data and minimizes the divergence between the pairwise similarity of the clustering partition and that of the pseudo predictions. Therefore, PSR guarantees that two target instances in the same cluster have the same class prediction and thus eliminates the negative effect of unreliable pseudo labels. Extensive experimental results show that our PSR method boosts current adversarial domain adaptation methods by a large margin on four visual benchmarks. In particular, PSR achieves a remarkable improvement of more than 5% over the state-of-the-art on several hard-to-transfer tasks.
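
The pairwise regularization idea described above can be illustrated with a minimal sketch. The abstract does not specify the divergence or how the two similarity matrices are built, so everything below is an assumption made for illustration: cluster similarity is taken as 1 when two instances share a cluster and 0 otherwise, prediction similarity is the inner product of two softmax vectors, and the divergence is a binary cross-entropy averaged over pairs. The function name `pairwise_similarity_loss` is hypothetical.

```python
import math
from typing import List, Sequence


def pairwise_similarity_loss(
    cluster_ids: Sequence[int],
    probs: Sequence[Sequence[float]],
    eps: float = 1e-7,
) -> float:
    """Average binary cross-entropy between cluster-based and prediction-based
    pairwise similarities (an assumed form, not the paper's exact loss)."""
    n = len(cluster_ids)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            # Target similarity from the clustering partition: 1 if same cluster.
            s = 1.0 if cluster_ids[i] == cluster_ids[j] else 0.0
            # Predicted similarity: probability that both instances get the same class.
            p = sum(a * b for a, b in zip(probs[i], probs[j]))
            p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
            total += -(s * math.log(p) + (1.0 - s) * math.log(1.0 - p))
            pairs += 1
    return total / pairs if pairs else 0.0


clusters = [0, 0, 1, 1]
consistent: List[List[float]] = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
inconsistent: List[List[float]] = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
loss_consistent = pairwise_similarity_loss(clusters, consistent)
loss_inconsistent = pairwise_similarity_loss(clusters, inconsistent)
```

Under these assumptions, predictions that agree with the cluster structure receive a lower loss than predictions that split a cluster across classes, which is the behaviour the regularizer is meant to encourage.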

Abstract:
Video frame synthesis is an important task in computer vision and has drawn great interests in wide applications. However, existing neural network methods do not explicitly impose tensor low-rankness of videos to capture the spatiotemporal correlations in a high-dimensional space, while existing iterative algorithms require hand-crafted parameters and take relatively long running time. In this paper, we propose a novel multi-phase deep neural network Transform-Based Tensor-Net that exploits the low-rank structure of video data in a learned transform domain, which unfolds an Iterative Shrinkage-Thresholding Algorithm (ISTA) for tensor signal recovery. Our design is based on two observations: (i) both linear and nonlinear transforms can be implemented by a neural network layer, and (ii) the soft-thresholding operator corresponds to an activation function. Further, such an unfolding design is able to achieve nearly real-time at the cost of training time and enjoys an interpretable nature as a byproduct. Experimental results on the KTH and UCF-101 datasets show that compared with the state-of-the-art methods, i.e., DVF and Super SloMo, the proposed scheme improves Peak Signal-to-Noise Ratio (PSNR) of video interpolation and prediction by 4.13 dB and 4.26 dB, respectively.
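
Observation (ii) above can be made concrete with a small sketch: the soft-thresholding operator equals a difference of two ReLU activations, and one ISTA iteration is a linear step followed by that activation. The code below uses the classic vector ISTA for the problem of minimizing 0.5·||Ax − y||² + λ·||x||₁, which is a simplification; the tensor formulation and the learned transform of the paper are not reproduced, and the names `soft_threshold` and `ista_step` are illustrative.

```python
from typing import List


def relu(x: float) -> float:
    return max(x, 0.0)


def soft_threshold(x: float, theta: float) -> float:
    """Soft-thresholding sign(x) * max(|x| - theta, 0), for theta >= 0,
    written as a difference of two ReLU activations."""
    return relu(x - theta) - relu(-x - theta)


def ista_step(
    A: List[List[float]], y: List[float], x: List[float], lam: float, step: float
) -> List[float]:
    """One ISTA iteration: gradient step on 0.5*||Ax - y||^2, then soft-thresholding."""
    m, n = len(A), len(x)
    residual = [sum(A[i][j] * x[j] for j in range(n)) - y[i] for i in range(m)]
    grad = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
    return [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n)]


# Toy problem with an identity measurement matrix (assumed data).
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, -0.2, 1.0]
x = [0.0, 0.0, 0.0]
for _ in range(3):
    x = ista_step(A, y, x, lam=0.5, step=1.0)
```

In an unfolded network, each call to `ista_step` would become one phase whose linear map and threshold are trainable parameters rather than fixed quantities.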

Abstract:
Anomaly detection in videos is commonly referred to as the discrimination of events that do not conform to expected behaviors. Most existing methods formulate video anomaly detection as an outlier detection task and establish the concept of normality by minimizing reconstruction loss or prediction loss on training data. However, the performance of these methods drops when they cannot guarantee either higher reconstruction errors for abnormal events or lower prediction errors for normal events. To avoid these problems, we introduce a novel contrastive representation learning task, Cluster Attention Contrast, to establish subcategories of normality as clusters. Specifically, we employ multi-parallel projection layers to project snippet-level video features into multiple discriminative feature spaces. Each of these feature spaces corresponds to a cluster which captures a distinct subcategory of normality. To acquire reliable subcategories, we propose the Cluster Attention Module to draw the cluster attention representation of each snippet, then maximize the agreement of the representations from the same snippet under random data augmentations via momentum contrast. In this manner, we establish a robust concept of normality without any prior assumptions on reconstruction errors or prediction errors. Experiments show our approach achieves state-of-the-art performance on benchmark datasets.

Abstract:
In recent years, the clothing industry has attracted a lot of interest from researchers. Increasing research efforts have been devoted to improving the buyer's shopping experience by suggesting meaningful items to purchase. These efforts result in works aiming at suggesting good matches for clothes, but seem to lack one important aspect: understanding the user's interest. In fact, to suggest something it is first necessary to collect the user's personal interests, or something about his or her previous purchases. Without this information, no personalized suggestion can be made. User interest understanding makes it possible to recognize whether a user is showing interest in a product he or she is looking at, acquiring precious information that can be later leveraged. Usually user interest is associated with facial expressions, but these are known to be easily falsifiable. Moreover, when privacy is a concern, faces are often impossible to exploit. To address all these aspects, we propose an automatic system that aims to recognize the user's interest towards a garment by just looking at body posture and behaviour. To train and evaluate our system we create a body pose interest dataset, named BodyInterest, which consists of 30 users looking at garments for a total of approximately 6 hours of videos. Extensive evaluations show the effectiveness of our proposed method.

Abstract:
Domain adaptation aims to transfer knowledge from the source data with annotations to scarcely-labeled data in the target domain, which has attracted a lot of attention in recent years and facilitated many multimedia applications. Recent approaches have shown the effectiveness of using adversarial learning to reduce the distribution discrepancy between the source and target images by aligning them at both image and instance levels. However, this remains challenging since two domains may have distinct background scenes and different objects. Moreover, complex combinations of objects and a variety of image styles deteriorate the unsupervised cross-domain distribution alignment. To address these challenges, in this paper, we design an end-to-end approach for unsupervised domain adaptation of object detectors. Specifically, we propose a Multi-level Entropy Attention Alignment (MEAA) method that consists of two main components: (1) a Local Uncertainty Attentional Alignment (LUAA) module that helps the model better perceive structure-invariant objects of interest by using information theory to measure the uncertainty of each local region via the entropy of the pixel-wise domain classifier, and (2) a Multi-level Uncertainty-Aware Context Alignment (MUCA) module that enriches domain-invariant information of relevant objects based on the entropy of multi-level domain classifiers. The proposed MEAA is evaluated in four domain-shift object detection scenarios. Experimental results demonstrate state-of-the-art performance on three challenging scenarios and competitive performance on one benchmark dataset.
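
The entropy-based uncertainty measure mentioned above can be sketched as follows. The sketch assumes the pixel-wise domain classifier outputs, for every spatial location, a probability that the location comes from the source domain; the binary entropy of that probability is then a per-location uncertainty score. How the paper turns this score into attention weights is not stated in the abstract, so only the entropy map itself is computed, and the names `binary_entropy` and `entropy_map` are illustrative.

```python
import math
from typing import List


def binary_entropy(p: float, eps: float = 1e-12) -> float:
    """Entropy of a Bernoulli(p) variable in bits, so the value lies in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p)) / math.log(2.0)


def entropy_map(domain_probs: List[List[float]]) -> List[List[float]]:
    """Per-location uncertainty of a (hypothetical) pixel-wise domain classifier."""
    return [[binary_entropy(p) for p in row] for row in domain_probs]


# Assumed 2x2 map of source-domain probabilities.
probs = [[0.5, 0.999], [0.01, 0.5]]
uncertainty = entropy_map(probs)
```

Locations where the classifier outputs 0.5 cannot be attributed to either domain and receive the maximum score of 1, while confidently classified locations receive scores near 0.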

Abstract:
In order to generate images for a given category, existing deep generative models generally rely on abundant training images. However, extensive data acquisition is expensive, and the ability to learn quickly from limited data is required in real-world applications. Also, these existing methods are not well-suited for fast adaptation to a new category. Few-shot image generation, aiming to generate images from only a few images for a new category, has attracted some research interest. In this paper, we propose a Fusing-and-Filling Generative Adversarial Network (F2GAN) to generate realistic and diverse images for a new category with only a few images. In our F2GAN, a fusion generator is designed to fuse the high-level features of conditional images with random interpolation coefficients, and then fill in attended low-level details with a non-local attention module to produce a new image. Moreover, our discriminator can ensure the diversity of generated images by a mode seeking loss and an interpolation regression loss. Extensive experiments on five datasets demonstrate the effectiveness of our proposed method for few-shot image generation.
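
The fusion step described above can be sketched in a few lines. The sketch assumes the random interpolation coefficients are non-negative and sum to one, so the fused feature is a convex combination of the conditional images' features; the abstract does not state this constraint or the sampling distribution, and the helper names are hypothetical. Feature vectors are plain lists standing in for high-level feature maps.

```python
import random
from typing import List


def sample_interpolation_coefficients(k: int, rng: random.Random) -> List[float]:
    """Draw k non-negative coefficients that sum to one (assumed convex weights)."""
    raw = [rng.random() + 1e-12 for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def fuse_features(features: List[List[float]], coeffs: List[float]) -> List[float]:
    """Weighted sum of the conditional images' feature vectors."""
    dim = len(features[0])
    return [sum(c * f[d] for c, f in zip(coeffs, features)) for d in range(dim)]


rng = random.Random(0)
# Three conditional images, each with a toy 2-dimensional high-level feature.
feats = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
coeffs = sample_interpolation_coefficients(len(feats), rng)
fused = fuse_features(feats, coeffs)
```

Drawing new coefficients for each generated image is what would produce diverse outputs from the same few conditional images under this reading.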

Abstract:
Outfit recommendation requires the answers of some challenging outfit compatibility questions such as 'Which pair of boots and school bag go well with my jeans and sweater?'. It is more complicated than conventional similarity search, and needs to consider not only visual aesthetics but also the intrinsic fine-grained and multi-category nature of fashion items. Some existing approaches solve the problem through sequential models or learning pair-wise distances between items. However, most of them only consider coarse category information in defining fashion compatibility while neglecting the fine-grained category information often desired in practical applications. To better define the fashion compatibility and more flexibly meet different needs, we propose a novel problem of learning compatibility among multiple tuples (each consisting of an item and category pair), and recommending fashion items following the category choices from customers. Our contributions include: 1) Designing a Mixed Category Attention Net (MCAN) which integrates both fine-grained and coarse category information into recommendation and learns the compatibility among fashion tuples. MCAN can explicitly and effectively generate diverse and controllable recommendations based on need. 2) Contributing a new dataset IQON, which follows eastern culture and can be used to test the generalization of recommendation systems. Our extensive experiments on a reference dataset Polyvore and our dataset IQON demonstrate that our method significantly outperforms state-of-the-art recommendation methods.

Abstract:
Timely detection of stress is desirable to address the increasingly serious stress problem. Thanks to the rich linguistic expressions and complete historical records on social media, achieving personalized stress detection through social media is feasible and promising. We construct a three-level framework aiming at personalized stress detection based on social media. The framework learns personalized stress representations through increasingly detailed processing, i.e., from the generic mass level, through the group level, to the final individual level. The first, mass level focuses on mining generic stress representations from people's linguistic and visual posts with a two-layer attention mechanism. The second, group level adopts a graph neural network to learn the group-wise characteristics of the group to which an individual belongs. The third, individual level analyzes the individual's personality traits and incorporates them into stress detection. The performance study on 2,059 microblog users shows that our proposed method can achieve over 90% detection accuracy. Furthermore, the extended experiment on a harder personalized sub-dataset demonstrates that our method works better in distinguishing personalized expressions with different latent meanings.

Abstract:
One non-negligible flaw of the convolutional neural networks (CNNs) based single image super-resolution (SISR) models is that most of them are not able to restore high-resolution (HR) images containing sufficient high-frequency information. Worse still, as the depth of CNNs increases, the training easily suffers from the vanishing gradients. These problems hinder the effectiveness of CNNs in SISR. In this paper, we propose the Dual-view Attention Networks to alleviate these problems for SISR. Specifically, we propose the local aware (LA) and global aware (GA) attentions to deal with LR features in unequal manners, which can highlight the high-frequency components and discriminate each feature from LR images in the local and global views, respectively. Furthermore, the local attentive residual-dense (LARD) block that combines the LA attention with multiple residual and dense connections is proposed to fit a deeper yet easy to train architecture. The experimental results verified the effectiveness of our model compared with other state-of-the-art methods.

Abstract:
In Compressive Sensing MRI (CS-MRI), measurement matrix learning has been developed as a promising method for measurement matrix design. Research on the MRI measurement task suggests that the Relative 2-Norm Error (RLNE) of measurement images is imbalanced. However, current learning-based investigations do not examine this imbalance in measurement matrix learning. In this paper, we propose a novel Measurement Matrix Learning via Correlation Reweighting (MML-CR) approach for exploring and solving this problem by optimizing a reweighted model. Specifically, we introduce a reweighting expected minimization model to obtain an essential measurement matrix in k-space. Besides, we propose an example correlation regularizer to prevent trivial solutions when learning the weights. Furthermore, we present an alternating solution and perform convergence analysis for the optimization. We also present quantitative and qualitative experimental results which show that our algorithm outperforms several state-of-the-art measurement methods. Compared with conventional methods, MML-CR achieves better performance on the universal task.

Abstract:
Recent advances in machine and deep learning allow for enhanced retail analytics by applying object detection techniques. However, existing approaches either require laborious installation processes to function or lack precision when customers turn their backs to the installed cameras. In this paper, we present EyeShopper, an innovative system that tracks the gaze of shoppers when facing away from the camera and provides insights about their behavior in physical stores. EyeShopper is readily deployable in existing surveillance systems and robust against low-resolution video inputs. At the same time, its accuracy is comparable to state-of-the-art gaze estimation frameworks that require high-resolution and continuous video inputs to function. Furthermore, EyeShopper is more robust than state-of-the-art gaze tracking techniques for back-of-head images. Extensive evaluation with different real video datasets and a synthetic dataset we produced shows that EyeShopper estimates the gaze of customers with high accuracy.

Abstract:
To achieve effective facial expression recognition (FER), it is of great importance to address various disturbing factors, including pose, illumination, identity, and so on. However, a number of FER databases merely provide the labels of facial expression, identity, and pose, but lack the label information for other disturbing factors. As a result, many methods are only able to cope with one or two disturbing factors, ignoring the heavy entanglement between facial expression and multiple disturbing factors. In this paper, we propose a novel Deep Disturbance-disentangled Learning (DDL) method for FER. DDL is capable of simultaneously and explicitly disentangling multiple disturbing factors by taking advantage of multi-task learning and adversarial transfer learning. The training of DDL involves two stages. First, a Disturbance Feature Extraction Model (DFEM) is pre-trained to perform multi-task learning for classifying multiple disturbing factors on the large-scale face database (which has the label information for various disturbing factors). Second, a Disturbance-Disentangled Model (DDM), which contains a global shared sub-network and two task-specific (i.e., expression and disturbance) sub-networks, is learned to encode the disturbance-disentangled information for expression recognition. The expression sub-network adopts a multi-level attention mechanism to extract expression-specific features, while the disturbance sub-network leverages adversarial transfer learning to extract disturbance-specific features based on the pre-trained DFEM. Experimental results on both the in-the-lab FER databases (including CK+, MMI, and Oulu-CASIA) and the in-the-wild FER databases (including RAF-DB and SFEW) demonstrate the superiority of our proposed method compared with several state-of-the-art methods.

Abstract:
Existing methods on facial expression recognition (FER) are mainly trained in the setting when all expression classes are fixed in advance. However, in real applications, expression classes are becoming increasingly fine-grained and incremental. To deal with sequential expression classes, we can fine-tune or re-train these models, but this often results in poor performance or large computing resources consumption. To address these problems, we develop an Incremental Facial Expression Recognition Network (IExpressNet), which can learn a competitive multi-class classifier at any time with a lower requirement of computing resources. Specifically, IExpressNet consists of two novel components. First, we construct an exemplar set by dynamically selecting representative samples from old expression classes. Then, the exemplar set and new expression classes samples constitute the training set. Second, we design a novel center-expression-distilled loss. As for facial expression in the wild, center-expression-distilled loss enhances the discriminative power of the deeply learned features and prevents catastrophic forgetting. Extensive experiments are conducted on two large-scale FER datasets in the wild, RAF-DB and AffectNet. The results demonstrate the superiority of the proposed method as compared to state-of-the-art incremental learning approaches.

Abstract:
As one of the most important forms of psychological behavior, micro-expression can reveal real emotion. However, the existing labeled micro-expression samples are too limited to train a high-performance micro-expression classifier. Since micro-expression and macro-expression share some similarities in facial muscle movements and texture changes, in this paper we propose a micro-expression recognition framework that leverages macro-expression samples as guidance. Specifically, we first introduce two Expression-Identity Disentangle Networks, named MicroNet and MacroNet, as the feature extractors to disentangle expression-related features for micro- and macro-expression samples. Then MacroNet is fixed and used to guide the fine-tuning of MicroNet in both the label and feature spaces. An adversarial learning strategy and a triplet loss are added at the feature level between MicroNet and MacroNet, so that MicroNet can efficiently capture the shared features of micro-expression and macro-expression samples. A loss inequality regularization is imposed on the label space to make the output of MicroNet converge to that of MacroNet. Comprehensive experiments on three public spontaneous micro-expression databases, i.e., SMIC, CASME2 and SAMM, demonstrate the superiority of the proposed method.
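
For reference, the standard triplet loss named above can be written as a short sketch. The pairing of samples is an assumption: the anchor is taken to be a MicroNet feature, the positive a MacroNet feature of the same expression class, and the negative a MacroNet feature of a different class. The abstract does not specify the distance, margin, or sampling scheme, so the Euclidean distance and the margin value below are illustrative.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 0.2,
) -> float:
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


# Toy 2-dimensional features (assumed values).
micro_feature = [0.0, 0.0]
macro_same_class = [0.0, 1.0]
macro_other_class = [3.0, 4.0]
satisfied = triplet_loss(micro_feature, macro_same_class, macro_other_class)
violated = triplet_loss(micro_feature, macro_other_class, macro_same_class)
```

The loss is zero once the same-class macro feature is closer to the micro feature than the other-class one by at least the margin, and positive otherwise.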

Abstract:
With the prevalence of live video streaming, an online pixelation method for privacy-sensitive objects is urgently needed. Because the detection of privacy-sensitive objects is inaccurate, simply migrating the tracking-by-detection structure used in offline pixelation to the online setting causes problems in target initialization, drifting, and over-pixelation. To cope with this unavoidable detection issue, we propose a novel Privacy-sensitive Objects Pixelation (PsOP) framework for automatic personal privacy filtering during live video streaming. Leveraging pre-trained detection networks, our PsOP is extendable to the pixelation of any potential privacy-sensitive object. Employing embedding networks and the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm as the backbone, our PsOP unifies the pixelation of discriminating and indiscriminating pixelation objects through trajectory generation. Besides improving pixelation accuracy, experimental results on the streaming video data we built show that the proposed PsOP can significantly reduce the over-pixelation ratio in privacy-sensitive object pixelation.

Abstract:
Generating a scene graph to describe the whereabouts and interactions of objects in an image has attracted increasing attention from researchers. Most existing methods explore object-level visual context or bodypart-object cooperation with the message passing structure, which cannot capture the part-aware interaction nature of scene graphs. Normally, a subject interacts with an object through crucial parts of each other. Besides, the correlation among parts within an identical object can also help in predicting objects and their relationships. Hence, both subject and object parts and their intra- and inter-object correlations should be fully considered for scene graph generation. In this paper, we propose a part-aware interactive learning method, which is divided into intra-object and inter-object scenarios. First, we detect objects from an image and further decompose each one into a set of parts. Second, the part-aware graph attention module is proposed to refine part features via intra-object message passing, and the refined features are incorporated for object inference. Third, the visual mutual attention module is designed to precisely discover part-aware correlated visual cues for predicate inference. It can highlight the subject-related object parts and the object-related subject parts during inter-object interactive learning. We demonstrate the superiority of our method over the state of the art on Visual Genome. Ablation studies and visualization further validate its effectiveness.

Abstract:
Video visual relation detection (VidVRD) aims at obtaining not only the trajectories of objects but also the dynamic visual relations between them. It provides abundant information for video understanding and can serve as a bridge between vision and language. Compared with visual relation detection on images, VidVRD requires one additional final step, called visual relation association, which associates relation segments across the time dimension into video relations. This step plays an important role in the task but is less studied. Visual relation association is a difficult task, as the association process is easily affected by inaccurate tracklet detection and relation prediction in the preceding steps. In this paper, we propose a novel relation association method called Multiple Hypothesis Association (MHA). It maintains multiple possible relation hypotheses during the association process in order to tolerate and handle inaccurate or missing results from the preceding steps and generate more accurate video relations. Our experiments on the benchmark datasets (Imagenet-VidVRD and VidOR) show that our method outperforms the state-of-the-art methods.

Abstract:
Current state-of-the-art image captioning systems generally produce a sentence from left to right, and every step is conditioned on the given image and previously generated words. Nevertheless, such autoregressive nature makes the inference process difficult to parallelize and leads to high captioning latency. In this paper, we propose a non-autoregressive approach for faster image caption generation. Technically, low-dimension continuous latent variables are shaped to capture semantic information and word dependencies from extracted image features before sentence decoding. Moreover, we develop an iterative back modification inference algorithm, which continuously refines the latent variables with a look back mechanism and parallelly generates the whole sentence based on the updated latent variables in a constant number of steps. Extensive experiments demonstrate that our method achieves competitive performance compared to prevalent autoregressive captioning models while significantly reducing the decoding time on average.

Abstract:
Game character customization is one of the core features of many recent Role-Playing Games (RPGs), where players can edit the appearance of their in-game characters with their preferences. This paper studies the problem of automatically creating in-game characters with a single photo. In recent literature on this topic, neural networks are introduced to make game engine differentiable and the self-supervised learning is used to predict facial customization parameters. However, in previous methods, the expression parameters and facial identity parameters are highly coupled with each other, making it difficult to model the intrinsic facial features of the character. Besides, the neural network based renderer used in previous methods is also difficult to be extended to multi-view rendering cases. In this paper, considering the above problems, we propose a novel method named "PokerFace-GAN" for neutral face game character auto-creation. We first build a differentiable character renderer which is more flexible than the previous methods in multi-view rendering cases. We then take advantage of the adversarial training to effectively disentangle the expression parameters from the identity parameters and thus generate player-preferred neutral face (expression-less) characters. Since all components of our method are differentiable, our method can be easily trained under a multi-task self-supervised learning paradigm. Experiment results show that our method can generate vivid neutral face game characters that are highly similar to the input photos. The effectiveness of our method is verified by comparison results and ablation studies.

Abstract:
Image colorization is an effective approach to provide plausible colors for grayscale images, yielding more pleasing visual quality. Although exemplar-based colorization approaches provide promising results, they rely only on either semantic colors or global colors from the reference images. In the former case, when the correspondence between the input grayscale image and the reference image is not established, the colors of the reference image cannot be transferred to the input grayscale image successfully. In the latter case, because only global colors are considered, it is hard to produce a color image whose objects have the same color as the reference image when they are semantically related. Thus, an end-to-end colorization network, Gray2ColorNet, is proposed in this work, where a color fusion network based on an attention gating mechanism is designed to accomplish the colorization task. With the proposed method, the semantic colors and global color distribution from the reference image are fused effectively and transferred to the final color images along with the prior knowledge of colors contained in the training data. The experimental results demonstrate the superior colorization performance of the proposed method compared to other state-of-the-art approaches.

Abstract:
This work proposes GangSweep, a new backdoor detection framework that leverages the super reconstructive power of Generative Adversarial Networks (GAN) to detect and "sweep out" neural backdoors. It is motivated by a series of intriguing empirical investigations, revealing that the perturbation masks generated by GAN are persistent and exhibit interesting statistical properties with low shifting variance and large shifting distance in feature space. Compared with the previous solutions, the proposed approach eliminates the reliance on the access to training data, and shows a high degree of robustness and efficiency for detecting and mitigating a wide range of backdoored models with various settings. Moreover, this is the first work that successfully leverages generative networks to defend against advanced neural backdoors with multiple triggers and their polymorphic forms.

Abstract:
Learning the relationship between multi-modal data, e.g., texts, images and videos, is a classic task in the multimedia community. Cross-modal retrieval (CMR) is a typical example where the query and the corresponding results are in different modalities. Yet, a majority of existing works investigate CMR with an ideal assumption that the training samples in every modality are sufficient and complete. In real-world applications, however, this assumption does not always hold. Mismatch is common in multi-modal datasets. There is a high chance that samples in some modalities are either missing or corrupted. As a result, incomplete CMR has become a challenging issue. In this paper, we propose Dual-Aligned Variational Autoencoders (DAVAE) to address the incomplete CMR problem. Specifically, we propose to learn modality-invariant representations for different modalities and use the learned representations for retrieval. We train multiple autoencoders, one for each modality, to learn the latent factors among different modalities. These latent representations are further dual-aligned at the distribution level and the semantic level to alleviate the modality gaps and enhance the discriminability of representations. For missing instances, we leverage generative models to synthesize latent representations for them. Notably, we test our method with different ratios of random incompleteness. Extensive experiments on three datasets verify that our method can consistently outperform state-of-the-art methods.

Abstract:
This paper aims to reduce the time to annotate images for panoptic segmentation, which requires annotating segmentation masks and class labels for all object instances and stuff regions. We formulate our approach as a collaborative process between an annotator and an automated assistant who take turns to jointly annotate an image using a predefined pool of segments. Actions performed by the annotator serve as a strong contextual signal. The assistant intelligently reacts to this signal by annotating other parts of the image on its own, which reduces the amount of work required by the annotator. We perform thorough experiments on the COCO panoptic dataset, both in simulation and with human annotators. These demonstrate that our approach is significantly faster than the recent machine-assisted interface of [Andriluka 18 ACMMM], and 2.4× to 5× faster than manual polygon drawing. Finally, we show on ADE20k that our method can be used to efficiently annotate new datasets, bootstrapping from a very small amount of annotated data.

Abstract:
Cloud rendering is an emerging technology in which rendering-heavy applications run on the cloud server and then stream the rendered contents to the end-user device. High density and high scalability of the cloud rendering services are crucial to support millions of users concurrently and cost-effectively. However, it is still challenging to run Android OS in the cloud smoothly with high density and high scalability without compromising user experience. This paper presents DroidCloud, to the best of our knowledge the first open-source Android cloud rendering solution focusing on scalable design and density optimization (Android is a trademark of Google LLC). To cloudify Android OS, DroidCloud utilizes the vHAL technology in order to support remote devices and remain transparent to Android applications. A flexible rendering scheduling policy is introduced to break the boundary of GPU physical locations. Thus, both remote GPUs and local GPUs can accommodate render tasks by forwarding rendering tasks, making it possible to support multiple Android OSes with GPU acceleration. Besides, to further improve the density, DroidCloud optimizes the resource cost both in a single instance and across instances. We show that DroidCloud can run hundreds of Android OSes on a single Intel Xeon server with GPU acceleration simultaneously, increasing the density by roughly one order of magnitude compared to current cloud gaming systems. Further experimental results demonstrate that DroidCloud can transparently run Android applications at native speed with lower CPU, memory, and storage utilization.

Abstract:
Cross-modal hashing has attracted much attention in the large-scale multimedia search area. In many real applications, labels of samples have hierarchical structure which also contains much useful information for learning. However, most existing methods are originally designed for non-hierarchical labeled data and thus fail to exploit the rich information of the label hierarchy. In this paper, we propose an effective cross-modal hashing method, named Supervised Hierarchical Deep Cross-modal Hashing, SHDCH for short, to learn hash codes by explicitly delving into the hierarchical labels. Specifically, both the similarity at each layer of the label hierarchy and the relatedness across different layers are implanted into the hash-code learning. Besides, an iterative optimization algorithm is proposed to directly learn the discrete hash codes instead of relaxing the binary constraints. We conducted extensive experiments on two real-world datasets and the experimental results show the superior performance of SHDCH over several state-of-the-art methods.

Abstract:
The key to efficient person search is jointly localizing pedestrians and learning discriminative representations for person re-identification (re-ID). Some recently developed task-joint models are built with separate detection and re-ID branches on top of shared region feature extraction networks, where the large receptive field of neurons leads to background information redundancy for the following re-ID task. Our diagnostic analysis indicates that the task-joint model suffers from a considerable performance drop when the background is replaced or removed. In this work, we propose a subnet, termed the bottom-up fusion (BUF) network, to fuse the bounding box features pooled from multiple ConvNet stages in a bottom-up manner. With only a few parameters introduced, BUF leverages the multi-level features with different sizes of receptive fields to mitigate the background-bias problem. Moreover, the newly introduced segmentation head generates a foreground probability map as guidance for the network to focus on the foreground regions. The resulting foreground attention module (FAM) enhances the foreground features. Extensive experiments on PRW and CUHK-SYSU validate the effectiveness of the proposals. Our Bottom-Up Foreground-Aware Feature Fusion (BUFF) network achieves considerable gains over state-of-the-art methods on PRW and competitive performance on CUHK-SYSU.

Abstract:
Person search has recently gained increasing attention as the novel task of localizing and identifying a target pedestrian from a gallery of non-cropped scene images. Its performance depends on accurate person detection and re-identification simultaneously by learning effective representations. In this work, we propose a novel dual context-aware refinement network (DCRNet) for person search, which jointly explores two kinds of contexts including intra-instance context and inter-instance context to learn discriminative representation. Specifically, an intra-instance context module is designed to refine the representation for the bounding box of a pedestrian by leveraging its surrounding regions covering the same pedestrian and its accessories, which contain abundant complementary visual appearance of pedestrians. Moreover, an inter-instance context module is proposed to expand the instance-level feature for the bounding box of a pedestrian, by utilizing the rich scene contexts of neighboring co-travelers across images. These two modules are built on top of a joint detection and feature learning framework, i.e., Faster R-CNN. Extensive experimental results on two challenging datasets have demonstrated the effectiveness of DCRNet with significant performance improvements over state-of-the-art methods.

Abstract:
With rising awareness of environmental protection and recycling, second-hand trading platforms have attracted increasing attention in recent years. The interaction data on second-hand trading platforms, consisting of sufficient interactions per user but rare interactions per item, differs from that on traditional platforms. Therefore, building successful recommendation systems for second-hand trading platforms requires balancing the modeling of items' and users' preferences and mitigating the adverse effects of sparsity, which makes recommendation especially challenging. Accordingly, we propose a method to simultaneously learn representations of items and users from coarse-grained and fine-grained features, and a multi-task learning strategy is designed to address the issue of data sparsity. Experiments conducted on a real-world second-hand trading platform dataset demonstrate the effectiveness of our proposed model.

Abstract:
In this paper, we study the problem of weakly-supervised spatio-temporal grounding from raw untrimmed video streams. Given a video and its descriptive sentence, spatio-temporal grounding aims at predicting the temporal occurrence and spatial locations of each query object across frames. Our goal is to learn a grounding model in a weakly-supervised fashion, without the supervision of either spatial bounding boxes or temporal occurrences during training. Existing methods address trimmed videos, but their reliance on object tracking easily fails in untrimmed videos due to frequent camera shot cuts. To this end, we propose a novel spatio-temporal multiple instance learning (MIL) framework for untrimmed video grounding. Spatial MIL and temporal MIL are mutually guided to ground each query to specific spatial regions and the occurring frames of a video. Furthermore, the activity described in the sentence is captured so that its informative contextual cues can be used for region proposal refinement and text representation. We conduct extensive evaluation on the YouCookII and RoboWatch datasets, and demonstrate that our method outperforms state-of-the-art methods.

Abstract:
Multi-view learning reveals the latent correlation between different input modalities and has achieved outstanding performance in many fields. Recent approaches aim to find a low-dimensional subspace to reconstruct each view, in which the gross residual or noise follows either a Gaussian or a Laplacian distribution. However, the noise distribution is often more complex in practical applications, and a deterministic distribution assumption is incapable of modeling it. Additionally, for time-varying data, e.g., videos, the noise is temporally smooth, preventing us from processing the whole input at once, as is generally done in many existing multi-view learning methods. To tackle these problems, a novel online multi-view subspace learning method is proposed in this paper. In particular, our proposed method not only estimates a transformation for each view to extract the correlation among various views, but also introduces a Mixture of Gaussians (MoG) model into the multi-view data, exploiting a number of Gaussian distributions to adaptively fit a wider range of complex noise. Furthermore, we design a novel online Expectation Maximization (EM) algorithm capable of efficiently processing the dynamic data. Experimental results substantiate the effectiveness and superiority of our approach.
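
As a rough illustration of how a Mixture of Gaussians can adapt to complex noise, the sketch below computes the E-step responsibilities of a scalar residual under a small mixture. It is not the paper's online EM algorithm: the zero-mean components, the function names and all numeric values are assumptions made only for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def responsibilities(residual, weights, variances):
    """E-step: posterior probability that each component generated the residual.

    Components are assumed zero-mean (an assumption of this sketch).
    """
    joint = [w * gaussian_pdf(residual, 0.0, v) for w, v in zip(weights, variances)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical mixture: a narrow component for fine noise, a wide one for gross errors.
weights = [0.5, 0.5]
variances = [0.01, 4.0]

small = responsibilities(0.1, weights, variances)  # small residual
large = responsibilities(3.0, weights, variances)  # large residual
```

In this toy setting the narrow component takes most of the responsibility for the small residual and the wide component for the large one, which is the sense in which several Gaussians can cover a broader range of noise than a single fixed distribution.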

Abstract:
We address the challenging task of event localization, which requires the machine to localize an event and recognize its category in unconstrained videos. Most existing methods leverage only the visual information of a video while neglecting its audio information, which, however, can be very helpful and important for event localization. For example, humans often recognize an event by reasoning with the visual and audio content simultaneously. Moreover, the audio information can guide the model to pay more attention to the informative regions of visual scenes, which can help to reduce the interference brought by the background. Motivated by these observations, in this paper, we propose a relation-aware network to leverage both audio and visual information for accurate event localization. Specifically, to reduce the interference brought by the background, we propose an audio-guided spatial-channel attention module to guide the model to focus on event-relevant visual regions. Besides, we propose to build connections between visual and audio modalities with a relation-aware module. In particular, we learn the representations of video and/or audio segments by aggregating information from the other modality according to the cross-modal relations. Lastly, relying on the relation-aware representations, we conduct event localization by predicting the event-relevant score and classification score. Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art methods in both supervised and weakly-supervised AVE settings.

Abstract:
Image captioning is gaining significance in multiple applications such as content-based visual search and chat-bots. Much of the recent progress in this field embraces a data-driven approach without deep consideration of human behavioural characteristics. In this paper, we focus on human-centered automatic image captioning. Our study is based on the intuition that different people will generate a variety of image captions for the same scene, as their knowledge and opinion about the scene may differ. In particular, we first perform a series of human studies to investigate what influences human description of a visual scene. We identify three main factors: a person's knowledge level of the scene, opinion on the scene, and gender. Based on our human study findings, we propose a novel human-centered algorithm that is able to generate human-like image captions. We evaluate the proposed model through traditional evaluation metrics, diversity metrics, and human-based evaluation. Experimental results demonstrate the superiority of our proposed model on generating diverse human-like image captions.

Abstract:
Trackers based on Siamese networks have shown tremendous success because of their balance between accuracy and speed. Nevertheless, with tracking scenarios becoming more and more sophisticated, most existing Siamese-based approaches ignore the problem of distinguishing the tracking target from hard negative samples in the tracking phase. The features learned by these networks lack discrimination, which significantly weakens the robustness of Siamese-based trackers and leads to suboptimal performance. To address this issue, we propose a simple yet efficient hard negative sample emphasis method, which constrains the Siamese network to learn features that are aware of hard negative samples and enhances the discrimination of embedding features. Through a distance constraint, we shorten the distance between the exemplar vector and positive vectors while enlarging the distance between the exemplar vector and hard negative vectors. Furthermore, we explore a novel anchor-free tracking framework in a per-pixel prediction fashion, which can significantly reduce the number of hyper-parameters and simplify the tracking process by taking full advantage of the representation of the convolutional neural network. Extensive experiments on six standard benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art approaches.
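
The distance constraint described above can be illustrated with a margin-based loss over an exemplar, a positive and a hard negative embedding. The abstract does not give the exact formulation, so the triplet-style hinge, the Euclidean distance, the margin value and the function names below are all assumptions of this sketch rather than the authors' loss.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hard_negative_loss(exemplar, positive, hard_negative, margin=1.0):
    """Hinge on (d_pos - d_neg + margin): zero once the hard negative is
    at least `margin` farther from the exemplar than the positive is."""
    d_pos = euclidean(exemplar, positive)
    d_neg = euclidean(exemplar, hard_negative)
    return max(0.0, d_pos - d_neg + margin)

exemplar = [0.0, 0.0]
positive = [0.0, 1.0]

# A distant negative already satisfies the constraint; a nearby one is penalized.
loss_far = hard_negative_loss(exemplar, positive, [3.0, 4.0])
loss_near = hard_negative_loss(exemplar, positive, [0.0, 1.5])
```

Minimizing such a loss pulls positives toward the exemplar and pushes hard negatives away, which matches the qualitative behavior the abstract describes.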

Abstract:
Most existing text-to-image synthesis tasks are static single-turn generation, based on pre-defined textual descriptions of images. To explore more practical and interactive real-life applications, we introduce a new task - Interactive Image Editing, where users can guide an agent to edit images via multi-turn textual commands on-the-fly. In each session, the agent takes a natural language description from the user as the input, and modifies the image generated in previous turn to a new design, following the user description. The main challenges in this sequential and interactive image generation task are two-fold: 1) contextual consistency between a generated image and the provided textual description; 2) step-by-step region-level modification to maintain visual consistency across the generated image sequence in each session. To address these challenges, we propose a novel Sequential Attention Generative Adversarial Network (SeqAttnGAN), which applies a neural state tracker to encode the previous image and the textual description in each turn of the sequence, and uses a GAN framework to generate a modified version of the image that is consistent with the preceding images and coherent with the description. To achieve better region-specific refinement, we also introduce a sequential attention mechanism into the model. To benchmark on the new task, we introduce two new datasets, Zap-Seq and DeepFashion-Seq, which contain multi-turn sessions with image-description sequences in the fashion domain. Experiments on both datasets show that the proposed SeqAttnGAN model outperforms state-of-the-art approaches on the interactive image editing task across all evaluation metrics including visual quality, image sequence coherence and text-image consistency.

Abstract:
Portraits of No One is an internet artwork that generates and displays artificial photo-realistic portraits of human faces. This artwork assumes the form of a web page that synthesises new portraits by automatically recombining the facial features of the users who interacted with it. The generated portraits invoke the capabilities of Artificial Intelligence to generate visual content that makes people question themselves about the veracity of what they are seeing.

Abstract:
To support the replication of "Instance of Interest Detection", which was presented at MM'19, this companion paper provides the details of the artifacts. Instance of Interest Detection (IOID) aims to provide instance-level user interest modeling for image semantic description. In this paper, we explain the file structure of the source code and publish the details of our IOID dataset, which can be used to retrain the model with custom parameters. We also provide a program for component analysis to help other researchers to do experiments with alternative models that are not included in our experiments. Moreover, we provide a demo program for using our model easily.

Abstract:
Combining video streaming and online retailing (V2R) has been a growing trend recently. In this paper, we provide practitioners and researchers in multimedia with a cloud-based platform named Hysia for easy development and deployment of V2R applications. The system consists of: 1) a back-end infrastructure providing optimized V2R related services including data engine, model repository, model serving and content matching; and 2) an application layer which enables rapid V2R application prototyping. Hysia addresses industry and academic needs in large-scale multimedia by: 1) seamlessly integrating state-of-the-art libraries including NVIDIA video SDK, Facebook faiss, and gRPC; 2) efficiently utilizing GPU computation; and 3) allowing developers to bind new models easily to meet the rapidly changing deep learning (DL) techniques. On top of that, we implement an orchestrator for further optimizing DL model serving performance. Hysia has been released as an open source project on GitHub, and attracted considerable attention. We have published Hysia to DockerHub as an official image for seamless integration and deployment in current cloud environments.

Abstract:
With the proliferation of the online fashion industry, there have been increased efforts towards building cutting-edge solutions for personalising fashion recommendation. Despite this, the technology is still limited by its poor performance on new entities, i.e. the cold-start problem. We attempt to address the cold-start problem for new users, by leveraging a novel visual preference modelling approach on a small set of input images. Additionally, we describe our proposed strategy to incorporate the modelled preference in occasion-oriented outfit recommendation. Finally, we propose Fashionist: a real-time web application to demonstrate our approach enabling personalised and diverse outfit recommendation for cold-start scenarios. Check out https://youtu.be/kuKgPCkoPy0 for demonstration.

Abstract:
Although person Re-Identification (ReID) is widely applied in a variety of multimedia systems, most of its essentially multifaceted output is evaluated and visualized inflexibly, using only a list of images ranked by the similarity of image content, while the correlations between samples of different IDs and the spatial-temporal features of the images are underinvestigated. As system operators need comfortable access to these important elements, we introduce an interactive design of a person ReID system to visualize these quantities. We demonstrate that a system offering these visual representations can effectively expedite and improve a person re-identification analysis and make it a much more user-friendly experience.

Abstract:
Video relation detection problem refers to the detection of the relationship between different objects in videos, such as spatial relationship and action relationship. In this paper, we present video relation detection with trajectory-aware multi-modal features to solve this task. Considering the complexity of doing visual relation detection in videos, we decompose this task into three sub-tasks: object detection, trajectory proposal and relation prediction. We use the state-of-the-art object detection method to ensure the accuracy of object trajectory detection and multi-modal feature representation to help the prediction of relation between objects. Our method won the first place on the video relation detection task of Video Relation Understanding Grand Challenge in ACM Multimedia 2020 with 11.74% mAP, which surpasses other methods by a large margin.

Abstract:
The Pre-training for Video Captioning Challenge 2020 mainly focuses on developing video captioning systems by pre-training on the newly released large-scale Auto-captions on GIF dataset and further transferring the pre-trained model to the MSR-VTT benchmark. As a part of the submission to this challenge, we propose a Transformer-based framework named VideoTRM, which consists of four modules: a textual encoder for encoding the linguistic relationship among words in the input sentence, a visual encoder for capturing the temporal dynamics in the input video, a cross-modal encoder for modeling the interactions between the two modalities (i.e., textual and visual) and a decoder for sentence generation conditioned on the input video and previously generated words. Additionally, during fine-tuning we extend the decoder in our VideoTRM with mesh-like connections and a gate fusion mechanism in multi-head attention, to take advantage of multi-level visual features and bypass less informative attention results, respectively. In the evaluation on the test server, our VideoTRM achieves superior performance and ranks second on the final leaderboard.

Abstract:
Video captioning has recently come to play an important role in computer vision. We participate in the Pre-training for Video Captioning Challenge, which aims to produce at least one sentence for each challenge video based on pre-trained models. In this work, we propose a tag guidance module to learn a representation that better models the cross-modal interaction between visual content and textual sentences. First, we utilize three types of feature extraction networks to fully capture 2D, 3D and object information. Second, to address overfitting and training-time issues, the entire training process is divided into two stages. The first stage trains on all data, and the second stage introduces a random dropout. Furthermore, we train a CNN-based network to pick out the best candidate results. In summary, we ranked third in the Pre-training for Video Captioning Challenge, which demonstrates the effectiveness of our model.

Abstract:
The BioMedia 2020 ACM Multimedia Grand Challenge is the second in a series of competitions focusing on the use of multimedia for different medical use-cases. In this year's challenge, participants are asked to develop algorithms that automatically predict the quality of a given human semen sample using a combination of visual, patient-related, and laboratory-analysis-related data. Compared to last year's challenge, participants are provided with a fully multimodal dataset (videos, analysis data, study participant data) from the field of assisted human reproduction. The tasks encourage the use of the different modalities contained within the dataset and finding smart ways of how they may be combined to further improve prediction accuracy. For example, using only video data or combining video data and patient-related data. The ground truth was developed through a preliminary analysis done by medical experts following the World Health Organization's standard for semen quality assessment. The task lays the basis for automatic, real-time support systems for artificial reproduction. We hope that this challenge motivates multimedia researchers to explore more medical-related applications and use their vast knowledge to make a real impact on people's lives.

Abstract:
Human action recognition, an important application of computer vision, has been studied for decades. Among various approaches, skeleton-based methods have recently attracted increasing attention due to their robust and superior performance. However, existing skeleton-based methods ignore the potential action relationships between different persons, while the action of a person is highly likely to be impacted by another person, especially in complex events. In this paper, we propose a novel group-skeleton-based human action recognition method for complex events. This method first utilizes multi-scale spatial-temporal graph convolutional networks (MS-G3Ds) to extract skeleton features from multiple persons. In addition to the traditional key point coordinates, we also input the key point speed values to the networks for better performance. Then we use multilayer perceptrons (MLPs) to embed the distance values between the reference person and other persons into the extracted features. Lastly, all the features are fed into another MS-G3D for feature fusion and classification. To mitigate class imbalance problems, the networks are trained with a focal loss. The proposed algorithm is also our solution for the Large-scale Human-centric Video Analysis in Complex Events Challenge. Results on the HiEve dataset show that our method gives superior performance compared to other state-of-the-art methods.
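
To show why a focal loss helps with class imbalance, the following minimal sketch evaluates the standard focal loss, -alpha * (1 - p_t)^gamma * log(p_t), for a single sample. The values of `gamma` and `alpha`, the example probabilities and the function name are assumptions for illustration; the abstract does not state the settings used in the paper.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given predicted class probabilities.

    With gamma = 0 and alpha = 1 this reduces to ordinary cross-entropy.
    """
    p_t = probs[target]  # predicted probability of the true class
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# A well-classified ("easy") sample versus a poorly classified ("hard") one.
easy = focal_loss([0.9, 0.05, 0.05], target=0)
hard = focal_loss([0.2, 0.5, 0.3], target=0)

# Same easy sample under plain cross-entropy (gamma = 0) for comparison.
cross_entropy_easy = focal_loss([0.9, 0.05, 0.05], target=0, gamma=0.0)
```

The modulating factor (1 - p_t)^gamma shrinks the contribution of easy samples, so frequent, well-classified actions dominate the gradient less than they would under cross-entropy.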

Abstract:
Self-supervised learning of representations has important potential applications in human behaviour understanding. The ability to learn useful representations from large unlabeled datasets by modeling intrinsic properties of the data has been successfully employed in various fields of machine learning, often outperforming transfer learning or fully supervised training. My research interests lie in applying these ideas to multimodal human-centric data. In this extended abstract, I present the direction of research that I have followed during the first half of my PhD, along with ideas and work in progress for the second half. My completed research so far demonstrates the potential of cross-modal self-supervision for audio representation learning, especially on small downstream datasets. I want to explore similar ideas for visual and multimodal representation learning, and apply them to speech and emotion recognition and multimodal question answering.

Abstract:
Recent literature addressed the monocular multi-person 3D human pose estimation task very satisfactorily. In these studies, different persons are usually treated as independent pose instances to estimate. However, in many every-day situations, people are interacting, and different pose instances should be considered jointly since the pose of an individual depends on the pose of his/her interactees. This work aims to develop machine learning techniques for human pose estimation of persons involved in complex interactions, using the interaction information to improve the performance.

Abstract:
Owing to the rich emerging multimedia applications and services in the past decade, super large amount of multimedia data has been produced for the purpose of advanced research in multimedia. Furthermore, multimedia research has made great progress on image/video content analysis, multimedia search and recommendation, multimedia streaming, multimedia content delivery etc. At the same time, Artificial Intelligence (AI) has undergone a "new" wave of development since being officially regarded as an academic discipline in 1950s, which should give credits to the extreme success of deep learning. Thus, one question naturally arises: What happens when multimedia meets Artificial Intelligence?

Abstract:
New immersive imaging technologies enable creating multimedia systems that would increase the viewer presence and provide an immersive experience. This half-day tutorial aims to give an overview of these new immersive imaging systems and help the participants understand the content creation and delivery pipeline for the immersive imaging technologies. The tutorial will go over the full imaging pipeline, from camera setup for content capture, through content compression / streaming, to content display and related perceptual studies.

Abstract:
Instance Re-identification (ReID) system facilitates various applications that require painful and boring video watching. Its efficiency and effectiveness accelerate the process of video analysis. In this tutorial, we summarize ReID technologies and provide an overview. We'll introduce fundamental technologies, existing challenges, trends, etc. This tutorial would be useful for multimedia content analysis and system-level multimedia retrieval, especially for an effective and efficient open-world ReID system for the practical, large-scale, and open-set domain.

Abstract:
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area due to the significant spatial and temporal shifts across the source (i.e. training) and target (i.e. test) domains. As such, recent works on visual domain adaptation which leverage adversarial learning to unify the source and target video representations and strengthen the feature transferability are not highly effective on the videos. To overcome this limitation, in this paper, we learn a domain-agnostic video classifier instead of learning domain-invariant representations, and propose an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions with a network topology of the bipartite graph. Specifically, the source and target frames are sampled as heterogeneous vertexes while the edges connecting two types of nodes measure the affinity among them. Through message-passing, each vertex aggregates the features from its heterogeneous neighbors, forcing the features coming from the same class to be mixed evenly. Explicitly exposing the video classifier to such cross-domain representations at the training and test stages makes our model less biased to the labeled source data, which in-turn results in achieving a better generalization on the target domain. The proposed framework is agnostic to the choices of frame aggregation, and therefore, four different aggregation functions are investigated for capturing appearance and temporal dynamics. To further enhance the model capacity and testify the robustness of the proposed architecture on difficult transfer tasks, we extend our model to work in a semi-supervised setting using an additional video-level bipartite graph. Extensive experiments conducted on four benchmark datasets evidence the effectiveness of the proposed approach over the state-of-the-art methods on the task of video recognition.

Abstract:
Image inpainting methods usually fail to reconstruct reasonable structure and fine-grained texture simultaneously. This paper handles this problem from the novel perspective of predicting low-frequency semantic structural contents and high-frequency detailed textures separately, and proposes a multi-frequency probabilistic inference model (MPI model) to predict the multi-frequency information of missing regions by estimating the parametric distribution of multi-frequency features over the corresponding latent spaces. Firstly, in order to extract the information of different frequencies without any interference, a wavelet transform is utilized to decompose the input image into a low-frequency subband and high-frequency subbands. Furthermore, an MPI model is designed to estimate the underlying multi-frequency distribution of input images. With this model, a closer approximation to the true posterior distribution can be constrained and the maximum-likelihood assignment can be approximated. Finally, based on the proposed MPI model, a two-path network consisting of an inference network (InferenceNet) and a generation network (GenerationNet) is trained in parallel to enforce the consistency of global structure and local texture between the generated image and the ground truth. We qualitatively and quantitatively compare our method with other state-of-the-art methods on the Paris StreetView, CelebA, CelebAMask-HQ and Places2 datasets. The results show the superior performance of our method, especially in terms of realistic texture details and semantic structural consistency.
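
The wavelet decomposition step can be illustrated with a one-level 2D Haar transform, which splits an image into one low-frequency subband and three high-frequency detail subbands. The abstract does not say which wavelet is used, so the Haar basis, the function name and the subband names here are assumptions of this sketch.

```python
def haar_dwt2(image):
    """One-level orthonormal 2D Haar transform.

    `image` is a list of rows with even height and width. Returns four
    half-resolution subbands: the low-frequency approximation and the
    column-wise, row-wise and diagonal high-frequency details.
    """
    low, detail_cols, detail_rows, detail_diag = [], [], [], []
    for i in range(0, len(image), 2):
        r_low, r_cols, r_rows, r_diag = [], [], [], []
        for j in range(0, len(image[0]), 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            r_low.append((a + b + c + d) / 2.0)   # local average (low frequency)
            r_cols.append((a - b + c - d) / 2.0)  # change across columns
            r_rows.append((a + b - c - d) / 2.0)  # change across rows
            r_diag.append((a - b - c + d) / 2.0)  # diagonal change
        low.append(r_low)
        detail_cols.append(r_cols)
        detail_rows.append(r_rows)
        detail_diag.append(r_diag)
    return low, detail_cols, detail_rows, detail_diag

flat = haar_dwt2([[1.0, 1.0], [1.0, 1.0]])      # constant patch: no detail
textured = haar_dwt2([[1.0, 2.0], [3.0, 4.0]])  # patch with variation
```

A constant patch produces zero detail coefficients, while a varying patch moves part of its energy into the detail subbands. This separation is what lets structure and texture be modeled along different paths.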

Abstract:
Unsupervised crowd counting is a challenging yet largely unexplored task. In this paper, we explore it in a transfer learning setting where we learn to detect and count persons in an unlabeled target set by transferring bi-knowledge learnt from regression- and detection-based models in a labeled source set. The dual source knowledge of the two models is heterogeneous and complementary as they capture different modalities of the crowd distribution. We formulate the mutual transformations between the outputs of regression- and detection-based models as two scene-agnostic transformers which enable knowledge distillation between the two models. Given the regression- and detection-based models and their mutual transformers learnt in the source, we introduce an iterative self-supervised learning scheme with regression-detection bi-knowledge transfer in the target. Extensive experiments on the standard crowd counting benchmarks ShanghaiTech, UCF_CC_50, and UCF_QNRF demonstrate a substantial improvement of our method over other state-of-the-art methods in the transfer learning setting.

Abstract:
Few-shot learning (FSL) aims at recognizing novel classes given only a few training samples, which remains a great challenge for deep learning. However, humans can easily recognize novel classes with only a few samples. A key component of this ability is the compositional recognition that humans can perform, which has been well studied in cognitive science but is not well explored in FSL. Inspired by this capability, and to imitate humans' ability to learn visual primitives and compose them to recognize novel classes, we propose an approach to FSL that learns a feature representation composed of important primitives, which is jointly trained with two parts, i.e., primitive discovery and primitive enhancing. In primitive discovery, we focus on learning primitives related to object parts by self-supervision from the order of image splits, avoiding extra laborious annotations and alleviating the effect of semantic gaps. In primitive enhancing, inspired by current studies on the interpretability of deep networks, we provide our composition view for the FSL baseline model. To modify this model for effective composition, inspired by both mathematical deduction and biological studies (the Hebbian Learning rule and the Winner-Take-All mechanism), we propose a soft composition mechanism by enlarging the activation of important primitives while reducing that of others, so as to enhance the influence of important primitives and better utilize these primitives to compose novel classes. Extensive experiments on public benchmarks are conducted on both the few-shot image classification and video recognition tasks. Our method achieves state-of-the-art performance on all these datasets and shows better interpretability.

Abstract:
Existing deep learning based weakly supervised fine-grained image recognition (WFGIR) methods usually pick out the discriminative regions from the high-level feature (HLF) maps directly. However, as HLF maps are derived from spatial aggregation of convolution, which is basically a pattern matching process that applies fixed filters, they are ineffective at modeling visual contents of the same semantics but varying posture or perspective. We argue that, as a result, the selected discriminative regions of the same sub-category do not correspond semantically, which degrades the WFGIR performance. To address this issue, we propose an end-to-end Category-specific Semantic Coherency Network (CSC-Net) to semantically align the discriminative regions of the same subcategory. Specifically, CSC-Net consists of: 1) Local-to-Attribute Projecting Module (LPM), which automatically learns a set of latent attributes via collecting the category-specific semantic details while eliminating the varying spatial distributions from the local regions; 2) Latent Attribute Aligning (LAA), which aligns the latent attributes to specific semantics via graph convolution based on their discriminability, to achieve category-specific semantic coherency; 3) Attribute-to-Local Resuming Module (ARM), which resumes the original Euclidean space of latent attributes and constructs latent attribute aligned feature maps by a location-embedding graph unpooling operation. Finally, the new feature maps, which implicitly apply the category-specific semantic coherency, are used for more accurate discriminative region localization. Extensive experiments verify that CSC-Net yields the best performance under the same settings compared with the most competitive approaches on the CUB Bird, Stanford-Cars, and FGVC Aircraft datasets.

Abstract:
Fast appearance variations and the distractions of similar objects are two of the most challenging problems in visual object tracking. Unlike many existing trackers that focus on modeling only the target, in this work, we consider the transient variations of the whole scene. The key insight is that the object correspondence and spatial layout of the whole scene are consistent (i.e., global structure consistency) in consecutive frames, which helps to disambiguate the target from distractors. Moreover, modeling transient variations makes it possible to localize the target under fast variations. Specifically, we propose an effective and efficient short-term model that learns to exploit the global structure consistency in a short time and thus can handle fast variations and distractors. Since short-term modeling falls short of handling occlusion and out-of-view targets, we adopt the long-short term paradigm and use a long-term model that corrects the short-term model when it drifts away from the target or the target is not present. These two components are carefully combined to achieve the balance of stability and plasticity during tracking. We empirically verify that the proposed tracker can tackle the two challenging scenarios and validate it on large scale benchmarks. Remarkably, our tracker improves state-of-the-art performance on VOT2018 from 0.440 to 0.460, GOT-10k from 0.611 to 0.640, and NFS from 0.619 to 0.629.

Abstract:
In online learning systems, measuring the similarity between educational videos and exercises is a fundamental task with great application potential. In this paper, we explore measuring the fine-grained similarity by leveraging multimodal information. The problem remains largely open due to several domain-specific characteristics. First, unlike general videos, educational videos contain not only graphics but also text and formulas, which have a fixed reading order. Both spatial and temporal information embedded in the frames should be modeled. Second, there are semantic associations between adjacent video segments. These semantic associations affect the similarity, and different exercises usually focus on related context of different ranges. Third, the fine-grained labeled data for training the model is scarce and costly. To tackle the aforementioned challenges, we propose VENet to measure the similarity at both video-level and segment-level by exploiting only the video-level labeled data. Extensive experimental results on real-world data demonstrate the effectiveness of VENet.

Abstract:
One of the most difficult things in practicing musical instruments is improving timbre. Unlike pitch and rhythm, timbre is a high-dimensional and sensuous concept, and learners cannot evaluate their timbre by themselves. To efficiently improve their timbre control, learners generally need a teacher to provide feedback about timbre. However, hiring teachers is often expensive and sometimes difficult. Our goal is to develop a low-cost learning system that substitutes the teacher. We found that a variational autoencoder (VAE), which is an unsupervised neural network model, provides a 2-dimensional user-friendly mapping of timbre. Our system, SonoSpace, maps the learner's timbre into a 2D latent space extracted from an advanced player's performance. Seeing this 2D latent space, the learner can visually grasp the relative distance between their timbre and that of the advanced player. Although our system was evaluated mainly with an alto saxophone, SonoSpace could also be applied to other instruments, such as trumpets, flutes, and drums.

Abstract:
Deep learning based face parsing methods have attained state-of-the-art performance in recent years. Their superior performance heavily depends on the large-scale annotated training data. However, it is expensive and time-consuming to construct a large-scale pixel-level manually annotated dataset for face parsing. To alleviate this issue, we propose a novel Dual-Structure Disentangling Variational Generation (D2VG) network. Benefiting from the interpretable factorized latent disentanglement in VAE, D2VG can learn a joint structural distribution of facial image and its corresponding parsing map. Owing to these, it can synthesize large-scale paired face images and parsing maps from a standard Gaussian distribution. Then, we adopt both manually annotated and synthesized data to train a face parsing model in a supervised way. Since there are inaccurate pixel-level labels in synthesized parsing maps, we introduce a coarseness-tolerant learning algorithm, to effectively handle these noisy or uncertain labels. In this way, we can significantly boost the performance of face parsing. Extensive quantitative and qualitative results on HELEN, CelebAMask-HQ and LaPa demonstrate the superiority of our methods.

Abstract:
Recent image-specific Generative Adversarial Networks (GANs) provide a way to learn generative models from a single image instead of a large dataset. However, the semantic meaning of patches inside a single image is less explored. In this work, we first define the task of Semantic Image Analogy: given a source image and its segmentation map, along with another target segmentation map, synthesizing a new image that matches the appearance of the source image as well as the semantic layout of the target segmentation. To accomplish this task, we propose a novel method to model the patch-level correspondence between semantic layout and appearance of a single image by training a single-image GAN that takes semantic labels as conditional input. Once trained, a controllable redistribution of patches from the training image can be obtained by providing the expected semantic layout as spatial guidance. The proposed method contains three essential parts: 1) a self-supervised training framework, with a progressive data augmentation strategy and an alternating optimization procedure; 2) a semantic feature translation module that predicts transformation parameters in the image domain from the segmentation domain; and 3) a semantics-aware patch-wise loss that explicitly measures the similarity of two images in terms of patch distribution. Compared with existing solutions, our method generates much more realistic results given arbitrary semantic labels as conditional input.

Abstract:
Vehicle re-identification aims to identify the same vehicle across different surveillance cameras and plays an important role in public security. Existing approaches mainly focus on exploring informative regions or learning an appropriate distance metric. However, they not only neglect the inherent structured relationship between discriminative regions within an image, but also ignore the extrinsic structured relationship among images. The inherent and extrinsic structured relationships are crucial to learning effective vehicle representation. In this paper, we propose a Structured Graph ATtention network (SGAT) to fully exploit these relationships and allow the message propagation to update the features of graph nodes. SGAT creates two graphs for one probe image. One is an inherent structured graph based on the geometric relationship between the landmarks that can use features of their neighbors to enhance themselves. The other is an extrinsic structured graph guided by the attribute similarity to update image representations. Experimental results on two public vehicle re-identification datasets including VeRi-776 and VehicleID have shown that our proposed method achieves significant improvements over the state-of-the-art methods.

Abstract:
Benefiting from recent advances in deep learning, deep hashing methods have achieved promising performance in large-scale image retrieval. To improve storage and computational efficiency, existing hash codes need to be compressed accordingly. However, previous deep hashing methods have to retrain their models and then regenerate the whole database codes using the new models when code length changes, which is time consuming especially for large image databases. In this paper, we propose a novel deep hashing method, called Code Compression oriented Deep Hashing (CCDH), for efficiently compressing hash codes. CCDH learns deep hash functions for query images, while learning a one-hidden-layer Variational Autoencoder (VAE) from existing hash codes. With such asymmetric design, CCDH can efficiently compress database codes only using the learned encoder of VAE. Furthermore, CCDH is flexible enough to be used with a variety of deep hashing methods. Extensive experiments on three widely used image retrieval benchmarks demonstrate that CCDH can significantly reduce the cost for compressing database codes when code length changes while keeping the state-of-the-art retrieval accuracy.

Abstract:
Image aesthetic assessment involves both fine-grained details and the holistic layout of images. However, most current approaches learn the local and the holistic information separately, which risks losing contextual information. Additionally, learning-based methods mainly cast image aesthetic assessment as a binary classification or a regression problem, which cannot sufficiently delineate the potential diversity of human aesthetic experience. To address these limitations, we attempt to render the contextual information and model the varieties of aesthetic experience. Specifically, we explore a context-aware attention module in two dimensions: hierarchical and spatial. The hierarchical context is introduced to capture multi-level aesthetic details, while the spatial context serves to yield the long-range perception of images. Based on the attention model, we predict the distribution of human aesthetic ratings of images, which reflects the diversity and similarity of human subjective opinions. We conduct extensive experiments on the prevailing AVA dataset to validate the effectiveness of our approach. Experimental results demonstrate that our approach achieves state-of-the-art results.

Abstract:
Hand gesture interaction is a key component in Augmented Reality (AR) / Mixed Reality (MR). Users usually interact with AR/MR devices, e.g., Microsoft HoloLens, etc., via hand gestures to express their intentions and the devices will recognize the gestures and respond accordingly to users. However, the use of such technique so far is limited to only a few less-expressive hand gestures, which, unfortunately, are insufficient or inadequate to input complex information.

Abstract:
There are phenomena that cannot be measured without subjective testing. However, subjective testing is a complex issue with many influencing factors. These interplay to yield either precise or incorrect results. Researchers require a tool to classify the results of a subjective experiment as either consistent or inconsistent. This is necessary in order to decide whether to treat the gathered scores as quality ground truth data. Knowing if subjective scores can be trusted is key to drawing valid conclusions and building functional tools based on those scores (e.g., algorithms assessing the perceived quality of multimedia materials). We provide a tool to classify a subjective experiment (and all its results) as either consistent or inconsistent. Additionally, the tool identifies stimuli having an irregular score distribution. The approach is based on treating subjective scores as a random variable coming from the discrete Generalized Score Distribution (GSD). The GSD, in combination with a bootstrapped G-test of goodness-of-fit, allows constructing a p-value P-P plot that visualizes an experiment's consistency. The tool safeguards researchers from using inconsistent subjective data. In this way, it makes sure that the conclusions they draw and the tools they build are more precise and trustworthy. The proposed approach works in line with expectations drawn solely on experiment design descriptions of 21 real-life multimedia quality subjective experiments.
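
The bootstrapped G-test named in this abstract can be sketched as follows. This is a simplified illustration, not the authors' tool: it assumes the model's category probabilities are already given (the GSD fitting step is omitted and the model is not re-fitted per bootstrap sample), all probabilities are positive, and the function names are hypothetical.

```python
import numpy as np

def g_statistic(observed, probs):
    """G = 2 * sum(O * ln(O / E)) over score categories with O > 0."""
    observed = np.asarray(observed, dtype=float)
    expected = observed.sum() * np.asarray(probs, dtype=float)
    nonzero = observed > 0  # categories with zero count contribute nothing
    return 2.0 * np.sum(observed[nonzero] * np.log(observed[nonzero] / expected[nonzero]))

def bootstrap_p_value(observed, probs, n_boot=1000, seed=0):
    """Parametric bootstrap: share of simulated G values at least as large as the observed one."""
    rng = np.random.default_rng(seed)
    n = int(np.sum(observed))
    g_obs = g_statistic(observed, probs)
    # Draw score counts from the assumed model and recompute G for each draw.
    samples = rng.multinomial(n, probs, size=n_boot)
    g_boot = np.array([g_statistic(s, probs) for s in samples])
    return float(np.mean(g_boot >= g_obs))

if __name__ == "__main__":
    model = [0.1, 0.2, 0.4, 0.2, 0.1]   # assumed probabilities for scores 1..5
    counts = [3, 5, 9, 5, 2]            # made-up score counts for one stimulus
    print(bootstrap_p_value(counts, model))
```

Computing one such p-value per stimulus and comparing their empirical distribution against the uniform distribution is one way to arrive at the kind of P-P plot the abstract describes.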

Abstract:
The automatic quality assessment of self-media online articles is an urgent and new issue, which is of great value to the online recommendation and search. Different from traditional and well-formed articles, self-media online articles are mainly created by users, which have the appearance characteristics of different text levels and multi-modal hybrid editing, along with the potential characteristics of diverse content, different styles, large semantic spans and good interactive experience requirements. To solve these challenges, we establish a joint model CoQAN in combination with the layout organization, writing characteristics and text semantics, designing different representation learning subnetworks, especially for the feature learning process and interactive reading habits on mobile terminals. It is more consistent with the cognitive style of expressing an expert's evaluation of articles. We have also constructed a large scale real-world assessment dataset. Extensive experimental results show that the proposed framework significantly outperforms state-of-the-art methods, and effectively learns and integrates different factors of the online article quality assessment.

Abstract:
With the popularization of social websites, many methods have been proposed to explore the noisy tags for weakly-supervised image hashing. The main challenge lies in learning appropriate and sufficient information from those noisy tags. To address this issue, this work proposes a novel Masked visual-semantic Graph-based Reasoning Network, termed MGRN, to learn joint visual-semantic representations for image hashing. Specifically, for each image, MGRN constructs a relation graph to capture the interactions among its associated tags and performs reasoning with Graph Attention Networks (GAT). MGRN randomly masks out one tag and then makes the GAT predict this masked tag. This forces the GAT model to capture the dependence between the image and its associated tags, which can well address the problem of noisy tags. Thus it can capture key tags and visual structures from images to learn well-aligned visual-semantic representations. Finally, an auto-encoder is leveraged to learn hash codes that can preserve the local structure of the joint space. Meanwhile, the joint visual-semantic representations are reconstructed from those hash codes by using a decoder. Experimental results on two widely-used benchmark datasets demonstrate the superiority of the proposed method for image retrieval compared with several state-of-the-art methods.

Abstract:
Content streaming is the dominant application in today's Internet, which is typically distributed via content delivery networks (CDNs). CDNs usually use caching as a means to reduce user access latency so as to enable faster content downloads. Typical analysis of caching systems either focuses on content admission, which decides whether to cache a content, or content eviction to decide which content to evict when the cache is full. This paper instead proposes a novel framework that can simultaneously learn both content admission and content eviction for caching in CDNs. To attain this goal, we first put forward a lightweight architecture for content next request time prediction. We then leverage reinforcement learning (RL) along with the prediction to learn the time-varying content popularities for content admission, and develop a simple threshold-based model for content eviction. We call this new algorithm RL-Bélády (RLB). In addition, we address several key challenges to design learning-based caching algorithms, including how to guarantee lightweight training and prediction with both content eviction and admission in consideration, limit memory overhead, reduce randomness and improve robustness in RL stochastic optimization. Our evaluation results using 3 production CDN datasets show that RLB can consistently outperform state-of-the-art methods with dramatically reduced running time and modest overhead.
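
For context on the name RL-Bélády, the classic Bélády eviction rule can be sketched as below: on a miss with a full cache, evict the item whose next request lies farthest in the future. This is only the offline baseline with exact knowledge of future requests; the abstract's RLB instead relies on predicted next request times, learned admission and a threshold-based eviction model, none of which are reproduced here. The function name and the always-admit policy are assumptions for illustration.

```python
def belady_hits(trace, capacity):
    """Count cache hits under Bélády's offline rule (capacity must be >= 1)."""
    # next_use[i] = position of the next request for trace[i], or infinity.
    next_use = [float("inf")] * len(trace)
    last_seen = {}
    for i in range(len(trace) - 1, -1, -1):
        next_use[i] = last_seen.get(trace[i], float("inf"))
        last_seen[trace[i]] = i

    cache = {}  # item -> position of its next request
    hits = 0
    for i, item in enumerate(trace):
        if item in cache:
            hits += 1
        elif len(cache) >= capacity:
            # Evict the cached item that will be requested farthest in the future.
            victim = max(cache, key=cache.get)
            del cache[victim]
        # Every requested item is admitted here (an assumption of this sketch).
        cache[item] = next_use[i]
    return hits

if __name__ == "__main__":
    requests = ["a", "b", "c", "a", "b", "d", "a"]
    print(belady_hits(requests, capacity=2))
```

Replacing the exact `next_use` values with model predictions is where a learning-based cache of the kind described above would differ from this baseline.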

Abstract:
Painters can successfully recover severely damaged objects, yet current inpainting algorithms still cannot achieve this ability. Generally, painters will have a conjecture about the seriously damaged image before restoring it, which can be expressed in a text description. This paper imitates the process of painters' conjecture, and proposes to introduce the text description into the image inpainting task for the first time, which provides abundant guidance information for image restoration through the fusion of multimodal features. We propose a multimodal fusion learning method for image inpainting (MMFL). To make better use of text features, we construct an image-adaptive word demand module to reasonably filter the effective text features. We introduce a text guided attention loss and a text-image matching loss to make the network pay more attention to the entities in the text description. Extensive experiments prove that our method can better predict the semantics of objects in the missing regions and generate fine-grained textures.

Abstract:
Most image captioning models achieve superior performance with the help of large-scale supervised training data, but it is prohibitively costly to label the image captions. To solve this problem, we propose a structural semantic adversarial active learning (SSAAL) model that leverages both visual and textual information for deriving the most representative samples while maximizing the image captioning performance. SSAAL consists of a semantic constructor, a snapshot & caption (SC) supervisor, and a labeled/unlabeled state discriminator. The constructor is designed to generate a structural semantic representation describing the objects, attributes and object relationships in the image. The SC supervisor is proposed to supervise this representation at the word-level and sentence-level in a multi-task learning manner, which directly relates the representation to ground-truth captions and updates it in the caption generating process. Finally, we introduce a state discriminator to predict the sample state and select images with sufficient semantic and fine-grained diversity. Extensive experiments on a standard captioning dataset show that our model outperforms other active learning methods and achieves competitive performance even when selecting only a small number of samples.

Abstract:
Modal sound synthesis is a physically-based sound synthesis method used to generate audio content in games and virtual worlds. We present a novel learning-based impact sound synthesis algorithm called Deep-Modal. Our approach can handle sound synthesis for common arbitrary objects, especially dynamically generated objects, in real-time. We present a new compact strategy to represent the mode data, corresponding to frequency and amplitude, as fixed-length vectors. This is combined with a new network architecture that can convert shape features of 3D objects into mode data. Our network is based on an encoder-decoder architecture with the contact positions of objects and external forces embedded. Our method can synthesize interactive sounds related to objects of various shapes at any contact position, as well as objects of different materials and sizes. The synthesis process only takes ~0.01s on a GTX 1080 Ti GPU. We show the effectiveness of Deep-Modal through extensive evaluation using different metrics, including recall and precision of prediction, sound spectrogram, and a user study.
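
To make the notion of mode data concrete, the following minimal sketch renders an impact sound as a sum of exponentially decaying sinusoids, which is the usual textbook form of modal synthesis. The abstract only mentions frequency and amplitude per mode; the per-mode damping values, the function name and all numbers below are assumptions for illustration and are not taken from Deep-Modal.

```python
import numpy as np

def synthesize_modes(freqs_hz, amps, dampings, duration=1.0, sample_rate=44100):
    """Sum of damped sinusoids: sum_k a_k * exp(-d_k * t) * sin(2*pi*f_k*t)."""
    t = np.arange(int(duration * sample_rate)) / sample_rate
    # Shape (modes, 1) so each mode broadcasts against the time axis.
    f = np.asarray(freqs_hz, dtype=float)[:, None]
    a = np.asarray(amps, dtype=float)[:, None]
    d = np.asarray(dampings, dtype=float)[:, None]
    return np.sum(a * np.exp(-d * t) * np.sin(2.0 * np.pi * f * t), axis=0)

if __name__ == "__main__":
    # Three made-up modes; a learned model would predict such vectors from shape features.
    signal = synthesize_modes([440.0, 1250.0, 2930.0], [1.0, 0.5, 0.25], [6.0, 9.0, 14.0])
    print(signal.shape, float(np.max(np.abs(signal))))
```

Representing each object by fixed-length vectors of such mode parameters is what allows a network to output them directly, leaving only this inexpensive summation for run time.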

Abstract:
At this moment, GAN-based image generation methods are still imperfect: their upsampling design tends to leave certain artifact patterns in the synthesized image. Such artifact patterns can be easily exploited (by recent methods) to distinguish real from GAN-synthesized images. However, the existing detection methods put much emphasis on the artifact patterns, which can become futile if such artifact patterns are reduced.

Abstract:
Given an image and a natural language question, Visual Question Answering (VQA) aims at answering the textual question correctly. Most VQA approaches in the literature aim to find answers to the questions based solely on analyzing the given images and questions. Other works that try to incorporate external knowledge into VQA adopt a query-based search on knowledge graphs to obtain the answer. However, these works suffer from the following problem: the model training process heavily relies on the ground-truth knowledge facts which serve as supervised information, so missing these ground-truth knowledge facts during training will lead to failures in producing the correct answers. To solve this challenging issue, we propose a Knowledge Graph Augmented (KG-Aug) model which conducts context-aware knowledge aggregation on external knowledge graphs, requiring no ground-truth knowledge facts for extra supervision. The proposed KG-Aug model is capable of retrieving context-aware knowledge subgraphs given visual images and textual questions, and learning to aggregate the useful image- and question-dependent knowledge which is then utilized to boost the accuracy in answering visual questions. We carry out extensive experiments to validate the effectiveness of our proposed KG-Aug models against several baseline approaches on various datasets.

Abstract:
Referring expression segmentation (RES) aims to segment the target instance in a given image according to a natural language expression. Its main challenge lies in how to quickly and accurately align the text expression to the referred visual instances. In this paper, we focus on addressing this issue by proposing a Cascade Grouped Attention Network (CGAN) with two innovative designs: Cascade Grouped Attention (CGA) and Instance-level Attention (ILA) loss. Specifically, CGA is used to perform step-wise reasoning over the entire image to perceive the differences between instances accurately yet efficiently, so as to identify the referent. ILA loss is further embedded into each step of CGA to directly supervise the attention modeling, which improves the alignments between the text expression and the visual instances. Through these two novel designs, CGAN can achieve the high efficiency of one-stage RES while possessing a strong reasoning ability comparable to the two-stage methods. To validate our model, we conduct extensive experiments on three RES benchmark datasets and achieve significant performance gains over existing one-stage and multi-stage models.

Abstract:
Recently, feature generating methods have been successfully applied to zero-shot learning (ZSL). However, most previous approaches only generate visual representations for zero-shot recognition. In fact, typical ZSL is a classic multi-modal learning protocol which consists of a visual space and a semantic space. In this paper, therefore, we present a new method which can simultaneously generate both visual representations and semantic representations so that the essential multi-modal information associated with unseen classes can be captured. Specifically, we address the most challenging issue in such a paradigm, i.e., how to handle the domain shift and thus guarantee that the learned representations are modality-invariant. To this end, we propose two strategies: 1) leveraging the mutual information between the latent visual representations and the semantic representations; 2) maximizing the entropy of the joint distribution of the two latent representations. By leveraging the two strategies, we argue that the two modalities can be well aligned. Finally, extensive experiments on five widely used datasets verify that the proposed method is able to significantly outperform previous state-of-the-art methods.

Abstract:
We consider the problem of cross-view geo-localization. The primary challenge is to learn the robust feature against large viewpoint changes. Existing benchmarks can help, but are limited in the number of viewpoints. Image pairs, containing two viewpoints, e.g., satellite and ground, are usually provided, which may compromise the feature learning. Besides phone cameras and satellites, in this paper, we argue that drones could serve as the third platform to deal with the geo-localization problem. In contrast to traditional ground-view images, drone-view images meet fewer obstacles, e.g., trees, and provide a comprehensive view when flying around the target place. To verify the effectiveness of the drone platform, we introduce a new multi-view multi-source benchmark for drone-based geo-localization, named University-1652. University-1652 contains data from three platforms, i.e., synthetic drones, satellites and ground cameras of 1,652 university buildings around the world. To our knowledge, University-1652 is the first drone-based geo-localization dataset and enables two new tasks, i.e., drone-view target localization and drone navigation. As the name implies, drone-view target localization intends to predict the location of the target place via drone-view images. On the other hand, given a satellite-view query image, drone navigation is to drive the drone to the area of interest in the query. We use this dataset to analyze a variety of off-the-shelf CNN features and propose a strong CNN baseline on this challenging dataset. The experiments show that University-1652 helps the model to learn viewpoint-invariant features and also has good generalization ability in real-world scenarios.

Abstract:
Video object detection (VOD) has been a rising topic in recent years due to challenges such as occlusion, motion blur, etc. To deal with these challenges, feature aggregation from local or global support frames has been verified to be effective. To exploit better feature aggregation, in this paper, we propose two improvements over previous works: a class-constrained spatial-temporal relation network and a correlation-based feature alignment module. The class-constrained spatial-temporal relation network operates on object region proposals and learns two kinds of relations: (1) the dependencies among region proposals of the same object class from support frames sampled in a long time range or even the whole sequence, and (2) spatial relations among proposals of different objects in the target frame. The homogeneity constraint in the spatial-temporal relation network not only filters out many defective proposals but also implicitly embeds the traditional post-processing strategies (e.g., Seq-NMS), leading to a unified end-to-end trainable network. In the feature alignment module, we propose a correlation-based feature alignment method to align the support and target frames for feature aggregation in the temporal domain. Our experiments show that the proposed method improves the accuracy of single-frame detectors significantly, and outperforms previous temporal or spatial relation networks. Without bells or whistles, the proposed method achieves state-of-the-art performance on the ImageNet VID dataset (84.80% with ResNet-101) without any post-processing methods.

Abstract:
Recent research has demonstrated that adding some imperceptible perturbations to original images can fool deep learning models. However, the current adversarial perturbations are usually shown in the form of noise, and thus have no practical meaning. Image watermarking is a technique widely used for copyright protection. We can regard an image watermark as a kind of meaningful noise: adding it to the original image will not affect people's understanding of the image content, and will not arouse people's suspicion. Therefore, it will be interesting to generate adversarial examples using watermarks. In this paper, we propose a novel watermark perturbation for adversarial examples (Adv-watermark) which combines image watermarking techniques and adversarial example algorithms. Adding a meaningful watermark to the clean images can attack the DNN models. Specifically, we propose a novel optimization algorithm, which is called Basin Hopping Evolution (BHE), to generate adversarial watermarks in the black-box attack mode. Thanks to the BHE, Adv-watermark only requires a few queries from the threat models to finish the attacks. A series of experiments conducted on ImageNet and CASIA-WebFace datasets show that the proposed method can efficiently generate adversarial examples, and outperforms the state-of-the-art attack methods. Moreover, Adv-watermark is more robust against image transformation defense methods.

Abstract:
In this work, we aim to learn an unpaired image enhancement model, which can enrich low-quality images with the characteristics of high-quality images provided by users. We propose a quality attention generative adversarial network (QAGAN) trained on unpaired data based on the bidirectional Generative Adversarial Network (GAN) embedded with a quality attention module (QAM). The key novelty of the proposed QAGAN lies in the injected QAM for the generator such that it learns domain-relevant quality attention directly from the two domains. More specifically, the proposed QAM allows the generator to effectively select semantic-related characteristics along the spatial dimension and adaptively incorporate style-related attributes along the channel dimension. Therefore, in our proposed QAGAN, not only the discriminators but also the generator can directly access both domains, which significantly helps the generator learn the mapping function. Extensive experimental results show that, compared with the state-of-the-art methods based on unpaired learning, our proposed method achieves better performance in both objective and subjective evaluations.

Abstract:
A key to automatically generating vivid talking faces is to synthesize identity-preserving natural facial expressions beyond audio-lip synchronization, which usually requires disentangling the informative features from multiple modalities and then fusing them together. In this paper, we propose an end-to-end Expression-Tailored Generative Adversarial Network (ET-GAN) to generate an expression-enriched talking face video of arbitrary identity. Different from talking face generation based on an identity image and audio, an expressional video of arbitrary identity serves as the expression source in our approach. An expression encoder is proposed to disentangle the expression-tailored representation from the guiding expressional video, while an audio encoder disentangles the audio-lip representation. Instead of using a single image as identity input, a multi-image identity encoder is proposed that learns different views of faces and merges them into a unified representation. Multiple discriminators are exploited to keep both image-aware and video-aware realistic details, including a spatial-temporal discriminator for visual continuity of expression synthesis and facial movements. We conduct extensive experimental evaluations on quantitative metrics, expression retention quality and audio-visual synchronization. The results show the effectiveness of our ET-GAN in generating high-quality expressional talking face videos compared with existing state-of-the-art methods.

Abstract:
Salient object detection (SOD) is a crucial and preliminary task for many computer vision applications, and it has made progress with deep CNNs. Most of the existing methods mainly rely on the RGB information to distinguish the salient objects, which faces difficulties in some complex scenarios. To solve this, many recent RGBD-based networks have been proposed that adopt the depth map as an independent input and fuse its features with RGB information. Taking the advantages of RGB and RGBD methods, we propose a novel depth-aware salient object detection framework, which has the following designs: 1) It does not rely on depth data in the testing phase. 2) It comprehensively optimizes SOD features with multi-level depth-aware regularizations. 3) The depth information also serves as an error-weighted map to correct the segmentation process. With these designs combined, we make the first attempt at realizing a unified depth-aware framework with only RGB information as input for inference, which not only surpasses the state-of-the-art performance on five public RGB SOD benchmarks, but also surpasses the RGBD-based methods on five benchmarks by a large margin, while using less information and remaining lightweight to implement.

Abstract:
Video object detection is a tough task due to the deteriorated quality of video sequences captured under complex environments. Currently, this area is dominated by a series of feature enhancement based methods, which distill beneficial semantic information from multiple frames and generate enhanced features through fusing the distilled information. However, the distillation and fusion operations are usually performed at either frame level or instance level with external guidance using additional information, such as optical flow and feature memory. In this work, we propose a dual semantic fusion network (abbreviated as DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance. Moreover, we introduce a geometric similarity measure into the fusion process to alleviate the influence of information distortion caused by noise. As a result, the proposed DSFNet can generate more robust features through the multi-granularity fusion and avoid being affected by the instability of external guidance. To evaluate the proposed DSFNet, we conduct extensive experiments on the ImageNet VID dataset. Notably, the proposed dual semantic fusion network achieves, to the best of our knowledge, the best performance of 84.1% mAP among the current state-of-the-art video object detectors with ResNet-101 and 85.4% mAP with ResNeXt-101 without using any post-processing steps.

Abstract:
With the rapid development of facial manipulation techniques, face forgery has received considerable attention in multimedia and computer vision community due to security concerns. Existing methods are mostly designed for single-frame detection trained with precise image-level labels or for video-level prediction by only modeling the inter-frame inconsistency, leaving potential high risks for DeepFake attackers. In this paper, we introduce a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated. We address this problem by multiple instance learning framework, treating faces and input video as instances and bag respectively. A sharp MIL (S-MIL) is proposed which builds direct mapping from instance embeddings to bag prediction, rather than from instance embeddings to instance prediction and then to bag prediction in traditional MIL. Theoretical analysis proves that the gradient vanishing in traditional MIL is relieved in S-MIL. To generate instances that can accurately incorporate the partially manipulated faces, spatial-temporal encoded instance is designed to fully model the intra-frame and inter-frame inconsistency, which further helps to promote the detection performance. We also construct a new dataset FFPMS for partially attacked DeepFake video detection, which can benefit the evaluation of different methods at both frame and video levels. Experiments on FFPMS and the widely used DFDC dataset verify that S-MIL is superior to other counterparts for partially attacked DeepFake video detection. In addition, S-MIL can also be adapted to traditional DeepFake image detection tasks and achieve state-of-the-art performance on single-frame datasets.

Abstract:
Specular highlight detection is a challenging problem, and has many applications such as shiny object detection and light source estimation. Although various highlight detection methods have been proposed, they fail to disambiguate bright material surfaces from highlights, and cannot handle non-white-balanced images. Moreover, at present, there is still no benchmark dataset for highlight detection. In this paper, we present a large-scale real-world highlight dataset containing a rich variety of material categories, with diverse highlight shapes and appearances, in which each image is with an annotated ground-truth mask. Based on the dataset, we develop a deep learning-based specular highlight detection network (SHDNet) leveraging multi-scale context contrasted features to accurately detect specular highlights of varying scales. In addition, we design a binary cross-entropy (BCE) loss and an intersection-over-union edge (IoUE) loss for our network. Compared with existing highlight detection methods, our method can accurately detect highlights of different sizes, while effectively excluding the non-highlight regions, such as bright materials, non-specular as well as colored lighting, and even light sources.
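As a rough illustration of the two kinds of loss terms named above, the sketch below computes a standard binary cross-entropy and a plain soft IoU loss over flattened mask probabilities. It is a minimal sketch under stated assumptions: the abstract does not define the edge weighting of the IoUE loss, so that part is omitted, and the function names `bce_loss` and `soft_iou_loss` are hypothetical.

```python
import math


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(pred)


def soft_iou_loss(pred, target, eps=1e-7):
    """1 - soft IoU; a plain IoU term, not the paper's edge-aware IoUE loss."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (inter + eps) / (union + eps)


# Hypothetical flattened 2x2 mask: two highlight pixels, two background pixels.
prediction = [0.9, 0.2, 0.8, 0.1]
ground_truth = [1, 0, 1, 0]
combined = bce_loss(prediction, ground_truth) + soft_iou_loss(prediction, ground_truth)
```

Summing the two terms with equal weight is also an assumption; the abstract does not state how the losses are combined.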

Abstract:
Video super-resolution (SR) aims at generating high-resolution (HR) frames from consecutive low-resolution (LR) frames. The challenge is how to make use of temporal coherence among neighbouring LR frames. Most previous works use motion estimation and compensation based models. However, their performance relies heavily on motion estimation accuracy. In this paper, we propose a multi-scale pyramid 3D convolutional (MP3D) network for video SR, where 3D convolution can explore temporal correlation directly without explicit motion compensation. Specifically, we first apply 3D convolution into a pyramid subnet to extract multi-scale spatial and temporal features simultaneously from the LR frames, such that it can handle various sizes of motions. We then feed the fused feature maps into an SR reconstruction subnet, where a 3D sub-pixel convolution layer is used for up-sampling. Finally, we append a detail refinement subnet based on the encoder-decoder structure to further enhance texture details of the reconstructed HR frames. Extensive experiments on benchmark datasets and real-world cases show that the proposed MP3D model outperforms state-of-the-art video SR methods in terms of PSNR/SSIM values, visual quality and temporal consistency, respectively.

Abstract:
Generative Adversarial Networks (GANs) have been employed for face super resolution but they bring distorted facial details easily and still have weakness on recovering realistic texture. To further improve the performance of GAN-based models on super-resolving face images, we propose PCA-SRGAN which pays attention to the cumulative discrimination in the orthogonal projection space spanned by PCA projection matrix of face data. By feeding the principal component projections ranging from structure to details into the discriminator, the discrimination difficulty will be greatly alleviated and the generator can be enhanced to reconstruct clearer contour and finer texture, helpful to achieve the high perception and low distortion eventually. This incremental orthogonal projection discrimination has ensured a precise optimization procedure from coarse to fine and avoids the dependence on the perceptual regularization. We conduct experiments on CelebA and FFHQ face datasets. The qualitative visual effect and quantitative evaluation have demonstrated the overwhelming performance of our model over related works.

Abstract:
Grounding objects in visual context from natural language queries is a crucial yet challenging vision-and-language task, which has gained increasing attention in recent years. Existing work has primarily investigated this task in the context of still images. Despite their effectiveness, these methods cannot be directly migrated into the video context, mainly due to 1) the complex spatio-temporal structure of videos and 2) the scarcity of fine-grained annotations of videos. To effectively ground objects in videos is profoundly more challenging and less explored.

Abstract:
Video scene detection is the task of dividing videos into temporal semantic chapters. This is an important preliminary step before attempting to analyze heterogeneous video content. Recently, Optimal Sequential Grouping (OSG) was proposed as a powerful unsupervised solution to solve a formulation of the video scene detection problem. In this work, we extend the capabilities of OSG to the learning regime. By giving the capability to both learn from examples and leverage a robust optimization formulation, we can boost performance and enhance the versatility of the technology. We present a comprehensive analysis of incorporating OSG into deep learning neural networks under various configurations. These configurations include learning an embedding in a straight-forward manner, a tailored loss designed to guide the solution of OSG, and an integrated model where the learning is performed through the OSG pipeline. With thorough evaluation and analysis, we assess the benefits and behavior of the various configurations, and show that our learnable OSG approach exhibits desirable behavior and enhanced performance compared to the state of the art.

Abstract:
Greedy-NMS inherently raises a dilemma, where a lower NMS threshold will potentially lead to a lower recall rate and a higher threshold introduces more false positives. This problem is more severe in pedestrian detection because the instance density varies more intensively. However, previous works on NMS either ignore or only vaguely consider the existence of nearby pedestrians. Thus, we propose \heatmapname (\heatmapnameshort), which pinpoints the objects nearby each proposal with a Gaussian distribution, together with \nmsname, which dynamically eases the suppression for the space that might contain other objects with a high likelihood. Compared to Greedy-NMS, our method, as the state-of-the-art, improves by 3.9% AP, 5.1% Recall, and 0.8% MR⁻² on CrowdHuman, reaching 89.0% AP, 92.9% Recall, and 43.9% MR⁻², respectively.
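The threshold dilemma described above can be made concrete with a minimal sketch of standard Greedy-NMS (the baseline, not the method proposed in the abstract). Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the example boxes and scores are made up for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, threshold):
    """Return indices of kept boxes, visiting boxes from highest to lowest score."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any kept box above the threshold.
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping pedestrians (IoU of about 0.43) and one isolated pedestrian.
boxes = [(0, 0, 10, 20), (4, 0, 14, 20), (50, 0, 60, 20)]
scores = [0.9, 0.8, 0.7]
low = greedy_nms(boxes, scores, 0.3)   # suppresses the second pedestrian
high = greedy_nms(boxes, scores, 0.5)  # keeps all three boxes
```

With a threshold of 0.3 the second, genuinely distinct pedestrian is suppressed (lower recall), while a threshold of 0.5 keeps it but would equally keep a duplicate detection with the same overlap.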

Abstract:
Weakly Supervised Object Localization (WSOL) aims to learn object locations in a given image while only using image-level annotations. To highlight whole object regions instead of only the discriminative parts, previous works often attempt to train a classification model for both classification and localization tasks. However, it is hard to achieve a good tradeoff between the two tasks if only classification labels are employed for training a single classification model. In addition, all recent works perform localization based only on the last convolutional layer of the classification model, ignoring the localization ability of other layers. In this work, we propose an offline framework, called the Dual-Gradients Localization (DGL) framework, to achieve precise localization on any convolutional layer of a classification model by exploiting two kinds of gradients. The DGL framework is built on two branches: 1) Pixel-level Class Selection, leveraging gradients of the target class to identify the correlation ratio of pixels to the target class within any convolutional feature maps, and 2) Class-aware Enhanced Maps, utilizing gradients of the classification loss function to mine entire target object regions, which would not damage classification performance. Extensive experiments on the public ILSVRC and CUB-200-2011 datasets show the effectiveness of the proposed DGL framework. In particular, our DGL obtains a new state-of-the-art Top-1 localization error of 43.55% on the ILSVRC benchmark.

Abstract:
Compressed video action recognition has drawn growing attention for the storage and processing advantages of compressed videos over original raw videos. While the past few years have witnessed remarkable progress in this problem, most existing approaches rely on RGB frames from raw videos and require multi-step training. In this paper, we propose a novel Slow-I-Fast-P (SIFP) neural network model for compressed video action recognition. It consists of the slow I pathway receiving a sparse sampling I-frame clip and the fast P pathway receiving a dense sampling pseudo optical flow clip. An unsupervised estimation method and a new loss function are designed to generate pseudo optical flows in compressed videos. Our model eliminates the dependence on the traditional optical flows calculated from raw videos. The model is trained in an end-to-end way. The proposed method is evaluated on the challenging HMDB51 and UCF101 datasets. The extensive comparison results and ablation studies demonstrate the effectiveness and strength of the proposed method.

Abstract:
Though recent methods on semi-supervised video object segmentation (VOS) have achieved an appreciable improvement of segmentation accuracy, it is still hard to get an adequate speed-accuracy balance when facing real-world application scenarios. In this work, we propose Discriminative Matching for real-time Video Object Segmentation (DMVOS), a real-time VOS framework with high-accuracy to fill this gap. Based on the matching mechanism, our framework introduces discriminative information through the Isometric Correlation module and the Instance Center Offset module. Specifically, the isometric correlation module learns a pixel-level similarity map with semantic discriminability, and the instance center offset module is applied to exploit the instance-level spatial discriminability. Experiments on two benchmark datasets show that our model achieves state-of-the-art performance with extremely fast speed, for example, J&F of 87.8% on DAVIS-2016 validation set with 35 milliseconds per frame.

Abstract:
In the weakly supervised segmentation task with only image-level labels, a common step in many existing algorithms is first to locate the image regions corresponding to each existing class with the Class Activation Maps (CAMs), and then generate the pseudo ground truth masks based on the CAMs to train a segmentation network in the fully supervised manner. The quality of the CAMs has a crucial impact on the performance of the segmentation model. We propose to improve the CAMs from a novel graph perspective. We model paired images containing common classes with a bipartite graph and use the maximum matching algorithm to locate corresponding areas in two images. The matching areas are then used to refine the predicted object regions in the CAMs. The experiments on Pascal VOC 2012 dataset show that our network can effectively boost the performance of the baseline model and achieves new state-of-the-art performance.
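To illustrate the maximum matching step mentioned above, here is a minimal sketch of maximum bipartite matching using the classic augmenting-path (Kuhn's) algorithm. How the paper builds the graph is not stated in the abstract, so this sketch assumes left nodes are candidate regions in one image, right nodes are candidate regions in the paired image, and an edge means the two regions are similar enough to correspond. The function name and example graph are hypothetical.

```python
def maximum_bipartite_matching(adjacency, n_right):
    """Kuhn's augmenting-path algorithm.

    adjacency[u] lists the right-side nodes that left node u may be matched to.
    Returns (matching size, match_right), where match_right[v] is the left node
    matched to right node v, or -1 if v is unmatched.
    """
    match_right = [-1] * n_right

    def try_augment(u, visited):
        for v in adjacency[u]:
            if v in visited:
                continue
            visited.add(v)
            # Take v if it is free, or if its current partner can be re-matched.
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    size = 0
    for u in range(len(adjacency)):
        if try_augment(u, set()):
            size += 1
    return size, match_right


# Hypothetical example: 3 regions in image A, 3 regions in image B.
adjacency = [[0, 1], [0], [1, 2]]
size, match_right = maximum_bipartite_matching(adjacency, 3)
```

In this example every region finds a partner: region 1 can only take right node 0, so region 0 is re-routed to right node 1 and region 2 takes right node 2.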

Abstract:
Face detection is a hot topic in computer vision. Face detection methods usually consist of two subtasks, i.e., the classification subtask and the regression subtask, which are trained with different samples. However, current face detection knowledge distillation methods usually couple the two subtasks, and use the same set of samples in the distillation task. In this paper, we propose a task decoupled knowledge distillation method, which decouples the detection distillation task into two subtasks and uses different samples in distilling the features of different subtasks. We first propose a feature decoupling method to decouple the classification features and the regression features, without introducing any extra calculations at inference time. Specifically, we generate the corresponding features by adding task-specific convolutions in the teacher network and adding adaption convolutions on the feature maps of the student network. Then we select different samples for different subtasks to imitate. Moreover, we also propose an effective probability distillation method to jointly boost the accuracy of the student network. We apply our distillation method on a lightweight face detector, EagleEye. Experimental results show that the proposed method effectively improves the student detector's accuracy by 5.1%, 5.1%, and 2.8% AP in Easy, Medium, Hard subsets respectively.

Abstract:
We propose a self-supervised method to learn feature representations from videos. A standard approach in traditional self-supervised methods uses positive-negative data pairs to train with a contrastive learning strategy. In such a case, different modalities of the same video are treated as positives and video clips from a different video are treated as negatives. Because the spatio-temporal information is important for video representation, we extend the negative samples by introducing intra-negative samples, which are transformed from the same anchor video by breaking temporal relations in video clips. With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn video representations. There are many flexible options in our IIC framework and we conduct experiments by using several different configurations. Evaluations are conducted on video retrieval and video recognition tasks using the learned video representation. Our proposed IIC outperforms current state-of-the-art results by a large margin, with improvements of 16.7 and 9.5 percentage points in top-1 accuracy on the UCF101 and HMDB51 datasets for video retrieval, respectively. For video recognition, improvements can also be obtained on these two benchmark datasets.
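One way to picture an intra-negative sample is a clip whose frames come from the anchor video but whose temporal order has been destroyed. The sketch below assumes frame shuffling as the transformation; the abstract only says that temporal relations are broken, so shuffling is an illustrative choice, and `make_intra_negative` is a hypothetical name.

```python
import random


def make_intra_negative(clip, rng=None):
    """Return a copy of `clip` (a list of frames) in a shuffled order.

    The frames are unchanged; only their temporal order differs from the anchor.
    """
    if len(clip) < 2:
        raise ValueError("need at least two frames to break temporal order")
    rng = rng or random.Random()
    original = list(range(len(clip)))
    shuffled = original[:]
    # Re-shuffle until the order actually differs from the original.
    while shuffled == original:
        rng.shuffle(shuffled)
    return [clip[i] for i in shuffled]


# Frames are represented by their indices here; real frames would be image arrays.
anchor_clip = list(range(8))
intra_negative = make_intra_negative(anchor_clip, random.Random(0))
```

The anchor clip and its intra-negative share the same appearance content, so a model can only tell them apart by attending to temporal structure.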

Abstract:
With the rapid growth of Internet media, content tagging has become an important topic with many multimedia understanding applications, including efficient organisation and search. Nevertheless, existing visual tagging approaches are susceptible to inherent privacy risks in which private information may be exposed unintentionally. The use of anonymisation and privacy-protection methods is desirable, but comes at the expense of task performance. Therefore, this paper proposes an end-to-end framework (SGTN) using Graph Transformer and Convolutional Networks to significantly improve classification and privacy preservation of visual data. In particular, we employ several mechanisms such as differential privacy based graph construction and noise-induced graph transformation to protect the privacy of knowledge graphs. Our approach achieves a new state of the art on the MS-COCO dataset in various semi-supervised settings. In addition, we showcase a real experiment in the education domain to address the automation of sensitive document tagging. Experimental results show that our approach achieves an excellent balance of model accuracy and privacy preservation on both public and private datasets.

Abstract:
The scene graph which can be represented by a set of visual triples is composed of objects and the relations between object pairs. It is vital for image captioning, visual question answering, and many other applications. However, there is a long tail distribution on the scene graph dataset, and the tail relation cannot be accurately identified due to the lack of training samples. The problem of the nonstandard label and feature overlap on the scene graph affects the extraction of discriminative features and exacerbates the effect of data imbalance on the model. For these reasons, we propose a novel scene graph generation model that can effectively improve the detection of low-frequency relations. We use the method of memory features to realize the transfer of high-frequency relation features to low-frequency relation features. Extensive experiments on scene graph datasets show that our model significantly improved the performance of two evaluation metrics R@K and mR@K compared with state-of-the-art baselines.

Abstract:
Zero-shot learning (ZSL) is to classify images according to detailed attribute annotations into new categories that are unseen during the training stage. Generalized zero-shot learning (GZSL) adds seen categories to the test samples. Since the learned classifier has inherent bias against seen categories, GZSL is more challenging than traditional ZSL. However, at present, there is no detailed attribute description dataset for video classification. Therefore, the current zero-shot video classification problem is based on the synthesis of generative adversarial networks trained on seen-class features into unseen-class features for ZSL classification. In order to solve this problem, we propose a description text dataset based on the UCF101 action recognition dataset. To the best of our knowledge, this is the first work to add description of the classes to zero-shot video classification. We propose a new loss function that combines visual features with textual features. We extract text features from the proposed text data set, and constrain the process of generating synthetic features based on the principle that videos with similar text types should be similar. Our method reapplies the traditional zero-shot learning idea to video classification. From the experimental point of view, our proposed dataset and method have a positive impact on the generalized zero-shot video classification.

Abstract:
Creative rhythmic transformations of musical audio refer to automated methods for manipulation of temporally-relevant sounds in time. This paper presents a method for joint synthesis and rhythm transformation of drum sounds through the use of adversarial autoencoders (AAE). Users may navigate both the timbre and rhythm of drum patterns in audio recordings through expressive control over a low-dimensional latent space. The model is based on an AAE with Gaussian mixture latent distributions that introduce rhythmic pattern conditioning to represent a wide variety of drum performances. The AAE is trained on a dataset of bar-length segments of percussion recordings, along with their clustered rhythmic pattern labels. The decoder is conditioned during adversarial training for mixing of data-driven rhythmic and timbral properties. The system is trained with over 500000 bars from 5418 tracks in popular datasets covering various musical genres. In an evaluation using real percussion recordings, the reconstruction accuracy and latent space interpolation between drum performances are investigated for audio generation conditioned by target rhythmic patterns.

Abstract:
In the field of multimedia, single image deraining is a basic pre-processing work, which can greatly improve the visual effect of subsequent high-level tasks in rainy conditions. In this paper, we propose an effective algorithm, called JDNet, to solve the single image deraining problem and conduct the segmentation and detection task for applications. Specifically, considering the important information on multi-scale features, we propose a Scale-Aggregation module to learn the features with different scales. Simultaneously, Self-Attention module is introduced to match or outperform their convolutional counterparts, which allows the feature aggregation to adapt to each channel. Furthermore, to improve the basic convolutional feature transformation process of Convolutional Neural Networks (CNNs), Self-Calibrated convolution is applied to build long-range spatial and inter-channel dependencies around each spatial location that explicitly expand fields-of-view of each convolutional layer through internal communications and hence enriches the output features. By designing the Scale-Aggregation and Self-Attention modules with Self-Calibrated convolution skillfully, the proposed model has better deraining results both on real-world and synthetic datasets. Extensive experiments are conducted to demonstrate the superiority of our method compared with state-of-the-art methods. The source code will be available at https://supercong94.wixsite.com/supercong94.

Abstract:
Although various stereo matching methods have been studied for many years, accurate high-fidelity 3D reconstruction from multiview stereo is still challenging due to the surface inconsistency caused by various factors such as specular illumination. In this paper, we propose an accurate PatchMatch based multiview stereo matching method with a quadric support window that efficiently captures the surface of a complex structured object. Our method makes three novel contributions. Firstly, delicate surface configurations are used for representing the complex structure of an object. By using a general 3D quadric function, the structured object surfaces can be estimated more accurately. In addition, an illumination robust framework is proposed, where the patch dissimilarities are precisely measured with disentangled representation. The matching cost is defined based on disentangled measurements of the object photometric and geometric properties, balancing the pixel intensities between images in a way that is robust to illumination. Lastly, a multiview propagation method is proposed to confirm shape consistency among views. Through the disparity refinement to unify plane parameters of the views, the object surface is estimated from a global perspective. Consequently, the dense and smooth 3D shape of the object is reconstructed accurately. We evaluate our proposed method on the Middlebury stereo set and conduct comprehensive experiments on facial images. Both quantitative and qualitative results demonstrate that the proposed method shows significant improvements over state-of-the-art methods.

Abstract:
Micro-expressions (MEs) are important clues for reflecting the real feelings of humans, and micro-expression recognition (MER) can thus be applied in various real-world applications. However, it is difficult to perceive and interpret MEs correctly. With the advance of deep learning technologies, the accuracy of micro-expression recognition is improved but still limited by the lack of large-scale datasets. In this paper, we propose a novel micro-expression recognition approach by combining Action Units (AUs) and emotion category labels. Specifically, based on facial muscle movements, we model different AUs based on relational information and integrate the AUs recognition task with MER. Besides, to overcome the shortcomings of limited and imbalanced training samples, we propose a data augmentation method that can generate nearly indistinguishable image sequences with AU intensity of real-world micro-expression images, which effectively improve the performance and are compatible with other micro-expression recognition methods. Experimental results on three mainstream micro-expression datasets, i.e., CASME II, SAMM, and SMIC, manifest that our approach outperforms other state-of-the-art methods on both single database and cross-database micro-expression recognition.

Abstract:
Automatic emotion recognition is an active research topic with a wide range of applications. Due to the high manual annotation cost and inevitable label ambiguity, the development of emotion recognition datasets is limited in both scale and quality. Therefore, one of the key challenges is how to build effective models with limited data resources. Previous works have explored different approaches to tackle this challenge, including data enhancement, transfer learning, and semi-supervised learning. However, the weaknesses of these existing approaches include training instability, large performance loss during transfer, or marginal improvement. In this work, we propose a novel semi-supervised multi-modal emotion recognition model based on cross-modality distribution matching, which leverages abundant unlabeled data to enhance the model training under the assumption that the inner emotional status is consistent at the utterance level across modalities. We conduct extensive experiments to evaluate the proposed model on two benchmark datasets, IEMOCAP and MELD. The experimental results show that the proposed semi-supervised learning model can effectively utilize unlabeled data and combine multiple modalities to boost the emotion recognition performance, and that it outperforms other state-of-the-art approaches under the same condition. The proposed model also achieves competitive performance compared with existing approaches which take advantage of additional auxiliary information such as speaker and interaction context.

Abstract:
Emotions play a critical role in our everyday lives by altering how we perceive, process and respond to our environment. Affective computing aims to instill in computers the ability to detect and act on the emotions of users. A core aspect of any affective computing system is the classification of a user's emotion. In this study we present a novel methodology for classifying emotion in a conversation. At the backbone of our proposed methodology is a pre-trained Language Model (LM), which is supplemented by a Graph Convolutional Network (GCN) that propagates information over the predicate-argument structure identified in an utterance. We apply our proposed methodology on the IEMOCAP and Friends data sets, achieving state-of-the-art performance on the former and a higher accuracy on certain emotional labels on the latter. Furthermore, we examine the role context plays in our methodology by altering how much of the preceding conversation the model has access to when making a classification.

Abstract:
Multimodal facial action unit (AU) recognition aims to build models that are capable of processing, correlating, and integrating information from multiple modalities (i.e., 2D images from a visual sensor, 3D geometry from 3D imaging, and thermal images from an infrared sensor). Although multimodal data can provide rich information, there are two challenges that have to be addressed when learning from multimodal data: 1) the model must capture the complex cross-modal interactions in order to utilize the additional and mutual information effectively; 2) the model must be robust to unexpected data corruption during testing, for example when a certain modality is missing or noisy. In this paper, we propose a novel Adaptive Multimodal Fusion method (AMF) for AU detection, which learns to select the most relevant feature representations from different modalities by a re-sampling procedure conditioned on a feature scoring module. The feature scoring module is designed to allow for evaluating the quality of features learned from multiple modalities. As a result, AMF is able to adaptively select more discriminative features, thus increasing the robustness to missing or corrupted modalities. In addition, to alleviate the over-fitting problem and make the model generalize better on the testing data, a cut-switch multimodal data augmentation method is designed, by which a random block is cut and switched across multiple modalities. We have conducted a thorough investigation on two public multimodal AU datasets, BP4D and BP4D+, and the results demonstrate the effectiveness of the proposed method. Ablation studies on various circumstances also show that our method remains robust to missing or noisy modalities during tests.
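
The cut-switch augmentation mentioned above can be illustrated with a minimal sketch. The abstract only states that a random block is cut and switched across modalities, so everything else here is an assumption made for illustration: the function name `cut_switch`, the restriction to two modalities, the rectangular block, and the representation of each modality as a 2D list of equal size.

```python
import random


def cut_switch(modality_a, modality_b, block_h, block_w, rng):
    """Swap one random rectangular block between two aligned modalities.

    Hypothetical helper: the inputs are 2D lists of identical shape and are
    left untouched; augmented copies are returned.
    """
    height, width = len(modality_a), len(modality_a[0])
    # Pick the top-left corner so that the block fits inside the image.
    top = rng.randrange(height - block_h + 1)
    left = rng.randrange(width - block_w + 1)
    out_a = [row[:] for row in modality_a]
    out_b = [row[:] for row in modality_b]
    for r in range(top, top + block_h):
        for c in range(left, left + block_w):
            out_a[r][c], out_b[r][c] = out_b[r][c], out_a[r][c]
    return out_a, out_b


visual = [[0] * 4 for _ in range(4)]   # stand-in for a 2D image
thermal = [[1] * 4 for _ in range(4)]  # stand-in for a thermal image
aug_visual, aug_thermal = cut_switch(visual, thermal, 2, 2, random.Random(0))
```

After the call, `aug_visual` contains a 2x2 patch taken from `thermal` and vice versa, while the original inputs are unchanged.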

Abstract:
Gait recognition, one of the most important and effective biometric technologies, has a significant advantage in long-distance recognition systems. Among existing gait recognition methods, the template-based approaches may lose temporal information, while the sequence-based methods cannot fully exploit the temporal relations within the sequence. To address these issues, we propose a novel multiple-temporal-scale gait recognition framework which integrates the temporal information in multiple temporal scales, making use of both the frame and interval fusion information. Moreover, the interval-level representation is realized by a local transformation module. Concretely, a 3D convolutional neural network (3D CNN) is applied in both the small and the large temporal scales to extract the spatial-temporal information. Moreover, a frame pooling method is developed to address the mismatch between the input of the 3D network and the video frames, and a novel 3D basic network block is designed to improve efficiency. Experiments demonstrate that the multiple-temporal-scale 3D CNN based gait recognition method can achieve better performance than most recent state-of-the-art methods on the CASIA-B dataset. The proposed method obtains a rank-1 accuracy of 96.7% under the normal condition, and outperforms other methods on average accuracy by at least 5.8% and 11.1%, respectively, in complex scenarios.

Abstract:
Finding a suitable data representation for a specific task has been shown to be crucial in many applications. The success of subspace clustering depends on the assumption that the data can be separated into different subspaces. However, this simple assumption does not always hold since the raw data might not be separable into subspaces. To recover the "clustering-friendly" representation and facilitate the subsequent clustering, we propose a graph filtering approach by which a smooth representation is achieved. Specifically, it injects graph similarity into data features by applying a low-pass filter to extract useful data representations for clustering. Extensive experiments on image and document clustering datasets demonstrate that our method improves upon state-of-the-art subspace clustering techniques. Especially, its comparable performance with deep learning methods emphasizes the effectiveness of the simple graph filtering scheme for many real-world applications. An ablation study shows that graph filtering can remove noise, preserve structure in the image, and increase the separability of classes.
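
The low-pass graph filtering step described above can be sketched as follows. The abstract does not give the filter itself, so this sketch assumes one common choice, X' = (I - L/2)^k X with L the symmetric normalized graph Laplacian; the function names and the parameter `k` are illustrative, not taken from the paper.

```python
import math


def normalized_laplacian(adj):
    """Symmetric normalized Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    lap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j and deg[i] > 0:
                lap[i][j] = 1.0
            if adj[i][j] and deg[i] > 0 and deg[j] > 0:
                lap[i][j] -= adj[i][j] / math.sqrt(deg[i] * deg[j])
    return lap


def low_pass_filter(adj, features, k=1):
    """Apply the assumed filter (I - L/2) to the feature rows k times."""
    n = len(adj)
    lap = normalized_laplacian(adj)
    h = [[(1.0 if i == j else 0.0) - 0.5 * lap[i][j] for j in range(n)]
         for i in range(n)]
    out = [row[:] for row in features]
    dims = len(features[0])
    for _ in range(k):
        out = [[sum(h[i][m] * out[m][d] for m in range(n)) for d in range(dims)]
               for i in range(n)]
    return out


# Two disconnected pairs of nodes: {0, 1} and {2, 3}.
adj = [[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]]
features = [[0.0], [1.0], [10.0], [11.0]]
smoothed = low_pass_filter(adj, features, k=1)
```

In this toy example, connected nodes are pulled to a shared value while the two groups stay far apart, which is the sense in which the filtered representation is more "clustering-friendly".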

Abstract:
As we use our hands frequently in daily activities, the analysis of hand-object interactions plays a critical role in many multimedia understanding and interaction applications. Different from conventional 3D hand-only and object-only pose estimation, estimating 3D hand-object pose is more challenging due to the mutual occlusions between hand and object, as well as the physical constraints between them. To overcome these issues, we propose to fully utilize the structural correlations among hand joints and object corners in order to obtain more reliable poses. Our work is inspired by structured output learning models in the sequence transduction field, such as the Transformer encoder-decoder framework. Besides modeling inherent dependencies from extracted 2D hand-object pose, our proposed Hand-Object Transformer Network (HOT-Net) also captures the structural correlations among 3D hand joints and object corners. Similar to Transformer's autoregressive decoder, by considering structured output patterns, this helps better constrain the output space and leads to more robust pose estimation. However, different from Transformer's sequential modeling mechanism, HOT-Net adopts a novel non-autoregressive decoding strategy for 3D hand-object pose estimation. Specifically, our model removes the Transformer's dependence on previously generated results and explicitly feeds a reference 3D hand-object pose into the decoding process to provide equivalent target pose patterns for localizing each 3D keypoint in parallel. To further improve the physical validity of the estimated hand pose, besides anatomical constraints, we propose a cooperative pose constraint, aiming to enable the hand pose to cooperate with hand shape, to generate hand mesh. We demonstrate real-time speed and state-of-the-art performance on benchmark hand-object datasets for both 3D hand and object poses.

Abstract:
Although General Face Recognition (GFR) research has achieved great success, Age-Invariant Face Recognition (AIFR) is still a challenging problem since facial appearance changing over time brings significant intra-class variations. The existing discriminative methods for the AIFR task mostly focus on decomposing the facial feature from a single image into an age-related feature and an age-independent feature for recognition, which suffers from the loss of facial identity information. To address this issue, in this work we propose a novel Multi-Features Fusion and Decomposition (MFFD) framework to learn more discriminative feature representations and alleviate the intra-class variations for AIFR. Specifically, we first sample multiple face images of different ages with the same identity as a face time series. Next, we combine feature decomposition with fusion based on the face time series to ensure that the final age-independent features effectively represent the identity information of the face and have stronger robustness against aging. Moreover, we also present two feature fusion methods and several different training strategies to explore the impact on the model. Extensive experiments on several cross-age datasets (CACD, CACD-VS) demonstrate the effectiveness of our proposed method. Besides, our method also shows comparable generalization performance on the well-known LFW dataset.

Abstract:
Image-to-image translation aims to learn a mapping that transforms an image from one visual domain to another. Recent works assume that image descriptors can be disentangled into a domain-invariant content representation and a domain-specific style representation. Thus, translation models seek to preserve the content of source images while changing the style to a target visual domain. However, synthesizing new images is extremely challenging, especially in multi-domain translations, as the network has to compose content and style to generate reliable and diverse images in multiple domains. In this paper we propose the use of an image retrieval system to assist the image-to-image translation task. First, we train an image-to-image translation model to map images to multiple domains. Then, we train an image retrieval model using real and generated images to find images similar to a query one in content but in a different domain. Finally, we exploit the image retrieval system to fine-tune the image-to-image translation model and generate higher quality images. Our experiments show the effectiveness of the proposed solution and highlight the contribution of the retrieval network, which can benefit from additional unlabeled data and help image-to-image translation models in the presence of scarce data.

Abstract:
Though significant progress has been made in artistic style transfer, most existing methods have difficulty preserving semantic information in a fine-grained, locally consistent manner, especially when multiple artists' styles must be transferred within one single model. To circumvent this issue, we propose a Stroke Control Multi-Artist Style Transfer framework. On the one hand, we design an Anisotropic Stroke Module (ASM) which realizes the dynamic adjustment of style-stroke between the non-trivial and the trivial regions. ASM endows the network with the ability of adaptive semantic-consistency among various styles. On the other hand, we present a novel Multi-Scale Projection Discriminator to realize the texture-level conditional generation. In contrast to the single-scale conditional discriminator, our discriminator is able to capture multi-scale texture clues to effectively distinguish a wide range of artistic styles. Extensive experimental results demonstrate the feasibility and effectiveness of our approach. Our framework can transform a photograph into oil paintings of different artistic styles via only ONE single model. Furthermore, the results have distinctive artistic styles and retain the anisotropic semantic information.

Abstract:
3D shape retrieval has attracted much research attention due to its wide applications in the fields of computer vision and multimedia. Various approaches have been proposed in recent years for learning 3D shape descriptors from different modalities. The existing works have the following disadvantages: 1) the vast majority of methods rely on large-scale training data with clear category information; 2) many approaches focus on the fusion of multi-modal information but ignore the guidance of correlations among different modalities for shape representation learning; 3) many methods pay attention to the structural feature learning of 3D shapes but ignore the guidance of structural similarity between every two shapes. To solve these problems, we propose a novel multi-graph network (MGN) for unsupervised 3D shape retrieval, which utilizes the correlations among modalities and structural similarity between two models to guide the shape representation learning process without category information. More specifically, we propose two novel loss functions: auto-correlation loss and cross-correlation loss. The auto-correlation loss utilizes information from different modalities to increase the discrimination of the shape descriptor. The cross-correlation loss utilizes the structural similarity between two models to strengthen the intra-class similarity and increase the inter-class distinction. Finally, an effective similarity measurement is designed for the shape retrieval task. To validate the effectiveness of our proposed method, we conduct experiments on the ModelNet dataset. Experimental results demonstrate the effectiveness of our proposed method, and significant improvements have been achieved compared with state-of-the-art methods.

Abstract:
Indoor localization is a fundamental problem in location-based applications. Current approaches to this problem typically rely on Radio Frequency technology, which requires not only supporting infrastructures but human efforts to measure and calibrate the signal. Moreover, data collection for all locations is indispensable in existing methods, which in turn hinders their large-scale deployment. In this paper, we propose a novel neural network based architecture Graph Location Networks (GLN) to perform infrastructure-free, multi-view image based indoor localization. GLN makes location predictions based on robust location representations extracted from images through message-passing networks. Furthermore, we introduce a novel zero-shot indoor localization setting and tackle it by extending the proposed GLN to a dedicated zero-shot version, which exploits a novel mechanism Map2Vec to train location-aware embeddings and make predictions on novel unseen locations. Our extensive experiments show that the proposed approach outperforms state-of-the-art methods in the standard setting, and achieves promising accuracy even in the zero-shot setting where data for half of the locations are not available. The source code and datasets are publicly available.

Abstract:
Text-based person search aims to retrieve the pedestrian images that best match a given textual description from gallery images. Previous methods utilize the soft-attention mechanism to infer the semantic alignments between the regions of an image and the corresponding words in a sentence. However, these methods may fuse irrelevant multi-modality features together, which causes a matching redundancy problem. In this work, we propose a novel hierarchical Gumbel attention network for text-based person search via a Gumbel top-k re-parameterization algorithm. Specifically, it adaptively selects the strongly semantically relevant image regions and words/phrases from images and texts for precise alignment and similarity calculation. This hard selection strategy is able to fuse the strongly relevant multi-modality features to alleviate the problem of matching redundancy. Meanwhile, a Gumbel top-k re-parameterization algorithm is designed as a low-variance, unbiased gradient estimator to handle the discreteness problem of the hard attention mechanism in an end-to-end manner. Moreover, a hierarchical adaptive matching strategy is employed by the model at three different granularities, i.e., word-level, phrase-level, and sentence-level, towards fine-grained matching. Extensive experimental results demonstrate the state-of-the-art performance. Compared with the best existing method, we achieve 8.24% Rank-1 and 7.6% mAP relative improvements in the text-to-image retrieval task, and 5.58% Rank-1 and 6.3% mAP relative improvements in the image-to-text retrieval task, on the CUHK-PEDES dataset.
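
As a rough illustration of the hard selection idea, the sketch below shows the generic Gumbel top-k sampling trick: perturb each relevance score with independent Gumbel(0, 1) noise and keep the k largest. This covers only the sampling step. The gradient estimator and the hierarchical matching described in the abstract are not reproduced, and the function name and arguments are assumptions made for this example.

```python
import math
import random


def gumbel_top_k(logits, k, rng):
    """Select k distinct indices by Gumbel-perturbed top-k (illustrative)."""
    perturbed = []
    for index, logit in enumerate(logits):
        # Clamp the uniform sample away from 0 and 1 to keep the logs finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((logit + gumbel_noise, index))
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


rng = random.Random(0)
# Hypothetical relevance scores of four image regions for one phrase.
selected = gumbel_top_k([50.0, 50.0, -50.0, -50.0], 2, rng)
uniform_pick = gumbel_top_k([0.0] * 5, 3, rng)
```

With scores this far apart the two high-scoring regions are always selected; with equal scores the selection is a uniformly random subset of size k.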

Abstract:
In multimedia services, the introduction of haptic signals provides a more immersive user experience beyond conventional audio-visual perception. To support synchronous streaming and display of this information, it is imperative to efficiently compress and store the haptic signals, which promotes the development and optimization of haptic codecs. In this paper, we propose an end-to-end haptic codec for high-efficiency, low-delay and perception-lossless compression of the kinesthetic signal, one of two major components of haptic signals. The proposed encoder consists of an amplifier, DCT, quantizer, run-length encoder and entropy encoder, while the decoder includes all counterpart modules of the encoder. In particular, all parameters of these modules are deliberately calibrated for a high compression efficiency of kinesthetic information. We allow a maximal DCT length of 8 samples, in order to guarantee a maximal encoding delay of 7 ms for a popular haptic simulator operating at 1000 Hz. Incorporating the model of perception deadband, the proposed codec is capable of producing a perception-lossless kinesthetic bitstream. Finally, we examine the proposed codec on the standard database of the IEEE P1918.1.1 Haptic Codecs Task Group. Comprehensive experiments reveal that our codec outperforms its rivals with 50% bit rate reduction, improved perception quality and a negligible encoder delay.
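
The relation between the DCT block length and the stated delay can be checked with a short calculation. It assumes, as one reading of the abstract, that the encoding delay is the time the first sample of a block waits until the block is full, excluding computation time; the function name is illustrative.

```python
def block_buffering_delay_ms(block_length, sample_rate_hz):
    """Waiting time of the first sample in a block, in milliseconds.

    Assumption: delay is counted as pure buffering time, i.e. the first
    sample waits for the remaining (block_length - 1) samples to arrive.
    """
    return (block_length - 1) * 1000.0 / sample_rate_hz


delay = block_buffering_delay_ms(block_length=8, sample_rate_hz=1000)
print(delay)
```

Under this assumption, an 8-sample block at 1000 Hz gives a 7 ms delay, matching the figure quoted above.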

Abstract:
In this paper, we address the multi-person densepose estimation problem, which aims at learning dense correspondences between 2D pixels of human body and 3D surface. It still poses several challenges due to real-world scenes with scale variations, occlusion and insufficient annotations. In particular, we address two main problems: 1) how to design a simple yet effective pipeline for densepose estimation; and 2) how to equip this pipeline with the ability of handling the issues of limited annotations and class-imbalanced labels. To tackle these problems, we develop a novel densepose estimation framework based on a two-stage pipeline, called Knowledge Transfer Network (KTN). Unlike existing works which directly propagate the pyramidal base features of regions, we enhance their representation power by a multi-instance decoder (MID). MID can well distinguish the target instance from other interference instances and background. Then, we introduce a knowledge transfer machine (KTM), which improves densepose estimation by utilizing the external commonsense knowledge. Notably, with the help of our knowledge transfer machine (KTM), current densepose estimation systems (either based on RCNN or fully-convolutional frameworks) can be improved in terms of the accuracy of human densepose estimation. Solid experiments on densepose estimation benchmarks demonstrate the superiority and generalizability of our approach. Our code and models will be publicly available.

Abstract:
Grounded video description (GVD) encourages captioning models to attend to appropriate video regions (e.g., objects) dynamically and generate a description. Such a setting can help explain the decisions of captioning models and prevents the model from hallucinating object words in its description. However, such design mainly focuses on object word generation and thus may ignore fine-grained information and suffer from missing visual concepts. Moreover, relational words (e.g., 'jump left or right') are usual spatio-temporal inference results, i.e., these words cannot be grounded on certain spatial regions. To tackle the above limitations, we design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation is designed to explore fine-grained visual concepts. Furthermore, the refined graph can be regarded as relational inductive knowledge to assist captioning models in selecting the relevant information it needs to generate correct words. We validate the effectiveness of our model through automatic metrics and human evaluation, and the results indicate that our approach can generate more fine-grained and accurate description, and it solves the problem of object hallucination to some extent.

Abstract:
Adversarial attacks have been widely recognized as the security vulnerability of deep neural networks, especially in deep automatic speech recognition (ASR) systems. The advanced detection methods against adversarial attacks mainly focus on pre-processing the input audio to alleviate the threat of adversarial noise. Although these methods could detect some simplex adversarial attacks, they fail to handle robust complex attacks especially when the attacker knows the detection details. In this paper, we propose a unified adversarial detection framework for detecting adaptive audio adversarial examples, which combines noise padding with sound reverberation. Specifically, a well-designed adaptive artificial utterances generator is proposed to balance the design complexity, such that the artificial utterances (speech with reverberation) are efficiently determined to reduce the false positive rate and false negative rate of detection results. Moreover, to destroy the continuity of the adversarial noise, we develop a novel multi-noise padding strategy, which implants the Gaussian noises in the silent fragments of the input speech by the voice activity detector. Furthermore, our proposed method can effectively tackle the robust adaptive attacks in an adaptive learning manner. Importantly, the conceived system is easily embedded into any ASR models without requiring additional retraining or modification. The experimental results show that our method consistently outperforms the state-of-the-art audio defense methods, even for the adaptive and robust attacks.

Abstract:
Existing action localization approaches adopt shallow temporal convolutional networks (TCNs) on 1D feature maps extracted from video frames. In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which generally are highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution. To address this issue, we introduce a novel concept-wise temporal convolution (CTC) layer as an alternative to the conventional temporal convolution layer for training deeper action localization networks. Instead of recombining latent concepts, the CTC layer applies a number of temporal filters to each concept separately, with filter parameters shared across concepts. It can thus capture common temporal patterns of different concepts and significantly enrich representation ability. By stacking CTC layers, we propose a deep concept-wise temporal convolutional network (C-TCN), which boosts the state-of-the-art action localization performance on THUMOS'14 from 42.8 to 52.1 in terms of mAP(%), achieving a relative improvement of 21.7%. Favorable results are also obtained on ActivityNet.
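
The concept-wise temporal convolution described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: no padding, stride 1, no bias or nonlinearity, and the function name and the `[channel][filter][time]` output layout are chosen for this example only.

```python
def concept_wise_temporal_conv(features, filters):
    """Apply the same temporal filters to every channel independently.

    features: list of C channels ("concepts"), each a list of T values.
    filters:  list of F kernels, shared across all channels.
    Returns out[c][f][t]; channels are never mixed with each other.
    """
    output = []
    for channel in features:
        per_channel = []
        for kernel in filters:
            width = len(kernel)
            per_channel.append([
                sum(kernel[j] * channel[t + j] for j in range(width))
                for t in range(len(channel) - width + 1)
            ])
        output.append(per_channel)
    return output


features = [[1, 2, 3, 4],   # concept 0 over four time steps
            [0, 0, 1, 0]]   # concept 1 over four time steps
filters = [[1, 1],          # shared smoothing-like filter
           [-1, 1]]         # shared difference filter
out = concept_wise_temporal_conv(features, filters)
```

In contrast, a conventional temporal convolution would sum over all input channels for every output channel, which is the recombination of latent concepts that the abstract argues against.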

Abstract:
Person search by language aims to associate the pedestrian images with free-form natural language descriptions. Although great efforts have been made to align images with sentences, most researchers neglect the difficulty of long-distance dependency modeling in textual encoding, which is very important for solving this problem because the description sentences are always long and have complex structures for distinguishing different pedestrians. In this work, we focus on the long-distance dependencies in a sentence for better textual encoding, and accordingly propose the Textual Dependency Embedding (TDE) method. We first employ the sentence analysis tools to figure out the long-distance syntactic dependencies from a dependent to its governor in a sentence. Then we embed the dependent representations to their governor adaptively in our Governor-guided Dependent Attention Module (GDAM) to model these long-distance relations. After that, we further consider the dependency types, which also tell the importance of different dependents semantically, and embed them together with the dependents' features to clarify their inequivalent contributions to their governor. Extensive experiments and analysis on person search by language and image-text matching have validated the effectiveness of our method, and we have obtained the state-of-the-art performance on the CUHK-PEDES and Flickr30K datasets.

Abstract:
Scene text recognition is the task of recognizing character sequences in images of natural scenes. The considerable diversity in the appearance of text in a scene image and potentially highly complex backgrounds make text recognition challenging. Previous approaches employ character sequence generators to analyze text regions and, subsequently, compare the candidate character sequences against a language model. In this work, we propose a bimodal framework that simultaneously utilizes visual and linguistic information to enhance recognition performance. Our linguistically aware learning (LAL) method effectively learns visual embeddings using a rectifier, encoder, and attention decoder approach, and linguistic embeddings, using a deep next-character prediction model. We present an innovative way of combining these two embeddings effectively. Our experiments on eight standard benchmarks show that our method outperforms previous methods by large margins, particularly on rotated, foreshortened, and curved text. We show that the bimodal approach has a statistically significant impact. We also contribute a new dataset, and show robust performance when LAL is combined with a text detector in a pipelined text spotting framework.

Abstract:
Real-time video enhancement is in great demand due to the extensive usage of live video applications, but existing approaches are far from satisfying the strict requirements of speed and stability. We present a novel convolutional network that can perform high-quality enhancement on 1080p videos at 45 FPS with a single CPU, which has high potential for real-world deployment. The proposed network is designed based on a light-weight image network and further consolidated for temporal consistency with a temporal feature aggregation (TFA) module. Unlike most image translation networks that use decoders to generate target images, our network discards decoders and employs only an encoder and a small head. The network predicts color mapping functions instead of pixel values in a grid-like container which fits the CNN structure well and also advances the enhancement to be scalable to any video resolution. Furthermore, the temporal consistency of the output will be enforced by the TFA module which utilizes the learned temporal coherence of semantics across frames. We also demonstrate that the mapping representation is general to various enhancement tasks, such as relighting, retouching and dehazing, on benchmark datasets. Our approach achieves the state-of-the-art performance and performs about 10 times faster than the current real-time method on high-resolution videos.

Abstract:
Video moment localization aims to localize a specific moment in a video by a natural language query. Previous works either use alignment information to find the best-matching candidate (i.e., top-down approach) or use discrimination information to predict the temporal boundaries of the match (i.e., bottom-up approach). Little research has taken both the candidate-level alignment information and frame-level boundary information together and considered the complementarity between them. In this paper, we propose a unified top-down and bottom-up approach called Dual Path Interaction Network (DPIN), where the alignment and discrimination information are closely connected to jointly make the prediction. Our model includes a boundary prediction pathway encoding the frame-level representation and an alignment pathway extracting the candidate-level representation. The two branches of our network predict two complementary but different representations for moment localization. To enforce the consistency and strengthen the connection between the two representations, we propose a semantically conditioned interaction module. The experimental results on three popular benchmarks (i.e., TACoS, Charades-STA, and Activity-Caption) demonstrate that the proposed approach effectively localizes the relevant moment and outperforms the state-of-the-art approaches.

Abstract:
The diversity of training data significantly affects the tracking robustness of a model in unconstrained environments. However, existing labeled datasets for facial landmark tracking tend to be large but not diverse, and manually annotating massive numbers of clips from new diverse videos is extremely expensive. To address these problems, we propose a Spatial-Temporal Knowledge Integration (STKI) approach. Unlike most existing methods which rely heavily on labeled data, STKI exploits supervision from unlabeled data. Specifically, STKI integrates spatial-temporal knowledge from massive unlabeled videos, which are several orders of magnitude more diverse than existing labeled video data, for robust tracking. Our framework includes a self-supervised tracker and an image-based detector for tracking initialization. To avoid the distortion of facial shape, the tracker leverages adversarial learning to introduce a facial structure prior and temporal knowledge into cycle-consistency tracking. Meanwhile, we design a graph-based knowledge distillation method, which distills the knowledge from tracking and detection results, to improve the generalization of the detector. The fine-tuned detector can provide the tracker with high-quality tracking initialization on unconstrained videos. Extensive experimental results show that the proposed method achieves state-of-the-art performance on comprehensive evaluation datasets.

Abstract:
Referring expression comprehension expects to accurately locate an object described by a language expression, which requires precise language-aware visual object representations. However, existing methods usually use rectangular object representations, such as object proposal regions and grid regions. They ignore some fine-grained object information like shapes and poses, which are often described in language expressions and important to localize objects. Additionally, rectangular object regions usually contain background contents and irrelevant foreground features, which also decrease the localization performance. To address these problems, we propose a language-aware deformable convolution model (LDC) to learn language-aware fine-grained object representations. Rather than extracting rectangular object representations, LDC adaptively samples a set of key points based on the image and language to represent objects. This type of object representations can capture more fine-grained object information (e.g., shapes and poses) and suppress noises in accordance with language and thus, boosts the object localization performance. Based on the language-aware fine-grained object representation, we next design a bidirectional interaction model (BIM) that leverages a modified co-attention mechanism to build cross-modal bidirectional interactions to further improve the language and object representations. Furthermore, we propose a hierarchical fine-grained representation network (HFRN) to learn language-aware fine-grained object representations and cross-modal bidirectional interactions at local word level and global sentence level, respectively. Our proposed method outperforms the state-of-the-art methods on the RefCOCO, RefCOCO+ and RefCOCOg datasets.

Abstract:
When we humans tell a long paragraph about an image, we usually first implicitly compose a mental "script" and then comply with it to generate the paragraph. Inspired by this, we endow the modern encoder-decoder based image paragraph captioning model with such an ability by proposing the Hierarchical Scene Graph Encoder-Decoder (HSGED) for generating coherent and distinctive paragraphs. In particular, we use the image scene graph as the "script" to incorporate rich semantic knowledge and, more importantly, the hierarchical constraints into the model. Specifically, we design a sentence scene graph RNN (SSG-RNN) to generate sub-graph level topics, which constrain the word scene graph RNN (WSG-RNN) to generate the corresponding sentences. We propose irredundant attention in SSG-RNN to improve the possibility of abstracting topics from rarely described sub-graphs and inheriting attention in WSG-RNN to generate more grounded sentences with the abstracted topics, both of which give rise to more distinctive paragraphs. An efficient sentence-level loss is also proposed for encouraging the sequence of generated sentences to be similar to that of the ground-truth paragraphs. We validate HSGED on the Stanford image paragraph dataset and show that it not only achieves a new state-of-the-art 36.02 CIDEr-D, but also generates more coherent and distinctive paragraphs under various metrics.

Abstract:
Modeling the human visual attention mechanism is a fundamental problem for the understanding of human vision, and it has also been demonstrated to be an important module for various multimedia applications such as image captioning and visual question answering. In this paper, we propose a new probabilistic framework for attention, and introduce the concept of mode to model the flexibility and adaptability of attention modulation in complex environments. We characterize the correlations between the visual input, the activated mode, the saliency and the spatial allocation of attention via a graphical model representation, based on which we explore the lingual guidance from captioning data for the implementation of a mode-sensitive attention (MSA) model. The proposed framework explicitly justifies the usage of center bias for fixation prediction and can convert an arbitrary learning-based backbone attention model to a more robust multi-mode version. Experimental results on the York120, MIT1003 and PASCAL datasets demonstrate the effectiveness of the proposed method.

Abstract:
In this paper, we investigate the fragility of deep image captioning models against adversarial attacks. Different from existing works that generate common words and concepts, we focus on the adversarial attacks towards controllable image captioning, i.e., removing target words from captions by imposing adversarial noises to images while maintaining the captioning accuracy for the remaining visual content. We name this new task as Masked Image Captioning (MIC), which is expected to be training and labeling free for end-to-end captioning models. Meanwhile, we propose a novel adversarial learning approach for this new task, termed Show, Mask, and Tell (SMT), which crafts adversarial examples to mask the target concepts via minimizing an objective loss while training the noise generator. Concretely, three novel designs are introduced in this loss, i.e., word removal regularization, captioning accuracy regularization, and noise filtering regularization. For quantitative validation, we propose a benchmark dataset for MIC based on the MS COCO dataset, together with a new evaluation metric called Attack Quality. Experimental results show that the proposed approach achieves successful attacks by removing 93.8% and 91.9% target words while maintaining 97.3% and 97.4% accuracies on two cutting-edge captioning models, respectively.

Abstract:
A goal-oriented visual dialogue involves multi-turn interactions between two agents, Questioner and Oracle. During the dialogue, the answer given by Oracle is of great significance, as it provides the golden response to what Questioner is concerned with. Based on the answer, Questioner updates its belief on the target visual content and further raises another question. Notably, different answers lead to different visual beliefs and future questions. However, existing methods always indiscriminately encode answers after much longer questions, resulting in a weak utilization of answers. In this paper, we propose an Answer-Driven Visual State Estimator (ADVSE) to impose the effects of different answers on visual states. First, we propose an Answer-Driven Focusing Attention (ADFA) to capture the answer-driven effect on visual attention by sharpening question-related attention and adjusting it by answer-based logical operation at each turn. Then, based on the focusing attention, we get the visual state estimation by Conditional Visual Information Fusion (CVIF), where overall information and difference information are fused conditioning on the question-answer state. We evaluate the proposed ADVSE on both question generator and guesser tasks on the large-scale GuessWhat?! dataset and achieve state-of-the-art performance on both tasks. The qualitative results indicate that the ADVSE helps the agent generate highly efficient questions and obtain reliable visual attention during the question generation and guess processes.

Abstract:
Previous Person Re-Identification (Re-ID) models aim to focus on the most discriminative region of an image, but their performance may be compromised when that region is missing due to camera viewpoint changes or occlusion. To solve this issue, we propose a novel model named Hierarchical Bi-directional Feature Perception Network (HBFP-Net) to correlate multi-level information so that the levels reinforce each other. First, the correlation maps of cross-level feature-pairs are modeled via low-rank bilinear pooling. Then, based on the correlation maps, a Bi-directional Feature Perception (BFP) module is employed to enrich the attention regions of high-level features, and to learn abstract and specific information in low-level features. Next, we propose a novel end-to-end hierarchical network which integrates multi-level augmented features and inputs the augmented low- and middle-level features to the following layers to retrain a new powerful network. Moreover, we propose a novel trainable generalized pooling, which can dynamically select any value of all locations in feature maps to be activated. Extensive experiments on the mainstream evaluation datasets, including Market-1501, CUHK03 and DukeMTMC-ReID, show that our method outperforms the recent SOTA Re-ID models.

Abstract:
As the GAN-based face image and video generation techniques, widely known as DeepFakes, have become more and more matured and realistic, there comes a pressing and urgent demand for effective DeepFakes detectors. Motivated by the fact that remote visual photoplethysmography (PPG) is made possible by monitoring the minuscule periodic changes of skin color due to blood pumping through the face, we conjecture that normal heartbeat rhythms found in the real face videos will be disrupted or even entirely broken in a DeepFake video, making it a potentially powerful indicator for DeepFake detection. In this work, we propose DeepRhythm, a DeepFake detection technique that exposes DeepFakes by monitoring the heartbeat rhythms. DeepRhythm utilizes dual-spatial-temporal attention to adapt to dynamically changing face and fake types. Extensive experiments on FaceForensics++ and DFDC-preview datasets have confirmed our conjecture and demonstrated not only the effectiveness, but also the generalization capability of DeepRhythm over different datasets by various DeepFakes generation techniques and multifarious challenging degradations.

Abstract:
We present a novel Chinese calligraphy artwork composition system (MaLiang) which can generate aesthetic, stylistic and diverse calligraphy images based on the emotion status from the input text. Different from previous research, it's the first work to endow the calligraphy synthesis with the ability to express fickle emotions and composite a whole piece of discourse-level calligraphy artwork instead of single character images. The system consists of three modules: emotion detection, character image generation, and layout prediction. As a creative form of interactive art, MaLiang has been exhibited in several famous international art festivals.

Abstract:
MLModelCI provides multimedia researchers and developers with a one-stop platform for efficient machine learning (ML) services. The system leverages DevOps techniques to optimize, test, and manage models. It also containerizes and deploys these optimized and validated models as cloud services (MLaaS). In essence, MLModelCI serves as a housekeeper to help users publish models. The models are first automatically converted to optimized formats for production purposes and then profiled under different settings (e.g., batch size and hardware). The profiling information can be used as guidelines for balancing the trade-off between performance and cost of MLaaS. Finally, the system dockerizes the models for ease of deployment to cloud environments. A key feature of MLModelCI is the implementation of a controller, which enables elastic evaluation that only utilizes idle workers while maintaining online service quality. Our system bridges the gap between current ML training and serving systems and thus frees developers from manual and tedious work often associated with service deployment. We release the platform as an open-source project on GitHub under the Apache 2.0 license, with the aim that it will facilitate and streamline more large-scale ML applications and research projects.

Abstract:
Person re-identification has received much attention in the last few years, as it enhances the retrieval effectiveness in the video surveillance networks and video archive management. In this paper, we demonstrate a guiding robot with person followers system, which recognizes the follower using a person re-identification technology. It first adopts existing face recognition and person tracking methods to generate person tracklets with different IDs. Then, a classic person re-identification model, pre-trained on the surveillance dataset, is adapted to the new robot vision condition incrementally. The demonstration showcases the quality of robot follower focusing.

Abstract:
Multimedia content production is nowadays widespread due to technological advances, namely supported by smartphones and social media. Although the massive amount of media content brings new opportunities to the industry, it also obfuscates the relevance of marketing content, meant to maintain and lure new audiences. This leads to an emergent necessity of producing these kinds of contents as quickly and engagingly as possible. Creating these automatically would decrease both the production costs and time, particularly by using static media for the creation of short storytelling animated clips. We propose an innovative approach that uses context and content information to transform a still photo into an appealing context-aware video clip. Thus, our solution presents a contribution to the state-of-the-art in computer vision and multimedia technologies and assists content creators with a value-added service to automatically build rich contextualized multimedia stories from single photographs.

Abstract:
We demonstrate a video 360 navigation and streaming system for Mobile HMD devices. The Navigation Graph (NG) concept is used to predict future views that use a graph model that captures both temporal and spatial viewing behavior of prior viewers. Visualization of video 360 content navigation and view prediction algorithms is used for assessment of Quality of Experience (QoE) and evaluation of the accuracy of the NG-based view prediction algorithm.
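
The graph-based view prediction described above can be illustrated with a minimal sketch. This is not the authors' Navigation Graph implementation: the tile labels, the function names `build_navigation_graph` and `predict_next_view`, and the first-order simplification (the next view depends only on the current segment index and tile) are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_navigation_graph(traces):
    """Count transitions (segment index, tile) -> next tile over prior viewers' traces."""
    graph = defaultdict(Counter)
    for trace in traces:
        for t in range(len(trace) - 1):
            graph[(t, trace[t])][trace[t + 1]] += 1
    return graph

def predict_next_view(graph, segment, tile):
    """Return the tile most often viewed next; fall back to the current tile if unseen."""
    successors = graph.get((segment, tile))
    if not successors:
        return tile
    return successors.most_common(1)[0][0]

# Hypothetical viewing traces: one list of tile labels per prior viewer.
traces = [["A", "B", "C"], ["A", "B", "D"], ["A", "C", "D"], ["A", "B", "D"]]
graph = build_navigation_graph(traces)
next_tile = predict_next_view(graph, 0, "A")
```

Keying nodes by both segment index and tile is one simple way to encode temporal and spatial behavior together; a streaming client could prefetch the predicted tile at higher quality.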

Abstract:
This technical demo will present the Object Detection Kit, a system capable of collecting, analyzing and distributing street level imagery in real-time. It provides civil servants with the actionable intelligence about issues on city streets and, at the same time, equips the multimedia research community with a framework and data facilitating easy deployment and testing of algorithms in a challenging urban setting. The system is available as open source. In the Object Detection Kit demo we will demonstrate how the framework can be used to detect urban issues and showcase the capabilities of the system.

Abstract:
Live sports broadcasting is the live coverage of sports (e.g., a soccer match) as a television program, on various types of broadcasting media (e.g., television or internet). Directing such a live sports broadcast is costly and demands experienced sports directors with sufficient broadcasting skills. In this paper, we demonstrate an end-to-end intelligent system for live sports broadcasting, namely iDirector, which aims to mimic the human-in-the-loop live broadcasting process by aggregating the input multi-camera video streams into the final output program video (PGM video) for the audience. We construct this system as an event-driven pipeline with three modules: video decoder, video analyzer, and broadcasting controller. Specifically, given the multi-view video streams captured from cameras placed in the stadium, the video decoder module first decodes the input video streams into a series of frames and clips. Next, the video analyzer runs multiple pre-learned models in parallel for frame- and clip-level content understanding (e.g., event localization and highlight detection). Based on all the analytic results across frames and clips, with only 30 seconds of look-ahead, the broadcasting controller automatically produces the broadcast videos via camera view switching, playback and slow-motion. When high-profile events (e.g., a free kick) happen, the broadcasting controller renders visual effects on the PGM video to enhance the audience's viewing experience.

Abstract:
The current practical approaches for depth-aware pose estimation convert a human pose from a monocular 2D image into 3D space with a single computationally intensive convolutional neural network (CNN). This paper introduces the first open-source algorithm for binocular 3D pose estimation. It uses two separate lightweight CNNs to estimate disparity/depth information from a stereoscopic camera input. This multi-CNN fusion scheme makes it possible to perform full-depth sensing in real time on a consumer-grade laptop even if parts of the human body are invisible or occluded. Our real-time system is validated with a proof-of-concept demonstrator that is composed of two Logitech C930e webcams and a laptop equipped with Nvidia GTX1650 MaxQ GPU and Intel i7-9750H CPU. The demonstrator is able to process the input camera feeds at 30 fps and the output can be visually analyzed with a dedicated 3D pose visualizer.
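
To make the link between stereo disparity and depth concrete, the sketch below uses the standard pinhole relation for a rectified stereo pair, Z = f * B / d, and back-projects a 2D keypoint to 3D camera coordinates. It is not the paper's CNN pipeline; the function names and the camera parameters (focal length, baseline, principal point) are illustrative assumptions.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

def backproject(u, v, depth_m, focal_px, cx, cy):
    """Lift pixel (u, v) with known depth to camera coordinates (X, Y, Z)."""
    x = (u - cx) * depth_m / focal_px
    y = (v - cy) * depth_m / focal_px
    return (x, y, depth_m)

# Hypothetical values: 1000 px focal length, 10 cm baseline, 50 px disparity.
z = depth_from_disparity(50.0, 1000.0, 0.1)
joint_3d = backproject(740, 360, z, 1000.0, 640, 360)
```

Because depth is inversely proportional to disparity, small disparity errors on distant joints translate into large depth errors, which is one reason baseline and resolution matter in such a setup.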

Abstract:
ConfFlow is an interactive web application that allows conference participants to inspect other attendees through a visualized similarity space. The construction of the similarity space is done in a similar manner to the well-known Toronto Paper Matching System (TPMS) and based on the publicly available former publications of the attendees, obtained by crawling through the Web. ConfFlow aims to help attendees initiate new connections and collaborations with participants that have similar and/or complementary research interests. It has multiple functionalities that allow users to customize their experience and identify the perfect connection for their next collaboration.

Abstract:
Multiple Object Tracking (MOT) is an important task in computer vision. MOT is still challenging due to the occlusion problem, especially in dense scenes. Following the tracking-by-detection framework, we propose the Box-Plane Matching (BPM) method to improve the MOT performance in dense scenes. First, we design the Layer-wise Aggregation Discriminative Model (LADM) to filter the noisy detections. Then, to associate the remaining detections correctly, we introduce the Global Attention Feature Model (GAFM) to extract appearance features and use them to calculate the appearance similarity between history tracklets and current detections. Finally, we propose the Box-Plane Matching strategy to achieve data association according to the motion similarity and appearance similarity between tracklets and detections. With the effectiveness of the three modules, our team achieved 1st place on the Track-1 leaderboard in the ACM MM Grand Challenge HiEve 2020.
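
The abstract does not spell out the Box-Plane Matching strategy itself, so the following is only a generic tracking-by-detection association sketch under stated assumptions: motion similarity is approximated by box IoU, appearance similarity by cosine similarity of feature vectors, and matching is greedy. The names `associate`, `w_motion` and `threshold` are hypothetical.

```python
import math

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def cosine(u, v):
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0

def associate(tracklets, detections, w_motion=0.5, threshold=0.3):
    """Greedily match (box, feature) tracklets to detections by weighted similarity."""
    scored = []
    for ti, (t_box, t_feat) in enumerate(tracklets):
        for di, (d_box, d_feat) in enumerate(detections):
            score = (w_motion * iou(t_box, d_box)
                     + (1 - w_motion) * cosine(t_feat, d_feat))
            scored.append((score, ti, di))
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in sorted(scored, reverse=True):
        if score < threshold:
            break  # remaining pairs are too dissimilar to match
        if ti in used_t or di in used_d:
            continue
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    return sorted(matches)
```

Greedy matching keeps the sketch short; a production tracker would more commonly solve the assignment globally, for example with the Hungarian algorithm.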

Abstract:
Abnormal event detection is a non-trivial task in machine learning. The primary reason behind this is that the abnormal class occurs sparsely, and its temporal location may not be available. In this paper, we propose a multiple feature-based approach for CitySCENE challenge-based anomaly detection. For motion and context information, Res3D and Res101 architectures are used. Object-level information is extracted by object detection feature-based pooling. Fusion of three channels above gives relatively high performance on the challenge Test set for the general anomaly task. We also show how our method can be used for temporal localisation of the abnormal activity event in a video.

Abstract:
The First International Workshop on Human-Centric Multimedia Analysis is concentrated on the tasks of human-centric analysis with multimedia and multimodal information. It is one of the fundamental and challenging problems of multimedia understanding. The human-centric multimedia analysis involves multiple tasks such as face detection and recognition, human body pattern analysis, person re-identification, human action detection, person tracking, human-object interaction, and so on. Today, multiple multimedia sensing technologies and large-scale computing infrastructures are producing at a rapid velocity a wide variety of big multi-modality data for human-centric analysis, which provides rich knowledge to help tackle these challenges. Researchers have strived to push the limits of human-centric multimedia analysis in a wide variety of applications, such as intelligent surveillance, retailing, fashion design, and services. Therefore, this workshop aims to provide a platform to bridge the gap between the communities of human analysis and multimedia.

Abstract:
Recently, conversational systems have seen a significant rise in demand due to modern commercial applications using systems such as Amazon's Alexa, Apple's Siri, Microsoft's Cortana and Google Assistant. The research on multimodal chatbots is a widely underexplored area, where users and the conversational agent communicate by natural language and visual data. Conversational agents are now becoming a commodity as a number of companies push for this technology. The wide use of these conversational agents exposes the many challenges in achieving more natural, human-like, and engaging conversational agents. The research community is actively addressing several of these challenges: how are visual and text data related in user utterances? How to interpret the user intent? How to encode multimodal dialog status? What are the ethical and legal aspects of conversational AI? The Multimodal Conversational AI workshop will be a forum where researchers and practitioners share their experiences and brainstorm about success and failures in the topic. It will also promote collaboration to strengthen the conversational AI community at ACM Multimedia.

Abstract:
Neural network architecture design is playing the most important role in recent fast development of multimedia technology. In this talk, I mainly introduce the research and development efforts in designing neural networks from two orthogonal lines: 1) how these neural network models are bio-inspired, e.g. the 1x1 convolution simulates the function of cell-body of a neuron, and 2) how these models are more hardware-friendly or motivating the next-generation of AI chips, e.g. the selective convolution is expecting new design of hardware. These two lines of efforts are collaboratively enhancing the overall efficiency of multimedia systems.

Abstract:
Graph Convolutional Networks (GCNs) have attracted increasing interest for the task of skeleton-based action recognition. The key lies in the design of the graph structure, which encodes skeleton topology information. In this paper, we propose Dynamic GCN, in which a novel convolutional neural network named Context-encoding Network (CeN) is introduced to learn skeleton topology automatically. In particular, when learning the dependency between two joints, contextual features from the remaining joints are incorporated in a global manner. CeN is extremely lightweight yet effective, and can be embedded into a graph convolutional layer. By stacking multiple CeN-enabled graph convolutional layers, we build Dynamic GCN. Notably, as a merit of CeN, dynamic graph topologies are constructed for different input samples as well as graph convolutional layers of various depths. Besides, three alternative context modeling architectures are well explored, which may serve as a guideline for future research on graph topology learning. CeN brings only ~7% extra FLOPs for the baseline model, and Dynamic GCN achieves better performance with 2x to 4x fewer FLOPs than existing methods. By further combining static physical body connections and motion modalities, we achieve state-of-the-art performance on three large-scale benchmarks, namely NTU-RGB+D, NTU-RGB+D 120 and Skeleton-Kinetics.

Abstract:
Image cropping is an effective tool to edit and manipulate images to achieve better aesthetic quality. Most existing cropping approaches rely on the two-step paradigm where multiple candidate cropping areas are proposed initially and the optimal cropping window is determined based on some quality criteria for these candidates afterwards. The obvious disadvantage of this mechanism is its low efficiency due to the huge search space of candidate crops. In order to tackle this problem, a weakly supervised cropping framework is proposed, where the distribution dissimilarity between high quality images and cropped images is used to guide the coordinate predictor's training and the ground truths of cropping windows are not required by the proposed method. Meanwhile, to improve the cropping performance, a saliency loss is also designed in the proposed framework to force the neural network to focus more on the objects of interest in the image. Under this framework, the images can be cropped effectively by the trained coordinate predictor in a one-pass fashion without multiple candidate proposals, which ensures the high efficiency of the proposed system. Also, based on the proposed framework, many existing distribution dissimilarity measurements can be applied to train the image cropping system with high flexibility, such as the likelihood based and divergence based distribution dissimilarity measures proposed in this work. The experiments on the public databases show that the proposed cropping method achieves state-of-the-art accuracy, and a high computational efficiency of up to 285 FPS is also obtained.

Abstract:
Trajectory prediction is a highly desirable feature for safe navigation of autonomous vehicles in complex traffic. In this paper, we consider the practical environment of predicting trajectories in a heterogeneous traffic ecology. The proposed method has various applications in trajectory prediction problems and also in applied fields beyond tracking. One challenge stands out in trajectory prediction: the heterogeneous environment. In particular, many factors should be considered in such environments, i.e., multiple types of road-agents, social interactions and terrains. This information is so complex and voluminous that it may result in inaccurate trajectory prediction. We propose two social and visual enforced attention modules to circumvent the problem and a variant of an Info-GAN structure to predict the trajectory with multi-modal behaviors. Experimental results show that the proposed method significantly outperforms state-of-the-art methods in both heterogeneous and homogeneous real environments.

Abstract:
Today's scene graph generation (SGG) task is largely limited in realistic scenarios, mainly due to the extremely long-tailed bias of predicate annotation distribution. Thus, tackling the class imbalance trouble of SGG is critical and challenging. In this paper, we first discover that when predicate labels have strong correlation with each other, prevalent re-balancing strategies (e.g., re-sampling and re-weighting) will give rise to either over-fitting the tail data (e.g., bench sitting on sidewalk rather than on), or still suffering the adverse effect from the original uneven distribution (e.g., aggregating varied parked on/standing on/sitting on into on). We argue the principal reason is that re-balancing strategies are sensitive to the frequencies of predicates yet blind to their relatedness, which may play a more important role to promote the learning of predicate features. Therefore, we propose a novel Predicate-Correlation Perception Learning (PCPL for short) scheme to adaptively seek out appropriate loss weights by directly perceiving and utilizing the correlation among predicate classes. Moreover, our PCPL framework is further equipped with a graph encoder module to better extract context features. Extensive experiments on the benchmark VG150 dataset show that the proposed PCPL performs markedly better on tail classes while well-preserving the performance on head ones, which significantly outperforms previous state-of-the-art methods.

Abstract:
Multi-label image classification is an important and challenging task in computer vision and multimedia fields. Most of the recent works only capture the pair-wise dependencies among multiple labels through statistical co-occurrence information, which cannot model the high-order semantic relations automatically. In this paper, we propose a high-order semantic learning model based on adaptive hypergraph neural networks (AdaHGNN) to boost multi-label classification performance. Firstly, an adaptive hypergraph is constructed by using label embeddings automatically. Secondly, image features are decoupled into feature vectors corresponding to each label, and hypergraph neural networks (HGNN) are employed to correlate these vectors and explore the high-order semantic interactions. In addition, multi-scale learning is used to reduce sensitivity to object size inconsistencies. Experiments are conducted on four benchmarks: MS-COCO, NUS-WIDE, Visual Genome, and Pascal VOC 2007, which cover large, medium, and small-scale categories. State-of-the-art performances are achieved on three of them. Results and analysis demonstrate that the proposed method has the ability to capture high-order semantic dependencies.

Abstract:
Conditional image generation is an active research topic including text2image and image translation. Recently image manipulation with linguistic instruction brings new challenges of multimodal conditional generation. However, traditional conditional image generation models mainly focus on generating high-quality and visually realistic images, and lack resolving the partial consistency between image and instruction. To address this issue, we propose an Increment Reasoning Generative Adversarial Network (IR-GAN), which aims to reason the consistency between visual increment in images and semantic increment in instructions. First, we introduce the word-level and instruction-level instruction encoders to learn user's intention from history-correlated instructions as semantic increment. Second, we embed the representation of semantic increment into that of source image for generating target image, where source image plays the role of referring auxiliary. Finally, we propose a reasoning discriminator to measure the consistency between visual increment and semantic increment, which purifies user's intention and guarantees the good logic of generated target image. Extensive experiments and visualization conducted on two datasets show the effectiveness of IR-GAN.

Abstract:
We present an efficient finetuning methodology for neural-network filters which are applied as a postprocessing artifact-removal step in video coding pipelines. The finetuning is performed at the encoder side to adapt the neural network to the specific content that is being encoded. In order to maximize the PSNR gain and minimize the bitrate overhead, we propose to finetune only the convolutional layers' biases. The proposed method achieves convergence much faster than conventional finetuning approaches, making it suitable for practical applications. The weight-update can be included into the video bitstream generated by the existing video codecs. We show that our method achieves up to 9.7% average BD-rate gain when compared to the state-of-the-art Versatile Video Coding (VVC) standard codec on 7 test sequences.
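
As a toy illustration of bias-only finetuning (a sketch under simplifying assumptions, not the authors' training code), the snippet below adapts a single 1-D convolution to a target signal by gradient descent on the scalar bias while the filter weights stay frozen. The function names, learning rate and data are hypothetical.

```python
def conv1d(signal, weights, bias):
    """Valid 1-D convolution (correlation form) with a scalar bias."""
    k = len(weights)
    return [sum(weights[j] * signal[i + j] for j in range(k)) + bias
            for i in range(len(signal) - k + 1)]

def finetune_bias(signal, target, weights, bias, lr=0.1, steps=200):
    """Minimise the MSE against `target` by updating the bias only."""
    for _ in range(steps):
        out = conv1d(signal, weights, bias)
        # d(MSE)/d(bias) = 2 * mean(out - target); the weights get no update.
        grad = 2.0 * sum(o - t for o, t in zip(out, target)) / len(out)
        bias -= lr * grad
    return bias

# Hypothetical content: the pretrained filter is off by a constant 0.3.
signal = [1.0, 2.0, 3.0, 4.0]
weights = [0.5, 0.5]
target = [y + 0.3 for y in conv1d(signal, weights, 0.0)]
new_bias = finetune_bias(signal, target, weights, 0.0)
```

In this sketch only the bias delta (one number per output channel in a real network) would need to be signaled to the decoder, which is the intuition behind the low bitrate overhead mentioned above.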

Abstract:
Recognizing freehand sketches with high arbitrariness is such a great challenge that the automatic recognition rate has reached a ceiling in recent years. In this paper, we explicitly explore the shape properties of sketches, which has almost been neglected before in the context of deep learning, and propose a sequential dual learning strategy that combines both shape and texture features. We devise a two-stage recurrent neural network to balance these two types of features. Our architecture also considers stroke orders of sketches to reduce the intra-class variations of input features. Extensive experiments on the TU-Berlin benchmark set show that our method achieves over 90% recognition rate for the first time on this task, outperforming both humans and state-of-the-art algorithms by over 19 and 7.5 percentage points, respectively. Especially, our approach can distinguish the sketches with similar textures but different shapes more effectively than recent deep networks. Based on the proposed method, we develop an on-line sketch retrieval and imitation application to teach children or adults to draw. The application is available as Sketch.Draw.

Abstract:
Multi-modal utterance-level emotion detection has been a hot research topic in both multi-modal analysis and natural language processing communities. Different from traditional single-label multi-modal sentiment analysis, typical multi-modal emotion detection is naturally a multi-label problem where an utterance often contains multiple emotions. Existing studies normally focus on multi-modal fusion only and transform multi-label emotion classification into multiple binary classification problem independently. As a result, existing studies largely ignore two kinds of important dependency information: (1) Modality-to-label dependency, where different emotions can be inferred from different modalities, that is, different modalities contribute differently to each potential emotion. (2) Label-to-label dependency, where some emotions are more likely to coexist than those conflicting emotions. To simultaneously model above two kinds of dependency, we propose a unified approach, namely multi-modal emotion set generation network (MESGN) to generate an emotion set for an utterance. Specifically, we first employ a cross-modal transformer encoder to capture cross-modal interactions among different modalities, and a standard transformer encoder to capture temporal information for each modality-specific sequence given previous interactions. Then, we design a transformer-based discriminative decoding module equipped with modality-to-label attention to handle the modality-to-label dependency. In the meanwhile, we employ a reinforced decoding algorithm with self-critic learning to handle the label-to-label dependency. Finally, we validate the proposed MESGN architecture on a word-level aligned and unaligned multi-modal dataset. Detailed experimentation shows that our proposed MESGN architecture can effectively improve the performance of multi-modal multi-label emotion detection.

Abstract:
Most metric-based meta-learning methods learn only the sophisticated similarity metric for few-shot classification, which may lead to the feature deterioration and unreliable prediction. Toward this end, we propose new mechanisms to learn generalized and discriminative feature embeddings as well as improve the robustness of classifiers against prediction corruptions for meta-learning. For this purpose, a new generation operator BlockMix is proposed by integrating interpolation on the images and labels within metric learning. Based on the above BlockMix, we propose a novel regularization method Meta Regularization as an auxiliary task branch with its own classifier to better constraint the feature embedding module and stabilize the meta-learning process. Furthermore, a novel inference scheme Self-Calibrated Inference is proposed to alleviate the unreliable prediction problem by calibrating the prototype of each category with the confidence-weighted average of the support and generated samples. The proposed mechanisms can be used as supplementary techniques alongside standard metric-based meta-learning algorithms without any pre-training. Experimental results demonstrate the insights and the efficiency of the proposed mechanisms respectively, compared with the state-of-the-art methods on the prevalent few-shot benchmarks.
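
The prototype calibration step can be sketched as a confidence-weighted average of embedding vectors. This is a minimal illustration under stated assumptions: the function name `calibrated_prototype` is hypothetical, and how the confidences are obtained for support and generated samples is not specified in the abstract, so they are simply passed in.

```python
def calibrated_prototype(embeddings, confidences):
    """Confidence-weighted average of the embedding vectors of one class."""
    total = sum(confidences)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    dim = len(embeddings[0])
    return [sum(c * e[d] for e, c in zip(embeddings, confidences)) / total
            for d in range(dim)]

# Hypothetical class with one support embedding and one generated embedding.
support_and_generated = [[1.0, 0.0], [0.0, 1.0]]
confidences = [3.0, 1.0]
prototype = calibrated_prototype(support_and_generated, confidences)
```

With equal confidences this reduces to the plain mean used by prototype-based classifiers; unequal confidences pull the prototype toward the samples the model trusts more.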

Abstract:
Vision-based dynamic pedestrian intrusion detection (PID), judging whether pedestrians intrude an area-of-interest (AoI) by a moving camera, is an important task in mobile surveillance. The dynamically changing AoIs and a number of pedestrians in video frames increase the difficulty and computational complexity of determining whether pedestrians intrude the AoI, which makes previous algorithms incapable of this task. In this paper, we propose a novel and efficient multi-task deep neural network, PIDNet, to solve this problem. PIDNet is mainly designed by considering two factors: accurately segmenting the dynamically changing AoIs from a video frame captured by the moving camera and quickly detecting pedestrians from the generated AoI-contained areas. Three efficient network designs are proposed and incorporated into PIDNet to reduce the computational complexity: 1) a special PID task backbone for feature sharing, 2) a feature cropping module for feature cropping, and 3) a lighter detection branch network for feature compression. In addition, considering there are no public datasets and benchmarks in this field, we establish a benchmark dataset to evaluate the proposed network and give the corresponding evaluation metrics for the first time. Experimental results show that PIDNet can achieve 67.1% PID accuracy and 9.6 fps inference speed on the proposed dataset, which serves as a good baseline for the future vision-based dynamic PID study.

Abstract:
We have seen a rise in video-based user communication in the last year, unfortunately fueled by the spread of COVID-19. Efficient low-latency transmission of video is a challenging problem, which must also deal with the segmented nature of network infrastructure that does not always allow high throughput. Lossy video compression is a basic requirement to enable such technology widely. While this may compromise the quality of the streamed video, there are recent deep learning based solutions to restore the quality of a lossy compressed video.

Abstract:
Federated learning is a privacy-preserving machine learning technique that learns a shared model across decentralized clients. It can alleviate privacy concerns of person re-identification, an important computer vision task. In this work, we apply federated learning to person re-identification (FedReID) and optimize its performance affected by statistical heterogeneity in the real-world scenario. We first construct a new benchmark to investigate the performance of FedReID. This benchmark consists of (1) nine datasets with different volumes sourced from different domains to simulate the heterogeneous situation in reality, (2) two federated scenarios, and (3) an enhanced federated algorithm for FedReID. The benchmark analysis shows that the client-edge-cloud architecture, represented by the federated-by-dataset scenario, has better performance than client-server architecture in FedReID. It also reveals the bottlenecks of FedReID under the real-world scenario, including poor performance of large datasets caused by unbalanced weights in model aggregation and challenges in convergence. Then we propose two optimization methods: (1) To address the unbalanced weight problem, we propose a new method to dynamically change the weights according to the scale of model changes in clients in each training round; (2) To facilitate convergence, we adopt knowledge distillation to refine the server model with knowledge generated from client models on a public dataset. Experimental results demonstrate that our strategies can achieve much better convergence with superior performance on all datasets. We believe that our work will inspire the community to further explore the implementation of federated learning on more computer vision tasks in real-world scenarios.
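
The idea of weighting clients by the scale of their model changes can be sketched as follows. This is only an illustration: the abstract does not specify how the scale is measured, so the L2 norm of the parameter difference, the proportional weighting, and the names `change_scale` and `aggregate` are assumptions, and the paper's actual measure may differ.

```python
import math


def change_scale(global_params, client_params):
    """L2 norm of the change a client made to the global parameters (assumed measure)."""
    return math.sqrt(
        sum((c - g) ** 2 for g, c in zip(global_params, client_params))
    )


def aggregate(global_params, client_params_list):
    """Average client models with weights proportional to their change scale."""
    scales = [change_scale(global_params, p) for p in client_params_list]
    total = sum(scales)
    n = len(client_params_list)
    if total == 0:
        # No client changed the model: fall back to uniform weights.
        weights = [1.0 / n] * n
    else:
        weights = [s / total for s in scales]
    new_params = [
        sum(w * p[i] for w, p in zip(weights, client_params_list))
        for i in range(len(global_params))
    ]
    return new_params, weights


if __name__ == "__main__":
    params, weights = aggregate([0.0, 0.0], [[1.0, 0.0], [0.0, 3.0]])
    print(weights, params)
```

In contrast to weighting by dataset size, the weights here are recomputed every round from the models themselves, which is the property the abstract describes as dynamic.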

Abstract:
Multimodal Sentiment Analysis is an active area of research that leverages multimodal signals for affective understanding of user-generated videos. The predominant approach, addressing this task, has been to develop sophisticated fusion techniques. However, the heterogeneous nature of the signals creates distributional modality gaps that pose significant challenges. In this paper, we aim to learn effective modality representations to aid the process of fusion. We propose a novel framework, MISA, which projects each modality to two distinct subspaces. The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap. The second subspace is modality-specific, which is private to each modality and captures their characteristic features. These representations provide a holistic view of the multimodal data, which is used for fusion that leads to task predictions. Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models. We also consider the task of Multimodal Humor Detection and experiment on the recently proposed UR_FUNNY dataset. Here too, our model fares better than strong baselines, establishing MISA as a useful multimodal framework.

Abstract:
Humans can easily recognize actions with only a few examples given, while the existing video recognition models still heavily rely on the large-scale labeled data inputs. This observation has motivated an increasing interest in few-shot video action recognition, which aims at learning new actions with only very few labeled samples. In this paper, we propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition which is termed as AMeFu-Net. Concretely, we tackle the few-shot recognition problem from three aspects: firstly, we alleviate this extremely data-scarce problem by introducing depth information as a carrier of the scene, which will bring extra visual information to our model; secondly, we fuse the representation of original RGB clips with multiple non-strictly corresponding depth clips sampled by our temporal asynchronization augmentation mechanism, which synthesizes new instances at feature-level; thirdly, a novel Depth Guided Adaptive Instance Normalization (DGAdaIN) fusion module is proposed to fuse the two-stream modalities efficiently. Additionally, to better mimic the few-shot recognition process, our model is trained in the meta-learning way. Extensive experiments on several action recognition benchmarks demonstrate the effectiveness of our model.

Abstract:
Data inconsistency and bias are inevitable among different facial expression recognition (FER) datasets due to subjective annotating process and different collecting conditions. Recent works resort to adversarial mechanisms that learn domain-invariant features to mitigate domain shift. However, most of these works focus on holistic feature adaptation, and they ignore local features that are more transferable across different datasets. Moreover, local features carry more detailed and discriminative content for expression recognition, and thus integrating local features may enable fine-grained adaptation. In this work, we propose a novel Adversarial Graph Representation Adaptation (AGRA) framework that unifies graph representation propagation with adversarial learning for cross-domain holistic-local feature co-adaptation. To achieve this, we first build a graph to correlate holistic and local regions within each domain and another graph to correlate these regions across different domains. Then, we learn the per-class statistical distribution of each domain and extract holistic-local features from the input image to initialize the corresponding graph nodes. Finally, we introduce two stacked graph convolution networks to propagate holistic-local feature within each domain to explore their interaction and across different domains for holistic-local feature co-adaptation. In this way, the AGRA framework can adaptively learn fine-grained domain-invariant features and thus facilitate cross-domain expression recognition. We conduct extensive and fair experiments on several popular benchmarks and show that the proposed AGRA framework achieves superior performance over previous state-of-the-art methods.

Abstract:
Deep neural networks (DNNs) have shown serious vulnerability to adversarial examples with imperceptible perturbation to clean images. Most existing input-transformation based defense methods (e.g., ComDefend) rely heavily on the learned external priors from an external large training dataset, while neglecting the rich image internal priors of the input itself, thus limiting the generalization of the defense models against the adversarial examples with biased image statistics from the external training dataset. Motivated by deep image prior that can capture rich image statistics from a single image, we propose an effective Deep Image Prior Driven Defense (DIPDefend) method against adversarial examples. With a DIP generator to fit the target/adversarial input, we find that our image reconstruction exhibits quite an interesting learning preference from a feature learning perspective, i.e., the early stage primarily learns the robust features resistant to adversarial perturbation, followed by learning non-robust features that are sensitive to adversarial perturbation. Besides, we develop an adaptive stopping strategy that adapts our method to diverse images. In this way, the proposed model obtains a unique defender for each individual adversarial input, thus being robust to various attackers. Experimental results demonstrate the superiority of our method over the state-of-the-art defense methods against white-box and black-box adversarial attacks.

Abstract:
This paper proposes a new light-weight convolutional neural network (~5k params) for non-uniform illumination image enhancement to handle color, exposure, contrast, noise and artifacts, etc., simultaneously and effectively. More concretely, the input image is first enhanced using the Retinex model from two different aspects (enhancing under-exposure and suppressing over-exposure). Then, these two enhanced results and the original image are fused to obtain an image with satisfactory brightness, contrast and details. Finally, the extra noise and compression artifacts are removed to get the final result. To train this network, we propose a semi-supervised retouching solution and construct a new dataset (~82k images) that contains various scenes and light conditions. Our model can enhance 0.5 mega-pixel (like 600×800) images in real-time (~50 fps), which is faster than existing enhancement methods. Extensive experiments show that our solution is fast and effective in dealing with non-uniform illumination images.

Abstract:
Continuous sign language recognition (SLR) deals with unaligned video-text pairs and uses the word error rate (WER), i.e., edit distance, as the main evaluation metric. Since it is not differentiable, we usually instead optimize the learning model with the connectionist temporal classification (CTC) objective loss, which maximizes the posterior probability over the sequential alignment. Due to the optimization gap, the predicted sentence with the highest decoding probability may not be the best choice under the WER metric. To tackle this issue, we propose a novel architecture with cross modality augmentation. Specifically, we first augment cross-modal data by simulating the calculation procedure of WER, i.e., substitution, deletion and insertion on both the text label and its corresponding video. With these real and generated pseudo video-text pairs, we propose multiple loss terms to minimize the cross modality distance between the video and ground truth label, and make the network distinguish the difference between real and pseudo modalities. The proposed framework can be easily extended to other existing CTC based continuous SLR architectures. Extensive experiments on two continuous SLR benchmarks, i.e., RWTH-PHOENIX-Weather and CSL, validate the effectiveness of our proposed method.
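
For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance (the minimum number of substitutions, deletions and insertions) divided by the reference length. The sketch below shows that standard computation; the function names are illustrative and it is not taken from the paper's code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    # Row for the empty reference prefix: j insertions turn it into hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # i deletions turn ref[:i] into the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = edit distance over words / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c"))
```

Because the minimum over discrete edit operations has no useful gradient, the metric cannot be optimized directly, which is the gap between CTC training and WER evaluation that the abstract refers to.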

Abstract:
Video Object Segmentation (VOS) is typically formulated in a semi-supervised setting. Given the ground-truth segmentation mask on the first frame, the task of VOS is to track and segment the single or multiple objects of interest in the remaining frames of the video at the pixel level. One of the fundamental challenges in VOS is how to make the best use of the temporal information to boost the performance. We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as the temporal memories to address the temporal modeling in VOS. Our network consists of two temporal sub-networks including a short-term memory sub-network and a long-term memory sub-network. The short-term memory sub-network models the fine-grained spatial-temporal interactions between local regions across neighboring frames in video via a graph-based learning framework, which can well preserve the visual consistency of local regions over time. The long-term memory sub-network models the long-range evolution of the object via a Simplified-Gated Recurrent Unit (S-GRU), making the segmentation robust against occlusions and drift errors. In our experiments, we show that our proposed method achieves a favorable and competitive performance on three frequently-used VOS datasets, including DAVIS 2016, DAVIS 2017 and Youtube-VOS in terms of both speed and accuracy.

Abstract:
Many studies on deep neural networks have shown very promising results for most image recognition tasks. However, these networks can often be fooled by adversarial examples that simply add small but powerful distortions to the original input. Recent works have demonstrated the vulnerability of deep learning systems to adversarial examples, but most such works directly manipulate and attack the digital images for a specific classifier only, and cannot attack the physical images in the real world. In this paper, we propose the multi-sample ensemble method (MSEM) and most-likely ensemble method (MLEM) to generate adversarial attacks that successfully fool the classifier for images in both the digital and real worlds. The proposed adaptive norm algorithm can craft smaller perturbations faster than other state-of-the-art attack methods. Besides, the proposed MLEM extended with a weighted objective function can generate robust adversarial attacks that can mislead multiple classifiers (Inception-v3, Inception-v4, Resnet-v2, Ince-res-v2) simultaneously for physical images in the real world. Compared with other methods, experiments show that our adversarial attack methods not only can achieve higher success rates but also can survive in the multi-model defense tests.

Abstract:
Face reenactment aims to animate a source face image to a different pose and expression provided by a driving image. Existing approaches are either designed for a specific identity, or suffer from the identity preservation problem in the one-shot or few-shot scenarios. In this paper, we introduce a method for one-shot face reenactment, which uses the reconstructed 3D meshes (i.e., the source mesh and driving mesh) as guidance to learn the optical flow needed for the reenacted face synthesis. Technically, we explicitly exclude the driving face's identity information in the reconstructed driving mesh. In this way, our network can focus on the motion estimation for the source face without the interference of driving face shape. We propose a motion net to learn the face motion, which is an asymmetric autoencoder. The encoder is a graph convolutional network (GCN) that learns a latent motion vector from the meshes, and the decoder serves to produce an optical flow image from the latent vector with CNNs. Compared to previous methods using sparse keypoints to guide the optical flow learning, our motion net learns the optical flow directly from 3D dense meshes, which provide the detailed shape and pose information for the optical flow, so it can achieve more accurate expression and pose on the reenacted face. Extensive experiments show that our method can generate high-quality results and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.

Abstract:
To exploit rich information from unlabeled data, in this work, we propose a novel self-supervised framework for visual tracking which can easily adapt the state-of-the-art supervised Siamese-based trackers into unsupervised ones by utilizing the fact that an image and any cropped region of it can form a natural pair for self-training. Besides common geometric transformation-based data augmentation and hard negative mining, we also propose adversarial masking which helps the tracker to learn other context information by adaptively blacking out salient regions of the target. The proposed approach can be trained offline using images only without any requirement of manual annotations and temporal information from multiple consecutive frames. Thus, it can be used with any kind of unlabeled data, including images and video frames. For evaluation, we take SiamFC as the base tracker and name the proposed self-supervised method as S2SiamFC. Extensive experiments and ablation studies on the challenging VOT2016 and VOT2018 datasets are provided to demonstrate the effectiveness of the proposed method, which achieves performance comparable to its supervised counterpart and to other unsupervised methods requiring multiple frames.

Abstract:
Lip reading aims to recognize text from talking lip, while lip generation aims to synthesize talking lip according to text, which is a key component in talking face generation and is a dual task of lip reading. Both tasks require a large amount of paired lip video and text training data, and perform poorly in low-resource scenarios with limited paired training data. In this paper, we develop DualLip, a system that jointly improves lip reading and generation by leveraging the task duality and using unlabeled text and lip video data. The key ideas of the DualLip include: 1) Generate lip video from unlabeled text using a lip generation model, and use the pseudo data pairs to improve lip reading; 2) Generate text from unlabeled lip video using a lip reading model, and use the pseudo data pairs to improve lip generation. To leverage the benefit of DualLip on lip generation, we further extend DualLip to talking face generation with two additionally introduced components: lip to face generation and text to speech generation, which share the same duration for synchronization. Experiments on GRID and TCD-TIMIT datasets demonstrate the effectiveness of DualLip on improving lip reading, lip generation and talking face generation by utilizing unlabeled data, especially in low-resource scenarios. Specifically, on the GRID dataset, the lip generation model in our DualLip system trained with only 10% paired data and 90% unpaired data surpasses the performance of that trained with the whole paired data, and our lip reading model achieves 1.16% character error rate and 2.71% word error rate, outperforming the state-of-the-art models using the same amount of paired data.

Abstract:
We propose an efficient framework, called Simple Swap (SimSwap), aiming for generalized and high fidelity face swapping. In contrast to previous approaches that either lack the ability to generalize to arbitrary identity or fail to preserve attributes like facial expression and gaze direction, our framework is capable of transferring the identity of an arbitrary source face into an arbitrary target face while preserving the attributes of the target face. We overcome the above defects in the following two ways. First, we present the ID Injection Module (IIM) which transfers the identity information of the source face into the target face at feature level. By using this module, we extend the architecture of an identity-specific face swapping algorithm to a framework for arbitrary face swapping. Second, we propose the Weak Feature Matching Loss which efficiently helps our framework to preserve the facial attributes in an implicit way. Extensive experiments on wild faces demonstrate that our SimSwap is able to achieve competitive identity performance while preserving attributes better than previous state-of-the-art methods.

Abstract:
Though recent advances in point cloud completion have shown exciting promise with learning-based methods, most of them still generate coarse point clouds with a fixed number of points (e.g. 2048). In this paper, we propose Vaccine-Style-Net, a new point cloud completion method that can produce high resolution 3D shapes with a complete smooth surface. Vaccine-Style-Net performs point cloud completion in the function space of the 3D surface, which represents the 3D surface as a continuous decision boundary function. Meanwhile, a reinforcement learning agent is embedded to deduce the complete 3D geometry from the incomplete point cloud. In contrast to the existing approaches, the completed 3D shapes produced by our method can be of any resolution without excessive memory footprint. Moreover, to increase the diversity and adaptability of the method, we introduce two-type-free-form masks to simulate various corrupted inputs as well as a mask dataset called onion-peeling-mask (OPM). Finally, we discuss the limitations of existing evaluation metrics for shape completion tasks and explore a novel metric to supplement the existing ones. Experiments demonstrate that our method not only achieves competitive results qualitatively and quantitatively but also can produce a continuous 3D shape with any resolution.

Abstract:
Skeleton-based human action recognition has attracted much attention with the prevalence of accessible depth sensors. Recently, graph convolutional networks (GCNs) have been widely used for this task due to their powerful capability to model graph data. The topology of the adjacency graph is a key factor for modeling the correlations of the input skeletons. Thus, previous methods mainly focus on the design/learning of the graph topology. But once the topology is learned, only a single-scale feature and one transformation exist in each layer of the networks. Many insights, such as multi-scale information and multiple sets of transformations, that have been proven to be very effective in convolutional neural networks (CNNs), have not been investigated in GCNs. The reason is that, due to the gap between graph-structured skeleton data and conventional image/video data, it is very challenging to embed these insights into GCNs. To overcome this gap, we reinvent the split-transform-merge strategy in GCNs for skeleton sequence processing. Specifically, we design a simple and highly modularized graph convolutional network architecture for skeleton-based action recognition. Our network is constructed by repeating a building block that aggregates multi-granularity information from both the spatial and temporal paths. Extensive experiments demonstrate that our network outperforms state-of-the-art methods by a significant margin with only 1/5 of the parameters and 1/10 of the FLOPs.

Abstract:
A large number of recent studies on adversarial attack have verified that a Deep Neural Network (DNN) model designed for non-sequential recognition (NSR) tasks (e.g., classification, detection and segmentation) can be easily fooled by adversarial examples. However, only a few studies pay attention to the adversarial attack on sequential recognition (SR). They either apply the attack methods proposed for NSR to SR by neglecting the sequential dependencies, or focus on attacking specific SR models without considering the generality. In this paper, we study the adversarial attack on the general and popular DNN structure of CNN+RNN, i.e., the combination of convolutional neural network (CNN) and recurrent neural network (RNN), which has been widely used in various SR tasks. We take scene text recognition (STR) and image captioning (IC) as case studies, derive the objective function for attacking the CNN+RNN based models with targeted and untargeted attack modes, and then develop an optimization-based algorithm to learn adversarial perturbations from the derived gradients of each character (or word) in the sequence by incorporating the sequential dependencies. Extensive experiments show that our proposed method can effectively fool several state-of-the-art models, including four STR models and two IC models, with a higher success rate and less time consumption compared to three recent attack methods.

Abstract:
Cross-dataset facial expression recognition (FER) has remained a challenging problem due to the obvious biases caused by diverse subjects and various collection conditions. To this end, domain adaptation can be adopted as an effective solution by learning invariant representations across domains (datasets). However, FER requires special consideration of its specific problems, e.g., uncertainties caused by ambiguous facial images, and diverse inter- and intra-class relationships. Such uncertainties already exist in single dataset FER, and could be significantly aggravated by enlarged class-wise discrepancies under cross-dataset scenarios. To mitigate this problem, this paper proposes an unsupervised domain adaptation method via regularized conditional alignment for FER, which adversarially reduces domain- and class-wise discrepancies while explicitly dealing with uncertainties within and across domains. Specifically, the proposed method effectively suppresses uncertainties in FER transfer tasks via: 1) a semantics-preserving adaptation framework which enforces both domain-invariant learning and class-level semantic consistency between source and target expression data, where discriminative cluster structures are simultaneously retained; 2) auxiliary uncertainty regularization which further constrains the ambiguity of cluster boundaries to guarantee the transferring reliability, thus discouraging the negative transfer brought by divergent facial images. Evaluation experiments on publicly available datasets demonstrate that the proposed method significantly outperforms the current state-of-the-art methods.

Abstract:
Self-supervised depth estimation has shown great prospects in inferring 3D structures using purely unannotated images. However, its performance usually drops when trained on images with changing brightness and moving objects. In this paper, we address this issue by enhancing the robustness of the self-supervised paradigm using a set of image-based and geometry-based constraints. Our contributions are threefold: 1) we propose a gradient-based robust photometric loss which restrains the false supervisory signals caused by brightness changes, 2) we propose to filter out the unreliable areas that violate the rigid assumption by a novel combined selective mask, which is computed on the forward pass of the network by leveraging the inter-loss consistency and the loss-gradient consistency, and 3) we constrain the motion estimation network to generate across-frame consistent motions via proposing a triplet-based cycle consistency constraint. Extensive experiments conducted on the KITTI, Cityscapes and Make3D datasets demonstrate that the proposed method can effectively handle complex scenes with changing brightness and object motions. Both qualitative and quantitative results show that the proposed method outperforms the state-of-the-art methods.

Abstract:
The problem of deforming an artist-drawn caricature according to a given normal face expression is of interest in applications such as social media, animation and entertainment. This paper presents a solution to the problem, with an emphasis on enhancing the ability to create desired expressions while preserving the identity exaggeration style of the caricature, which imposes challenges due to the complicated nature of caricatures. The key to our solution is a novel method to model caricature expression, which extends the traditional 3DMM representation to the caricature domain. The method consists of shape modelling and texture generation for caricatures. Geometric optimization is developed to create identity-preserving blendshapes for reconstructing accurate and stable geometric shape, and a conditional generative adversarial network (cGAN) is designed for generating dynamic textures under target expressions. Combining the shape and texture components allows the non-trivial expressions of a caricature to be effectively defined by the extension of the popular 3DMM representation, so a caricature can be flexibly deformed into arbitrary expressions with visually good results in both shape and color spaces. The experiments demonstrate the effectiveness of the proposed method.

Abstract:
The ability of recommending cold items (that have no behavior history) is a core strength of multimedia recommendation compared with behavior-only collaborative filtering. To learn effective item representation, a key challenge lies in the discrepancy between training and testing, since the cold items only exist in the testing data. This means that the signal used to represent an item varies during training and testing --- in the training stage, we can represent an item with both collaborative embedding and content embedding; whereas in the testing stage, we represent a cold item with content embedding only. Nevertheless, existing learning frameworks omit this critical discrepancy, resulting in suboptimal item representation for multimedia recommendation.

Abstract:
2D image-based 3D object retrieval is a very important task in computer vision and big data management. Conventional image-based 3D object retrieval usually assumes that the images are from one single domain. However, for real applications, 2D images may be from multiple domains (e.g., real image, sketch, and quick draw). It raises significant challenges for this task since these 2D images have a great domain gap with each other as well as a great modality gap with 3D objects. To address these issues, we propose an unsupervised Domain-Specific Alignment Network (DSAN) for multi-domain image-based 3D object retrieval. The proposed method aims to reduce domain discrepancy by domain-specific alignment network with multi-level moment matching, including first-order moment and second-order moment. Based on the observation that for any given sample, different domain classifiers should output the same label, we design a domain-specific classifier alignment module. To our knowledge, the proposed method is the first unsupervised work to align multiple-domain 2D images with 3D objects in an end-to-end manner. The multi-domain dataset MDI3D is utilized to advocate the research on this task, and the extensive experimental results demonstrate the superiority of the proposed method.
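
Moment matching of the kind mentioned above is often written as a penalty on the difference between feature means (first order) and feature covariances (second order) of two domains. The sketch below shows that generic form under stated assumptions: population covariance, squared Euclidean and Frobenius distances, equal weighting of the two terms, and the hypothetical name `moment_matching_loss`. The abstract does not give the paper's exact loss.

```python
def feature_mean(feats):
    """Per-dimension mean of a list of feature vectors."""
    n = len(feats)
    return [sum(f[d] for f in feats) / n for d in range(len(feats[0]))]


def feature_covariance(feats):
    """Population covariance matrix (divides by n, an assumption)."""
    n = len(feats)
    mu = feature_mean(feats)
    dim = len(mu)
    return [
        [sum((f[a] - mu[a]) * (f[b] - mu[b]) for f in feats) / n
         for b in range(dim)]
        for a in range(dim)
    ]


def moment_matching_loss(source, target):
    """Squared mean difference plus squared Frobenius covariance difference."""
    mu_s, mu_t = feature_mean(source), feature_mean(target)
    first = sum((s - t) ** 2 for s, t in zip(mu_s, mu_t))
    cov_s, cov_t = feature_covariance(source), feature_covariance(target)
    dim = len(mu_s)
    second = sum(
        (cov_s[a][b] - cov_t[a][b]) ** 2
        for a in range(dim) for b in range(dim)
    )
    return first + second


if __name__ == "__main__":
    src = [[0.0, 0.0], [2.0, 0.0]]
    tgt = [[1.0, 1.0], [1.0, 1.0]]
    print(moment_matching_loss(src, tgt))
```

The loss is zero when both domains share the same mean and covariance, so minimizing it pulls the two feature distributions together up to second-order statistics.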

Abstract:
Reorganizing implicit feedback of users as a user-item interaction graph facilitates the applications of graph convolutional networks (GCNs) in recommendation tasks. In the interaction graph, edges between user and item nodes function as the main element of GCNs to perform information propagation and generate informative representations. Nevertheless, an underlying challenge lies in the quality of interaction graph, since observed interactions with less-interested items occur in implicit feedback (say, a user views micro-videos accidentally). This means that the neighborhoods involved with such false-positive edges will be influenced negatively and the signal on user preference can be severely contaminated. However, existing GCN-based recommender models leave such challenge under-explored, resulting in suboptimal representations and performance.

Abstract:
We present the SphericRTC system for real-time 360-degree video communication. 360-degree video allows the viewer to observe the environment in any direction from the camera location. This more immersive streaming experience allows users to exchange information more efficiently and can be beneficial in the real-time setting. Our system applies a novel approach to select representations of 360-degree frames to allow efficient, content-adaptive delivery. The system performs joint content and bitrate adaptation in real-time by offloading expensive transformation operations to the GPU via CUDA. The system demonstrates that the multiple sub-components -- viewport feedback, representation selection, and joint content and bitrate adaptation -- can be effectively integrated within a single framework. Compared to a baseline implementation, views in SphericRTC have consistently higher visual quality. The median Viewport-PSNR of such views is 2.25 dB higher than views in the baseline system.

Abstract:
Volumetric video allows viewers to experience highly-realistic 3D content with six degrees of freedom in mixed reality (MR) environments. Rendering complex volumetric videos can require a prohibitively high amount of computational power for mobile devices. A promising technique to reduce the computational burden on mobile devices is to perform the rendering at a cloud server. However, cloud-based rendering systems suffer from an increased interaction (motion-to-photon) latency that may cause registration errors in MR environments. One way of reducing the effective latency is to predict the viewer's head pose and render the corresponding view from the volumetric video in advance.

Abstract:
Reconstructing a human portrait in a realistic and convenient manner is critical for human modeling and understanding. Aiming at light-weight and realistic human portrait reconstruction, in this paper we propose Neural3D: a novel neural human portrait scanning system using only a single RGB camera. In our system, to enable accurate pose estimation, we propose a context-aware correspondence learning approach which jointly models the appearance, spatial and motion information between feature pairs. To enable realistic reconstruction and suppress the geometry error, we further adopt a point-based neural rendering scheme to generate realistic and immersive portrait visualization from arbitrary virtual viewpoints. By introducing these learning-based technical components into the pure RGB-based human modeling framework, we can achieve both accurate camera pose estimation and realistic free-viewpoint rendering of the reconstructed human portrait. Extensive experiments on a variety of challenging capture scenarios demonstrate the robustness and effectiveness of our approach.

Abstract:
360-degree video is an emerging medium that presents an immersive view of the environment to the user. Despite its potential to provide an immersive watching experience, 360-degree video has not achieved widespread popularity. A significant cause of this slow adoption is the high-bandwidth requirements of the format. The primary source of bandwidth inefficiency in 360-degree video streaming, un-addressed in popular transmission methods, is the discrepancy between the pixels sent over the network (typically the full omnidirectional view) and the pixels displayed in the head-mounted display's field of view. At worst, roughly 88% of transmitted pixels remain unviewed.

Abstract:
360-degree video streaming commonly encodes and transmits the video as independently-decodable tiles to conserve bandwidth of regions out of the viewer's field of view (FoV). The bitrate of the tiles, however, can vary significantly across the tiles, complicating the choice of the representation to download for each tile in each segment to adapt to the bandwidth dynamics. In this paper, we model the tile rate allocation problem as a multiclass knapsack problem with a dynamic profit function that is a function of the FoV and the buffer occupancy. Experiments show that our approach can reduce bandwidth wastage by up to 41%, the number of stalls by up to 31%, stall durations by up to 26.5%, switches in quality by up to 20%, without sacrificing the quality of the tiles within the FoV, even when there are significant head movement and changes in FoV during streaming.
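
The knapsack formulation above can be sketched as a small dynamic program in which exactly one representation is chosen per tile and the total bitrate must stay within a budget. This is only a minimal sketch: the function name, the integer bitrates, and the fixed profit values are assumptions. In the paper the profit is a dynamic function of the FoV and the buffer occupancy, which is not modelled here.

```python
from typing import Dict, List, Tuple

Option = Tuple[int, float]  # (bitrate cost, profit) of one representation


def allocate_tile_rates(tiles: List[List[Option]],
                        budget: int) -> Tuple[float, List[int]]:
    """Pick one representation per tile, maximising total profit within budget.

    Returns (best profit, index of the chosen option for each tile).
    """
    # states maps "total bitrate used so far" to the best (profit, picks).
    states: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for options in tiles:
        next_states: Dict[int, Tuple[float, List[int]]] = {}
        for used, (profit, picks) in states.items():
            for index, (rate, gain) in enumerate(options):
                total = used + rate
                if total > budget:
                    continue  # this choice would exceed the bandwidth budget
                candidate = (profit + gain, picks + [index])
                if total not in next_states or candidate[0] > next_states[total][0]:
                    next_states[total] = candidate
        states = next_states
    if not states:
        raise ValueError("no allocation fits within the budget")
    return max(states.values(), key=lambda state: state[0])


# Hypothetical example: the first tile is inside the FoV, so its
# high-rate representation is given a much larger profit.
tiles = [
    [(1, 1.0), (3, 5.0)],  # tile inside the FoV
    [(1, 0.5), (3, 1.0)],  # tile outside the FoV
]
best_profit, choices = allocate_tile_rates(tiles, budget=4)
```

With a budget of 4 units, the sketch spends the extra bitrate on the in-FoV tile and keeps the other tile at its lowest rate.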

Abstract:
In real person re-identification (ReID) tasks, pedestrians are often obscured by other pedestrians or objects; moreover, changes in poses or observation perspectives also commonly exist in partial-person ReID. To the best of our knowledge, few works simultaneously focus on these two issues. In this work, we propose a novel visibility-aware texture semantic alignment (TSA) approach for the partial person ReID task, where the occlusion issue and changes in poses are simultaneously explored in an end-to-end unified framework. Specifically, we first employ a texture alignment scheme with the semantic visibility of a person's image to solve the issue of changes in poses, which can enhance the alignment and generalization capability of the models. Second, we design a human pose-based partial region alignment scheme to solve the occlusion problem, which makes the TSA method emphasize the shared body parts. Finally, these two networks jointly learn these aspects. Extensive experimental results demonstrate that our proposed TSA method is very effective and robust for simultaneously handling occlusion and changes in pose. It outperforms state-of-the-art approaches by a large margin, achieving improvements of 5% and 6.4% in rank-1 accuracy over the visibility-aware part model (VPM) method (published in CVPR 2019) on the Partial ReID and Partial-iLIDS datasets, respectively.

Abstract:
Neural network-based models are notorious for their adversarial vulnerability. Recent adversarial machine learning has mainly focused on images, where a small perturbation can simply be added to fool the learning model. Very recently, this practice has been explored in human action video attacks by adding perturbation to key frames. Unfortunately, frame selection is usually computationally expensive at run-time, and adding noise to all frames is also unrealistic. In this paper, we present a novel yet efficient approach to address this issue. Multi-modal video data such as RGB, depth and skeleton data have been widely used for human action modeling, and they have been shown to outperform a single modality. Interestingly, we observed that the skeleton data is more "vulnerable" under adversarial attack, and we propose to leverage this "Achilles' Heel" to attack multi-modal video data. In particular, first, an adversarial learning paradigm is designed to perturb skeleton data for a specific action under a black box setting, which highlights how body joints and key segments in videos are subject to attack. Second, we propose a graph attention model to explore the semantics between segments from different modalities and within a modality. Third, the attack will be launched at run-time on all modalities through the learned semantics. The proposed method has been extensively evaluated on multi-modal visual action datasets, including PKU-MMD and NTU-RGB+D, to validate its effectiveness.

Abstract:
Poem generation from images aims to automatically generate poetic sentences that present the image content or overtone. Previous works focused on 1-to-1 image-poem generation with the demands of poeticness and content relevance. This paper proposes the paradigm of multiple poems generation from one image, which is closer to human poetizing but more challenging. Its key problem is to simultaneously guarantee the diversity of multiple poems with poeticness and relevance. To this end, we propose an end-to-end probabilistic Diverter-Guider Recurrent Network (DG-Net), which is a context-based encoder-decoder generative model with hierarchical stochastic variables. Specifically, the diverter-variable represents the decoding-context inferred from the input image to diversify the poem themes; the guider-variable is introduced as an attribute decoder to restrict the word-choice with supervised information. Extensive experiments on automatic evaluations and human judgments demonstrate the superior performance of DG-Net over existing poem generation methods. A qualitative study shows that our model can generate diverse poems with poeticness and relevance.

Abstract:
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network. The framework consists of two innovative fusion schemes. Firstly, unlike existing multimodal methods that necessitate individual encoders for different modalities, we verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder, which also enables implicit fusion via joint feature representation learning. Secondly, we propose a bidirectional multi-layer fusion scheme, where multimodal features can be exploited progressively. To take advantage of such scheme, we introduce two asymmetric fusion operations including channel shuffle and pixel shift, which learn different fused features with respect to different fusion directions. These two operations are parameter-free and strengthen the multimodal feature interactions across channels as well as enhance the spatial feature discrimination within channels. We conduct extensive experiments on semantic segmentation and image translation tasks, based on three publicly available datasets covering diverse modalities. Results indicate that our proposed framework is general, compact and is superior to state-of-the-art fusion frameworks.
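
To make the idea of parameter-free fusion operations concrete, the sketch below shows a generic channel shuffle (in the ShuffleNet style, which interleaves channels from different groups) and a one-directional pixel shift with zero padding. Both functions are assumptions for illustration: the abstract does not define the exact form of its channel shuffle and pixel shift operations, and the paper's asymmetric, direction-dependent variants may differ.

```python
from typing import List, Sequence, TypeVar

Channel = TypeVar("Channel")


def channel_shuffle(channels: Sequence[Channel], groups: int) -> List[Channel]:
    """Interleave channels across groups without any learnable parameters.

    With two modalities concatenated along the channel axis and groups=2,
    the output alternates between channels of the two modalities.
    """
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    return [channels[g * per_group + i]
            for i in range(per_group) for g in range(groups)]


def pixel_shift(feature_map: Sequence[Sequence[float]], dx: int) -> List[List[float]]:
    """Shift a 2D feature map horizontally by dx pixels, padding with zeros.

    Positive dx shifts to the right, negative dx to the left.
    """
    width = len(feature_map[0])
    if dx >= 0:
        return [[0] * min(dx, width) + list(row[:max(width - dx, 0)])
                for row in feature_map]
    return [list(row[-dx:]) + [0] * min(-dx, width) for row in feature_map]


# Hypothetical channel labels: two RGB-branch and two depth-branch channels.
mixed = channel_shuffle(["rgb0", "rgb1", "depth0", "depth1"], groups=2)
shifted = pixel_shift([[1, 2, 3], [4, 5, 6]], dx=1)
```

Neither operation adds weights, which is consistent with the abstract's description of the fusion operations as parameter-free.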

Abstract:
We propose a novel deep multi-modality neural network for restoring very low bit rate videos of talking heads. Such video contents are very common in social media, teleconferencing, distance education, tele-medicine, etc., and often need to be transmitted with limited bandwidth. The proposed CNN method exploits the correlations among three modalities, video, audio and emotion state of the speaker, to remove the video compression artifacts caused by spatial down sampling and quantization. The deep learning approach turns out to be ideally suited for the video restoration task, as the complex non-linear cross-modality correlations are very difficult to model analytically and explicitly. The new method is a video post processor that can significantly boost the perceptual quality of aggressively compressed talking head videos, while being fully compatible with all existing video compression standards.

Abstract:
In this article, we tackle the cross-modal video moment localization issue, namely, localizing the most relevant video moment in an untrimmed video given a sentence as the query. The majority of existing methods focus on generating video moment candidates with the help of multi-scale sliding window segmentation. They hence inevitably suffer from numerous candidates, which result in the less effective retrieval process. In addition, the spatial scene tracking is crucial for realizing the video moment localization process, but it is rarely considered in traditional techniques. To this end, we innovatively contribute a spatial-temporal reinforcement learning framework. Specifically, we first exploit a temporal-level reinforcement learning to dynamically adjust the boundary of localized video moment instead of the traditional window segmentation strategy, which is able to accelerate the localization process. Thereafter, a spatial-level reinforcement learning is proposed to track the scene on consecutive image frames, therefore filtering out less relevant information. Lastly, an alternative optimization strategy is proposed to jointly optimize the temporal- and spatial-level reinforcement learning. Thereinto, the two tasks of temporal boundary localization and spatial scene tracking are mutually reinforced. By experimenting on two real-world datasets, we demonstrate the effectiveness and rationality of our proposed solution.

Abstract:
It is widely shared that capturing relationships among multi-modality features would be helpful for representing and ultimately describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer to improve connections among visual features, termed I2RT. Firstly, we propose Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature representations, we align them explicitly via Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, leading to improvements on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments are conducted for the comprehensive analysis of the proposed method.

Abstract:
Visual Storytelling (VIST) is a task to tell a narrative story about a certain topic according to the given photo stream. The existing studies focus on designing complex models, which rely on a huge amount of human-annotated data. However, the annotation of VIST is extremely costly and many topics cannot be covered in the training dataset due to the long-tail topic distribution. In this paper, we focus on enhancing the generalization ability of the VIST model by considering the few-shot setting. Inspired by the way humans tell a story, we propose a topic adaptive storyteller to model the ability of inter-topic generalization. In practice, we apply the gradient-based meta-learning algorithm on multi-modal seq2seq models to endow the model with the ability to adapt quickly from topic to topic. Besides, we further propose a prototype encoding structure to model the ability of intra-topic derivation. Specifically, we encode and restore the few training story texts to serve as a reference to guide the generation at inference time. Experimental results show that topic adaptation and the prototype encoding structure mutually benefit the few-shot model on the BLEU and METEOR metrics. A further case study shows that the stories generated after few-shot adaptation are more relevant and expressive.

Abstract:
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of (human, action, object) triplets in images. Most existing works treat HOIs as individual interaction categories and thus cannot handle the problems of long-tail distribution and polysemy of action labels. We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs. Leveraging the compositional and relational peculiarities of HOI labels, we propose ConsNet, a knowledge-aware framework that explicitly encodes the relations among objects, actions and interactions into an undirected graph called a consistency graph, and exploits Graph Attention Networks (GATs) to propagate knowledge among HOI categories as well as their constituents. Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into a visual-semantic joint embedding space and obtains detection results by measuring their similarities. We extensively evaluate our model on the challenging V-COCO and HICO-DET datasets, and results validate that our approach outperforms state-of-the-art methods under both fully-supervised and zero-shot settings.

Abstract:
Every child is the leading actor in her/his unique world. To help them achieve performance, an interactive art called 'little world' is proposed to let virtual humans accompany children on drama performance. Theatrical adaptation rewrites the novel to an interactive drama suitable for children. Little world builds drama scenes and virtual humans for characters and lets children interact with them by speech and actions.

Abstract:
In the last decade, the Video Browser Showdown (VBS) became a comparative platform for various interactive video search tools competing in selected video retrieval tasks. However, the participation of new teams with an own, novel tool is prohibitively time-demanding because of the large number and complexity of components required for constructing a video search system from scratch. To partially alleviate this difficulty, we provide an open-source version of the lightweight known-item search system SOMHunter that competed successfully at VBS 2020. The system combines several features for text-based search initialization and browsing of large result sets; in particular a variant of W2VV++ model for text search, temporal queries for targeting sequences of frames, several types of displays including the eponymous self-organizing map view, and a feedback-based approach for maintaining the relevance scores inspired by PICHunter. The minimalistic, easily extensible implementation of SOMHunter should serve as a solid basis for constructing new search systems, thus facilitating easier exploration of new video retrieval ideas.

Abstract:
With the spread of internet television, many studies have been conducted to recommend relevant information for TV programs. For example, NHK Hybridcast is a new TV service that provides relevant information on the same screen during a TV program broadcast. However, current services cannot recommend supplementary information for TV programs based on user viewing behavior. Therefore, in this paper, we propose a video viewing support system to recommend supplementary information using geographical relationships based on user interaction.

Abstract:
Social media is an indispensable part in modern life and social media popularity prediction can be applied to many aspects of sociality. In this paper, we propose a novel combined framework for social media popularity prediction, which accomplishes feature generalization and temporal modeling based on multi-modal feature extraction. On the one hand, in order to address the generalization problem caused by massive missing data, we train two CatBoost models with different datasets and integrate their outputs with a linear combination. On the other hand, sliding window average is employed to mine potential short-term dependency for each user's post sequence. Extensive experiments show that our proposed framework has superiorities in both feature generalization and temporal modeling. Besides, our approach achieves the 1st place on the leader board of the SMP Challenge in 2020, which proves the effectiveness of our proposed framework.
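
The two ingredients named above, a linear combination of two model outputs and a sliding window average over a user's post sequence, can be sketched as follows. The function names, the blending weight, and the trailing-window definition are assumptions for illustration; the abstract gives neither the weights nor the window size, and the CatBoost models themselves are not reproduced here.

```python
from typing import List, Sequence


def blend(pred_a: Sequence[float], pred_b: Sequence[float],
          weight: float) -> List[float]:
    """Linear combination of two models' predictions: w * a + (1 - w) * b."""
    return [weight * a + (1.0 - weight) * b for a, b in zip(pred_a, pred_b)]


def sliding_window_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing average over the current post and up to window - 1 earlier posts.

    Applied to one user's chronologically ordered posts, this yields a
    short-term trend feature for each post.
    """
    averages = []
    for i in range(len(values)):
        recent = values[max(0, i - window + 1): i + 1]
        averages.append(sum(recent) / len(recent))
    return averages


# Hypothetical popularity scores predicted by two models for two posts.
combined = blend([1.0, 3.0], [3.0, 5.0], weight=0.5)

# Hypothetical popularity history of one user's four posts.
trend = sliding_window_average([1.0, 2.0, 3.0, 4.0], window=2)
```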

Abstract:
Recently, multi-object tracking (MOT) for estimating trajectories of pedestrians has undergone fast development and played an important role in human-centric video analysis. However, video analysis in complex events (e.g. scenes in the HiEve dataset) is still under-explored. In complex real-world scenarios, the domain gap in unseen testing scenes and the severe occlusion problem that disconnects tracks are challenging for existing online MOT methods without domain adaptation. To alleviate the domain gap, we study the problem in a transductive learning setting, which assumes that unlabeled testing data is available for learning offline tracking. We propose a transductive interactive self-training method to adapt the tracking model to unseen crowded scenes with unlabeled testing data by means of teacher-student interactive learning. To reduce prediction variance in an unseen domain, we train two different models and teach one model with pseudo labels of unlabeled data predicted by the other model interactively. To improve robustness against occlusions during self-training, we exploit disconnected track interpolation (DTI) to refine the predicted pseudo labels. Our method achieved a MOTA of 60.23 on the HiEve dataset and won first place in Multi-person Motion Tracking in Complex Events (with Private Detection) in the ACM MM Grand Challenge on Large-scale Human-centric Video Analysis in Complex Events.
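
One simple way to picture track interpolation is to fill the frames missing between two observations of the same identity by linearly interpolating the bounding boxes. The sketch below does exactly that. It assumes the disconnected fragments have already been associated with one identity and that linear interpolation is acceptable; the abstract does not state how DTI associates fragments or which interpolation it uses.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def interpolate_track(track: Dict[int, Box]) -> Dict[int, Box]:
    """Fill frames missing between observed frames by linear interpolation."""
    frames = sorted(track)
    filled = dict(track)
    for start, end in zip(frames, frames[1:]):
        gap = end - start
        for frame in range(start + 1, end):
            t = (frame - start) / gap  # fraction of the way from start to end
            filled[frame] = tuple(
                a + t * (b - a) for a, b in zip(track[start], track[end])
            )
    return filled


# Hypothetical track observed at frames 1 and 3, occluded at frame 2.
observed = {1: (0.0, 0.0, 10.0, 10.0), 3: (2.0, 2.0, 12.0, 12.0)}
refined = interpolate_track(observed)
```

The interpolated boxes could then serve as additional pseudo labels for the occluded frames.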

Abstract:
In this paper, we propose a novel multimodal feature fusion method based on scene segmentation to detect the relationships between entities in a long duration video. Specifically, a long video is split into some scenes and entities in the scenes are tracked. Text, audio and visual features in a scene are extracted to predict relationships between different entities in the scene. The relationships between entities construct a knowledge graph of the video and can be used to answer some queries about the video. The experimental results show that our method performs well for deep video understanding on the HLVU dataset.

Abstract:
While a multitude of approaches for extracting semantic information from multimedia documents has emerged in recent years, isolating any form of holistic semantic representation from a larger type of document, such as a movie, is not yet feasible. In this paper we present the approaches we used in the first instance of the Deep Video Understanding Challenge. We combine several multi-modal detectors with an integration scheme informed by methods from the semantic web context in order to determine the capabilities and limitations of currently available methods for extracting semantic relations between the characters and locations relevant to the narrative of a movie.

Abstract:
The task of person-level action recognition in complex events aims to densely detect pedestrians and individually predict their actions from surveillance videos. In this paper, we present a simple yet efficient pipeline for this task, referred to as TSD-TSM networks. Firstly, we adopt the TSD detector for the pedestrian localization on each single keyframe. Secondly, we generate the sequential ROIs for a person proposal by replicating the adjusted bounding box coordinates around the keyframe. Particularly, we propose to conduct straddling expansion and region squaring on the original bounding box of a person proposal to widen the potential space of motion and interaction and lead to a square box for ROI detection. Finally, we adapt the TSM classifier on the generated ROI sequences to perform action classification and further adopt late fusion to promote the prediction. Our proposed pipeline achieved the 3rd place in the ACM-MM 2020 grand challenge, i.e., Large-scale Human-centric Video Analysis in Complex Events (Track-4), obtaining final 15.31% wf-mAP@avg and 20.63% f-mAP@avg on the testing set.

Abstract:
Over the last ten years, we have seen a strong progression of technology around smartphones. Each new generation acquires capabilities that significantly increase performance. At the same time, several deep learning tools are offered today by the giants of the net for mobile, embedded and IoT devices. These libraries allow machine learning inference on the device with low latency. They provide pre-trained models, but one can also use one's own models and run them on mobile, embedded or microcontroller devices. Lack of privacy, poor Internet connectivity and the high cost of cloud platforms have made on-device inference popular among app developers, but significant challenges remain, especially for real-time tasks like augmented reality or autonomous driving. This PhD research aims at providing a path for developers to help them choose the best methods and tools to do real-time inference on mobile devices. In this paper, we present a performance benchmark of four popular open-source deep learning inference frameworks used on mobile devices on three different convolutional neural network models. We focus our work on the image classification process and particularly on the validation image bank of the ImageNet 2012 dataset. We try to answer three questions: How does a framework influence model prediction and latency? Why are some frameworks better in terms of latency/accuracy than others with the same model? And what are the difficulties in implementing these frameworks inside a mobile application? Our first findings demonstrate that the low-level software implementations chosen in frameworks, the model conversion steps and the parameters set in the framework have a big impact on performance and accuracy.

Abstract:
In this work, we develop deep neural networks for predicting affective responses from movies taking both audio and video streams into account. This study also tackles the issue of how to build a representation of video and audio in order to predict emotions that movies elicit in viewers. Besides, we analyse and identify helpful features extracted from video and audio streams that are important for the design of a good emotion prediction model. Fusion techniques are also taken into account with the aim to obtain the highest prediction accuracy.

Abstract:
Although recent object detectors have shown excellent performance for vehicle detection, they struggle in scenarios with a relatively large number of vehicles. In this paper, we explore dense vehicle detection given the number of vehicles. Existing crowd counting methods cannot be directly applied to dense vehicle detection due to the insufficient description provided by the density map and the lack of an effective constraint for mining the spatial awareness of dense vehicles. Inspired by these observations, a conceptually simple yet efficient framework, called CODAN, is proposed for dense vehicle detection. The proposed approach is composed of three major components: (i) an efficient strategy for generating multi-scale density maps (MDM) is designed to represent the vehicle counting, which can capture the global semantics and spatial information of dense vehicles, (ii) a multi-branch attention module (MAM) is proposed to bridge the gap between object counting and the vehicle detection framework, (iii) with the well-designed density maps as explicit supervision, an effective counting-awareness loss (C-Loss) is employed to guide the attention learning by building a pixel-level constraint. Extensive experiments conducted on four benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art methods. The results indicate that vehicle detection and counting can be mutually supportive, which is an important and meaningful finding.

Abstract:
Due to the existence of label noise in web images and the high memorization capacity of deep neural networks, training deep fine-grained (FG) models directly through web images tends to have an inferior recognition ability. In the literature, to alleviate this issue, loss correction methods try to estimate the noise transition matrix, but the inevitable false correction would cause severe accumulated errors. Sample selection methods identify clean ("easy") samples based on the fact that small losses can alleviate the accumulated errors. However, "hard" and mislabeled examples that can both boost the robustness of FG models are also dropped. To this end, we propose a certainty-based reusable sample selection and correction approach, termed as CRSSC, for coping with label noise in training deep FG models with web images. Our key idea is to additionally identify and correct reusable samples, and then leverage them together with clean examples to update the networks. We demonstrate the superiority of the proposed approach from both theoretical and experimental perspectives.

Abstract:
In this paper, we propose a scene-aware context reasoning method that exploits context information from visual features for unsupervised abnormal event detection in videos, which bridges the semantic gap between visual context and the meaning of abnormal events. In particular, we build a spatio-temporal context graph to model visual context information including appearances of objects, spatio-temporal relationships among objects and scene types. The context information is encoded into the nodes and edges of the graph, and their states are iteratively updated by using multiple RNNs with message passing for context reasoning. To infer the spatio-temporal context graph in various scenes, we develop a graph-based deep Gaussian mixture model for scene clustering in an unsupervised manner. We then compute frame-level anomaly scores based on the context information to discriminate abnormal events in various scenes. Evaluations on three challenging datasets, including the UCF-Crime, Avenue, and ShanghaiTech datasets, demonstrate the effectiveness of our method.

Abstract:
Recently, Siamese networks based tracking algorithms have shown favorable performance. Latest work focuses on better feature embedding and target state estimation, which greatly improves the accuracy. Nevertheless, the simple cross-correlation operation of the features between a fixed template and the search region limits their robustness and discrimination capability. In this paper, we pay more attention to learn an outstanding similarity measure for robust tracking. We propose a novel relation network that can be integrated on top of previous trackers without any need for further training of the siamese networks, which achieves a superior discriminative ability. During online inference, we utilize the feedback from high-confidence tracking results to obtain an additional template and update it, which improves the robustness and generalization. We implement two versions of the proposed approach with the SiamFC-based tracker and SiamRPN-based tracker to validate the strong compatibility of our algorithm. Extensive experimental results on several tracking benchmarks indicate that the proposed method can effectively improve the performance and robustness of the underlying trackers without reducing speed too much, and performs superiorly against the state-of-the-art trackers.

Abstract:
In this paper, we focus on egocentric action anticipation from videos, which enables various applications, such as helping intelligent wearable assistants understand users' needs and enhance their capabilities in the interaction process. It requires intelligent systems to observe from the perspective of the first person and predict an action before it occurs. Owing to the uncertainty of future, it is insufficient to perform action anticipation relying on visual information especially when there exists salient visual difference between past and future. In order to alleviate this problem, which we call visual gap in this paper, we propose one novel Intuition-Analysis Integrated (IAI) framework inspired by psychological research, which mainly consists of three parts: Intuition-based Prediction Network (IPN), Analysis-based Prediction Network (APN) and Adaptive Fusion Network (AFN). To imitate the implicit intuitive thinking process, we model IPN as an encoder-decoder structure and introduce one procedural instruction learning strategy implemented by textual pre-training. On the other hand, we allow APN to process information under designed rules to imitate the explicit analytical thinking, which is divided into three steps: recognition, transitions and combination. Both the procedural instruction learning strategy in IPN and the transition step of APN are crucial to improving the anticipation performance via mitigating the visual gap problem. Considering the complementarity of intuition and analysis, AFN adopts attention fusion to adaptively integrate predictions from IPN and APN to produce the final anticipation results. We conduct experiments on the largest egocentric video dataset. Qualitative and quantitative evaluation results validate the effectiveness of our IAI framework, and demonstrate the advantage of bridging visual gap by utilizing multi-modal information, including both visual features of observed segments and sequential instructions of actions.

Abstract:
Although heatmap regression is considered a state-of-the-art method to locate facial landmarks, it suffers from huge spatial complexity and is prone to quantization error. To address this, we propose a novel attentive one-dimensional heatmap regression method for facial landmark localization. First, we predict two groups of 1D heatmaps to represent the marginal distributions of the x and y coordinates. These 1D heatmaps reduce spatial complexity significantly compared to current heatmap regression methods, which use 2D heatmaps to represent the joint distributions of x and y coordinates. With much lower spatial complexity, the proposed method can output high-resolution 1D heatmaps despite limited GPU memory, significantly alleviating the quantization error. Second, a co-attention mechanism is adopted to model the inherent spatial patterns existing in x and y coordinates, and therefore the joint distributions on the x and y axes are also captured. Third, based on the 1D heatmap structures, we propose a facial landmark detector capturing spatial patterns for landmark detection on an image; and a tracker further capturing temporal patterns with a temporal refinement mechanism for landmark tracking. Experimental results on four benchmark databases demonstrate the superiority of our method.
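
The relationship between a 2D heatmap and its two 1D marginals, and the resulting saving in storage, can be illustrated with the sketch below. It derives the marginals from a small 2D heatmap purely for illustration; in the method described above the 1D heatmaps are predicted directly by the network. The function names and the argmax decoding are assumptions, not the paper's implementation.

```python
from typing import List, Sequence, Tuple


def marginal_heatmaps(heatmap: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Sum a 2D heatmap (rows = y, columns = x) into x and y marginals."""
    width = len(heatmap[0])
    x_marginal = [sum(row[x] for row in heatmap) for x in range(width)]
    y_marginal = [sum(row) for row in heatmap]
    return x_marginal, y_marginal


def decode_landmark(x_heatmap: Sequence[float],
                    y_heatmap: Sequence[float]) -> Tuple[int, int]:
    """Read the landmark position as the peak of each 1D heatmap."""
    x = max(range(len(x_heatmap)), key=x_heatmap.__getitem__)
    y = max(range(len(y_heatmap)), key=y_heatmap.__getitem__)
    return x, y


def storage_cost(width: int, height: int) -> Tuple[int, int]:
    """Values stored per landmark: (one 2D heatmap, two 1D heatmaps)."""
    return width * height, width + height


# Toy 3x3 heatmap with its peak at x=1, y=1.
toy_heatmap = [
    [0.0, 0.1, 0.0],
    [0.0, 0.7, 0.1],
    [0.0, 0.1, 0.0],
]
x_marginal, y_marginal = marginal_heatmaps(toy_heatmap)
landmark = decode_landmark(x_marginal, y_marginal)
cost_2d, cost_1d = storage_cost(256, 256)
```

At a 256 x 256 resolution, one landmark needs 65,536 values as a 2D heatmap but only 512 values as two 1D heatmaps, which is why the 1D form can afford a higher output resolution within the same memory.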

Abstract:
Definitive embeddings remain a fundamental challenge of computational musicology for symbolic music in deep learning today. Analogous to natural language, music can be modeled as a sequence of tokens. This motivates the majority of existing solutions to explore the utilization of word embedding models to build music embeddings. However, music differs from natural languages in two key aspects: (1) a musical token is multi-faceted -- it comprises pitch, rhythm and dynamics information; and (2) musical context is two-dimensional -- each musical token is dependent on both melodic and harmonic contexts. In this work, we provide a comprehensive solution by proposing a novel framework named PiRhDy that integrates pitch, rhythm, and dynamics information seamlessly. PiRhDy adopts a hierarchical strategy which can be decomposed into two steps: (1) token (i.e., note event) modeling, which separately represents pitch, rhythm, and dynamics and integrates them into a single token embedding; and (2) context modeling, which utilizes melodic and harmonic knowledge to train the token embedding. A thorough study was made on each component and sub-strategy of PiRhDy. We further validate our embeddings in three downstream tasks -- melody completion, accompaniment suggestion, and genre classification. Results indicate a significant advancement of the neural approach towards symbolic music as well as PiRhDy's potential as a pretrained tool for a broad range of symbolic music applications.

Abstract:
Given a query image, vehicle Re-Identification (ReID) searches for the same vehicle in multi-camera scenarios, and it has attracted much attention in recent years. However, vehicle ReID severely suffers from the perspective variation problem. For different vehicles with similar color and type that are captured from different perspectives, all visual patterns are misaligned and warped, which makes it hard for the model to find the exact discriminative regions. In this paper, we propose a part perspective transformation (PPT) module to map the different parts of a vehicle into a unified perspective respectively. The PPT disentangles the vehicle features of different perspectives and then aligns them at a fine-grained level. Further, we propose a dynamically batch hard triplet loss to select the common visible regions of the compared vehicles. Our approach helps the model to generate perspective-invariant features and find the exact distinguishable regions for vehicle ReID. Extensive experiments on three standard vehicle ReID datasets show the effectiveness of our method.

Abstract:
Understanding fine-grained activities, such as sport highlights, is an overlooked problem that receives considerably less research attention. Potential reasons include the absence of specific fine-grained action benchmark datasets, a research preference for general super-categorical activity classification, and the challenge of large visual similarities between fine-grained actions. To tackle these, we collect and manually annotate two sport highlights datasets, i.e., Basketball-8 & Soccer-10, for fine-grained action classification. Sample clips in the datasets are annotated with professional sub-categorical actions such as "dunk" and "goalkeeping". We also propose a Compact Bilinear Augmented Query Structured Attention (CBA-QSA) module and stack it on top of general three-dimensional neural networks in a plug-and-play manner to emphasize important spatio-temporal clues in highlight clips. Specifically, we adapt hierarchical attention neural networks, which contain a learnable query scheme, to video in order to identify discriminative spatial/temporal visual clues within highlight clips. We name this altered attention, which separately learns a query for spatial/temporal features, query structured attention (QSA). Furthermore, we inflate bilinear mapping, which is a mature technique to represent local pairwise interactions for image-level fine-grained classification, to video understanding. In detail, we extend its compact version (i.e., compact bilinear mapping (CBM) based on TensorSketch) to deal with the three-dimensional video signal for modeling local pairwise motion information. We eventually incorporate CBM and QSA together to form CBA-QSA neural networks for fine-grained sport highlights classification. Experimental results demonstrate that CBA-QSA improves over general state-of-the-art methods on the Basketball-8 and Soccer-10 datasets.
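
As background on the TensorSketch idea behind compact bilinear mapping, the sketch below is an illustration under stated assumptions, not the CBA-QSA module. It projects two feature vectors with count sketches and combines them by circular convolution, which yields a low-dimensional sketch of their outer product. The vector sizes, the random hashes and all function names are hypothetical, and the 3D video extension described in the abstract is not reproduced.

```python
import random

def count_sketch(x, bucket, sign, dim):
    """Project x into `dim` buckets: out[bucket[i]] += sign[i] * x[i]."""
    out = [0.0] * dim
    for i, value in enumerate(x):
        out[bucket[i]] += sign[i] * value
    return out

def circular_convolution(a, b):
    """Direct circular convolution; an FFT gives the same result in O(d log d)."""
    dim = len(a)
    out = [0.0] * dim
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            out[(i + j) % dim] += a_i * b_j
    return out

def compact_bilinear(x, y, hashes, dim):
    """TensorSketch of the outer product of x and y."""
    (h1, s1), (h2, s2) = hashes
    return circular_convolution(count_sketch(x, h1, s1, dim),
                                count_sketch(y, h2, s2, dim))

def random_hash(n, dim, rng):
    """Random bucket index and random sign for each of the n input coordinates."""
    buckets = [rng.randrange(dim) for _ in range(n)]
    signs = [rng.choice((-1, 1)) for _ in range(n)]
    return buckets, signs

rng = random.Random(0)
n, dim = 8, 16                      # hypothetical feature size and sketch size
hashes = (random_hash(n, dim, rng), random_hash(n, dim, rng))
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]
sketch = compact_bilinear(x, y, hashes, dim)   # 16 values instead of 8 * 8 = 64
```

The result equals a count sketch applied directly to the full outer product with the combined hash `(h1[i] + h2[j]) % dim` and sign `s1[i] * s2[j]`, which is why the pairwise interactions never need to be materialized.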

Abstract:
Object detection and counting are related but challenging problems, especially for drone-based scenes with small objects and cluttered backgrounds. In this paper, we propose a new Guided Attention network (GAnet) to deal with both object detection and counting tasks based on the feature pyramid. Different from previous methods relying on unsupervised attention modules, we fuse different scales of feature maps by using the proposed weakly-supervised Background Attention (BA) between the background and objects for more semantic feature representation. Then, the Foreground Attention (FA) module is developed to consider both global and local appearance of the object to facilitate accurate localization. Moreover, a new data augmentation strategy is designed to train a robust model in drone-based scenes with various illumination conditions. Extensive experiments on three challenging benchmarks (i.e., UAVDT, CARPK and PUCPR+) show the state-of-the-art detection and counting performance of the proposed method compared with existing methods. Code can be found at https://isrc.iscas.ac.cn/gitlab/research/ganet.

Abstract:
End-to-end visual odometry (VO) is a complicated task with high temporal dependency, but the design of its deep networks lacks thorough investigation. Meanwhile, neural architecture search (NAS) has been widely studied and applied in many computer vision fields due to its advantage in automatic network design. However, most of the existing NAS frameworks only consider single-image tasks such as image classification and do not consider video (multi-frame) tasks such as VO. Therefore, this paper explores the network design for the VO task and proposes a more general single-path one-shot NAS, named VONAS, which can model sequential information for video-related tasks. Extensive experiments show that the network architecture is significant for (un)supervised VO. The models obtained by VONAS are lightweight and achieve SOTA performance with good generalization.

Abstract:
Dance and music are two highly correlated artistic forms. Synthesizing dance motions has attracted much attention recently. Most previous works conduct music-to-dance synthesis by directly mapping music to human skeleton keypoints. Meanwhile, human choreographers design dance motions from music in a two-stage manner: they first devise multiple choreographic dance units (CAUs), each with a series of dance motions, and then arrange the CAU sequence according to the rhythm, melody and emotion of the music. Inspired by this, we systematically study this two-stage choreography approach and construct a dataset to incorporate such choreography knowledge. Based on the constructed dataset, we design a two-stage music-to-dance synthesis framework, ChoreoNet, to imitate the human choreography procedure. Our framework first devises a CAU prediction model to learn the mapping relationship between music and CAU sequences. Afterwards, we devise a spatial-temporal inpainting model to convert the CAU sequence into continuous dance motions. Experimental results demonstrate that the proposed ChoreoNet outperforms baseline methods (0.622 in terms of CAU BLEU score and 1.59 in terms of user study score).

Abstract:
Retrieving video moments from an untrimmed video given a natural language query is a challenging task in both academia and industry. Although much effort has been made to address this issue, traditional video moment ranking methods are unable to generate reasonable video moment candidates and video moment localization approaches are not applicable to large-scale retrieval scenarios. How to combine ranking and localization into a unified framework to overcome their drawbacks and reinforce each other is rarely considered. Toward this end, we contribute a novel solution to thoroughly investigate the video moment retrieval issue under the adversarial learning paradigm. The key of our solution is to formulate the video moment retrieval task as an adversarial learning problem with two tightly connected components. Specifically, reinforcement learning is employed as a generator to produce a set of possible video moments. Meanwhile, a pairwise ranking model is utilized as a discriminator to rank the generated video moments and the ground truth. Finally, the generator and the discriminator are mutually reinforced in the adversarial learning framework, which is able to jointly optimize the performance of both video moment ranking and video moment localization. Extensive experiments on two well-known datasets have verified the effectiveness and rationality of our proposed solution.

Abstract:
Vehicle re-identification (Re-Id) is a challenging task due to the inter-class similarity, the intra-class difference, and the cross-view misalignment of vehicle parts. Although recent methods achieve great improvement by learning detailed features from keypoints or bounding boxes of parts, vehicle Re-Id is still far from being solved. Different from existing methods, we propose a Parsing-guided Cross-part Reasoning Network, named as PCRNet, for vehicle Re-Id. The PCRNet explores vehicle parsing to learn discriminative part-level features, model the correlation among vehicle parts, and achieve precise part alignment for vehicle Re-Id. To accurately segment vehicle parts, we first build a large-scale Multi-grained Vehicle Parsing (MVP) dataset from surveillance images. With the parsed parts, we extract regional features for each part and build a part-neighboring graph to explicitly model the correlation among parts. Then, the graph convolutional networks (GCNs) are adopted to propagate local information among parts, which can discover the most effective local features of varied viewpoints. Moreover, we propose a self-supervised part prediction loss to make the GCNs generate features of invisible parts from visible parts under different viewpoints. By this means, the same vehicle from different viewpoints can be matched with the well-aligned and robust feature representations. Through extensive experiments, our PCRNet significantly outperforms the state-of-the-art methods on three large-scale vehicle Re-Id datasets.

Abstract:
While live 360° video streaming provides an enriched viewing experience, it is challenging to guarantee the user experience against the negative effects introduced by start-up delay, event-to-eye delay, and low frame rate. It is therefore imperative to understand how different computing tasks of a live 360° streaming system contribute to these three delay metrics. Although prior works have studied commercial live 360° video streaming systems, none of them has dug into the end-to-end pipeline and explored how the task-level time consumption affects the user experience. In this paper, we conduct the first in-depth measurement study of task-level time consumption for five system components in live 360° video streaming. We first identify the subtle relationship between the time consumption breakdown across the system pipeline and the three delay metrics. We then build a prototype Zeus to measure this relationship. Our findings indicate the importance of CPU-GPU transfer at the camera and the server initialization as well as the negligible effect of 360° video stitching on the delay metrics. We finally validate that our results are representative of real world systems by comparing them with those obtained with a commercial system.

Abstract:
With the recent advances in voice synthesis, AI-synthesized fake voices are indistinguishable to human ears and are widely applied to produce realistic and natural DeepFakes, posing real threats to our society. However, effective and robust detectors for synthesized fake voices are still in their infancy and are not ready to fully tackle this emerging threat. In this paper, we devise a novel approach, named DeepSonar, based on monitoring neuron behaviors of a speaker recognition (SR) system, i.e., a deep neural network (DNN), to discern AI-synthesized fake voices. Layer-wise neuron behaviors provide an important insight to meticulously catch the differences among inputs, and they are widely employed for building safe, robust, and interpretable DNNs. In this work, we leverage the power of layer-wise neuron activation patterns with a conjecture that they can capture the subtle differences between real and AI-synthesized fake voices, providing a cleaner signal to classifiers than raw inputs. Experiments are conducted on three datasets (including commercial products from Google, Baidu, etc.) containing both English and Chinese languages to corroborate the high detection rates (98.1% average accuracy) and low false alarm rates (about 2% error rate) of DeepSonar in discerning fake voices. Furthermore, extensive experimental results also demonstrate its robustness against manipulation attacks (e.g., voice conversion and additive real-world noises). Our work further offers a new insight into adopting neuron behaviors for effective and robust AI-aided multimedia fakes forensics as an inside-out approach instead of being motivated and swayed by various artifacts introduced in synthesizing fakes.

Abstract:
Manipulating visual attributes of images through human-written text is a very challenging task. On the one hand, models have to learn the manipulation without the ground truth of the desired output. On the other hand, models have to deal with the inherent ambiguity of natural language. Previous research usually requires either the user to describe all the characteristics of the desired image or the use of richly-annotated image captioning datasets. In this work, we propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence such as "change the hair color to black". Contrary to state-of-the-art approaches, our model requires neither a human-annotated dataset nor a textual description of all the attributes of the desired image, but only of those that have to be modified. Our proposed model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description, before generating a new image from the content and the modified attribute representation. Because text might be inherently ambiguous (blond hair may refer to different shades of blond, e.g. golden, icy, sandy), our method generates multiple stochastic versions of the same translation. Experiments show that the proposed model achieves promising performance on two large-scale public datasets: CelebA and CUB. We believe our approach will pave the way to new avenues of research combining textual and speech commands with visual attributes.

Abstract:
Indian Sign Language (ISL) is a complete language with its own grammar, syntax, vocabulary and several unique linguistic attributes. It is used by over 5 million deaf people in India. Currently, there is no publicly available dataset on ISL to evaluate Sign Language Recognition (SLR) approaches. In this work, we present the Indian Lexicon Sign Language Dataset - INCLUDE - an ISL dataset that contains 0.27 million frames across 4,287 videos over 263 word signs from 15 different word categories. INCLUDE is recorded with the help of experienced signers to provide close resemblance to natural conditions. A subset of 50 word signs is chosen across word categories to define INCLUDE-50 for rapid evaluation of SLR methods with hyperparameter tuning. As the first large scale study of SLR on ISL, we evaluate several deep neural networks combining different methods for augmentation, feature extraction, encoding and decoding. The best performing model achieves an accuracy of 94.5% on the INCLUDE-50 dataset and 85.6% on the INCLUDE dataset. This model uses a pre-trained feature extractor and encoder and only trains a decoder. We further explore generalisation by fine-tuning the decoder for an American Sign Language dataset. On the ASLLVD with 48 classes, our model has an accuracy of 92.1%; improving on existing results and providing an efficient method to support SLR for multiple languages.

Abstract:
Since real-world ubiquitous documents (e.g., invoices, tickets, resumes and leaflets) contain rich information, automatic document image understanding has become a hot topic. Most existing works decouple the problem into two separate tasks, (1) text reading for detecting and recognizing texts in images and (2) information extraction for analyzing and extracting key elements from previously extracted plain text. However, they mainly focus on improving the information extraction task, while neglecting the fact that text reading and information extraction are mutually correlated. In this paper, we propose a unified end-to-end text reading and information extraction network, where the two tasks can reinforce each other. Specifically, the multimodal visual and textual features of text reading are fused for information extraction and in turn, the semantics in information extraction contribute to the optimization of text reading. On three real-world datasets with diverse document images (from fixed layout to variable layout, from structured text to semi-structured text), our proposed method significantly outperforms the state-of-the-art methods in both efficiency and accuracy.

Abstract:
Graph Convolutional Networks (GCNs) have already demonstrated their powerful ability to model irregular data, e.g., skeletal data in human action recognition, providing an exciting new way to fuse rich structural information for nodes residing in different parts of a graph. In human action recognition, current works introduce a dynamic graph generation mechanism to better capture the underlying semantic skeleton connections and thus improve the performance. In this paper, we provide an orthogonal way to explore the underlying connections. Instead of introducing an expensive dynamic graph generation paradigm, we build a more efficient GCN on a Riemann manifold, which we think is a more suitable space to model the graph data, to make the extracted representations fit the embedding matrix. Specifically, we present a novel spatial-temporal GCN (ST-GCN) architecture which is defined via the Poincaré geometry such that it is able to better model the latent anatomy of the structure data. To further explore the optimal projection dimension in the Riemann space, we mix different dimensions on the manifold and provide an efficient way to explore the dimension for each ST-GCN layer. With the final resulting architecture, we evaluate our method on the two largest current 3D datasets, i.e., NTU RGB+D and NTU RGB+D 120. The comparison results show that the model achieves superior performance under every given evaluation metric with only 40% of the model size of the previous best GCN method, which demonstrates the effectiveness of our model.
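
For readers unfamiliar with Poincaré geometry, the following minimal sketch computes the standard geodesic distance in the Poincaré ball model. It illustrates the geometry only; how the ST-GCN layers use it is not specified in the abstract, and the function name and sample points are assumptions.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_dist / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

# Two pairs of points with the same Euclidean separation (0.1).
near_centre = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
```

The same Euclidean step costs far more distance near the boundary than near the centre. This property is what makes hyperbolic space a common choice for embedding tree-like or hierarchical structure.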

Abstract:
Although deep convolutional neural networks (CNNs) have achieved great success in computer vision tasks, their real-world application is still impeded by their voracious demand for computational resources. Current works mostly seek to compress the network by reducing its parameters or parameter-incurred computation, neglecting the influence of the input image on the system complexity. Based on the fact that input images of a CNN contain substantial redundancy, in this paper, we propose a unified framework, dubbed ThumbNet, to simultaneously accelerate and compress CNN models by enabling them to infer on one thumbnail image. We provide three effective strategies to train ThumbNet. In doing so, ThumbNet learns an inference network that performs equally well on small images as the original-input network on large images. With ThumbNet, not only do we obtain the thumbnail-input inference network that can drastically reduce computation and memory requirements, but we also obtain an image downscaler that can generate thumbnail images for generic classification tasks. Extensive experiments show the effectiveness of ThumbNet, and demonstrate that the thumbnail-input inference network learned by ThumbNet can adequately retain the accuracy of the original-input network even when the input images are downscaled 16 times.

Abstract:
Rain removal is an important but challenging computer vision task, as rain streaks can severely degrade the visibility of images and may cause other vision or multimedia tasks to fail. Previous works mainly focused on feature extraction and processing or on neural network structure. While current rain removal methods can already achieve remarkable results, training based on a single network structure without considering the cross-scale relationship may cause information drop-out. In this paper, we explore the cross-scale manner between networks and the inner-scale fusion operation to solve the image rain removal task. Specifically, to learn features with different scales, we propose a multi-sub-network structure, where the sub-networks are fused in a cross-scale manner by Gated Recurrent Units to inner-learn and make full use of information at different scales in these sub-networks. Further, we design an inner-scale connection block that utilizes the multi-scale information and the feature fusion between different scales to improve rain representation ability, and we introduce the dense block with skip connections to inner-connect these blocks. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our proposed method, which outperforms the state-of-the-art methods. The source code will be available at https://supercong94.wixsite.com/supercong94.

Abstract:
The attention mechanism has been widely applied to enhance pedestrian representation for person re-identification in videos. However, most existing methods learn the spatial and temporal attention separately, and thus ignore the correlation between them. In this work, we propose a novel Adaptive Spatio-Temporal Attention Network (ASTA-Net) to adaptively aggregate the spatial and temporal attention features into discriminative pedestrian representation for person re-identification in videos. Specifically, multiple Adaptive Spatio-Temporal Fusion modules within ASTA-Net are designed for exploring precise spatio-temporal attention on multi-level feature maps. They first obtain the preliminary spatial and temporal attention features via the spatial semantic relations for each frame and temporal dependencies among inconsecutive frames, then adaptively aggregate the preliminary attention features on the basis of their correlation. Moreover, an Adjacent-Frame Motion module is designed to explicitly extract motion patterns according to the feature-level variation among adjacent frames. Extensive experiments on the three widely-used datasets, i.e., MARS, iLIDS-VID and PRID2011, have demonstrated the effectiveness of the proposed approach.

Abstract:
How to estimate the distance between data instances is a fundamental problem in many artificial intelligence algorithms, and critical in diverse multimedia applications. A major challenge in the estimation is how to find an appropriate distance function when labeled data are insufficient for a certain task. Multi-task metric learning (MTML) is able to alleviate such data deficiency issue by learning distance metrics for multiple tasks together and sharing information between the different tasks. Recently, heterogeneous MTML (HMTML) has attracted much attention since it can handle multiple tasks with varied data representations. A major drawback of the current HMTML approaches is that only linear transformations are learned to connect different domains. This is suboptimal since the correlations between different domains may be very complex and highly nonlinear. To overcome this drawback, we propose a deep heterogeneous MTML (DHMTML) method, in which a nonlinear mapping is learned for each task by using a deep neural network. The correlations of different domains are exploited by sharing some parameters at the top layers of different networks. More importantly, the auto-encoder scheme and the adversarial learning mechanism are integrated and incorporated to help exploit the feature correlations in and between different tasks and the specific properties are preserved by learning additional task-specific layers together with the common layers. Experiments demonstrated that the proposed method outperforms single-task deep metric learning algorithms and other HMTML approaches consistently on several benchmark datasets.

Abstract:
Scene graph generation aims to produce structured representations for images, which requires understanding the relations between objects. Due to the continuous nature of deep neural networks, the prediction of scene graphs is divided into object detection and relation classification. However, the independent relation classes cannot separate the visual features well. Although some methods organize the visual features into graph structures and use message passing to learn contextual information, they still suffer from drastic intra-class variations and unbalanced data distributions. One important factor is that they learn an unstructured output space that ignores the inherent structures of scene graphs. Accordingly, in this paper, we propose a Higher Order Structure Embedded Network (HOSE-Net) to mitigate this issue. First, we propose a novel structure-aware embedding-to-classifier (SEC) module to incorporate both local and global structural information of relationships into the output space. Specifically, a set of context embeddings are learned via local graph based message passing and then mapped to a global structure based classification space. Second, since learning too many context-specific classification subspaces can suffer from data sparsity issues, we propose a hierarchical semantic aggregation (HSA) module to reduce the number of subspaces by introducing higher order structural information. HSA is also a fast and flexible tool to automatically search a semantic object hierarchy based on relational knowledge graphs. Extensive experiments show that the proposed HOSE-Net achieves state-of-the-art performance on the two popular benchmarks Visual Genome and VRD.

Abstract:
Wearable egocentric cameras are typically harnessed to a wearer's head, giving them the unique advantage of capturing their points of view. Hoshen and Peleg have shown that egocentric cameras indirectly capture the wearer's gait, which can be used to identify a wearer based on their egocentric videos. The authors have shown a wearer recognition accuracy of up to 77% over 32 subjects. However, an important limitation of their work is that such gait features can be extracted only from walking sequences of a wearer. In this work, we take the privacy threat a notch higher and show that even the wearer's hand gestures, as seen through an egocentric video, leak the wearer's identity. We have designed a model to extract and match hand gesture signatures from egocentric videos. We demonstrate the threat on the EPIC kitchen dataset containing 55 hours of egocentric videos acquired from 32 subjects doing various activities. We show that: (1) Our model can recognize a wearer with an accuracy of up to 73% based on the same activity, i.e., the model has seen the 'cut' activity by a wearer in the train set, and recognizes the wearer based on another 'cut' activity by him/her while testing. (2) The hand gesture signatures transfer across activities, i.e., even if our model does not see the 'cut' activity of a wearer at train time, but sees other activities such as 'wash', 'mix' etc., the model can still recognize a wearer with an accuracy of up to 60%, by matching hand gesture signatures of 'cut' at test time with train time signatures of 'wash' or 'mix'. (3) The hand gesture features even transfer across subjects, i.e., even if the model has not seen any activity by some subject, one can still verify a wearer (open-set), and predict that the same wearer has performed both activities with an Equal Error Rate of 15.21%. The code and trained models are available at https://egocentricbiometric.github.io/
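
The Equal Error Rate quoted for the open-set verification setting is the operating point where the false acceptance rate and the false rejection rate coincide. The following minimal sketch shows one common way to approximate it from similarity scores. The scores are made-up illustrative values, not results from the paper, and the function name is an assumption.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by trying every observed score as a threshold.

    A pair is accepted as "same wearer" when its score >= threshold.
    """
    best_gap, eer = None, None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False acceptance: impostor pairs that pass the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection: genuine pairs that fail the threshold.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2.0
    return eer

# Hypothetical similarity scores for same-wearer and different-wearer pairs.
genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.6, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine, impostor)
```

For these toy scores the two error rates meet at a threshold of 0.6, where one of four impostor pairs is accepted and one of four genuine pairs is rejected, giving an EER of 0.25.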

Abstract:
Deep convolutional neural networks have made outstanding contributions in many fields such as computer vision in the past few years, and many researchers have published well-trained networks for download. However, recent studies have raised serious concerns about integrity due to model-reuse attacks and backdoor attacks. In order to protect these open-source networks, many algorithms have been proposed, such as watermarking. However, these existing algorithms modify the contents of the network permanently and are not suitable for integrity authentication. In this paper, we propose a reversible watermarking algorithm for integrity authentication. Specifically, we present the reversible watermarking problem of deep convolutional neural networks and utilize the pruning theory of model compression technology to construct a host sequence used for embedding watermarking information by histogram shift. As shown in the experiments, the influence of embedding reversible watermarking on the classification performance is less than ±0.5% and the parameters of the model can be fully recovered after extracting the watermarking. At the same time, the integrity of the model can be verified by applying the reversible watermarking: if the model is modified illegally, the authentication information generated by the original model will differ from the extracted watermarking information.
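
The sketch below illustrates classic histogram-shift embedding on a generic integer sequence, to show why the host can be fully recovered. It is an illustration under stated assumptions, not the authors' algorithm. How the paper derives its host sequence from network parameters via pruning is not reproduced, and the host values, payload and function names are hypothetical.

```python
from collections import Counter

def embed(host, bits):
    """Embed bits into an integer host sequence by histogram shifting."""
    hist = Counter(host)
    peak = max(hist, key=hist.get)          # most frequent value carries the payload
    zero = peak + 1
    while hist.get(zero, 0) > 0:            # first empty bin to the right of the peak
        zero += 1
    if len(bits) > hist[peak]:
        raise ValueError("payload exceeds capacity (count of the peak value)")
    payload = iter(bits)
    marked = []
    for value in host:
        if peak < value < zero:
            marked.append(value + 1)                 # shift right to free bin peak + 1
        elif value == peak:
            marked.append(value + next(payload, 0))  # bit 1 -> peak + 1, bit 0 -> peak
        else:
            marked.append(value)
    return marked, (peak, zero)

def extract(marked, key, num_bits):
    """Recover the payload and restore the original host sequence exactly."""
    peak, zero = key
    bits, restored = [], []
    for value in marked:
        if value in (peak, peak + 1):
            bits.append(value - peak)
            restored.append(peak)
        elif peak + 1 < value <= zero:
            restored.append(value - 1)               # undo the shift
        else:
            restored.append(value)
    return bits[:num_bits], restored

host = [3, 5, 4, 4, 7, 4, 2, 4, 5]   # hypothetical integer host sequence
bits = [1, 0, 1, 1]                  # hypothetical watermark payload
marked, key = embed(host, bits)
recovered_bits, restored = extract(marked, key, len(bits))
```

Because only values between the peak bin and the first empty bin are moved, and each moves by exactly one, extraction can undo every change. An integrity check could then compare authentication data computed from the restored host with the extracted payload.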

Abstract:
The Retinex model is widely adopted in various low-light image enhancement tasks. The basic idea of the Retinex theory is to decompose images into reflectance and illumination. The ill-posed decomposition is usually handled by hand-crafted constraints and priors. With the recently emerging deep-learning based approaches as tools, in this paper, we integrate the idea of Retinex decomposition and semantic information awareness. Based on the observation that various objects and backgrounds have different material, reflection and perspective attributes, regions of a single low-light image may require different adjustment and enhancement regarding contrast, illumination and noise. We propose an enhancement pipeline with three parts that effectively utilizes the semantic layer information. Specifically, we extract the segmentation, reflectance and illumination layers, and concurrently enhance every separate region, i.e. sky, ground and objects for outdoor scenes. Extensive experiments on both synthetic data and real-world images demonstrate the superiority of our method over current state-of-the-art low-light enhancement algorithms.

Abstract:
Image generation models based on generative adversarial networks have recently received significant attention and can produce diverse, sharp, and realistic images. However, generating high-resolution images has long been a challenge. In this paper, we propose a progressive spatial recursive adversarial expansion model (called SpatialGAN) capable of producing high-quality samples of natural images. Our approach uses a cascade of convolutional networks to progressively generate images in a part-to-whole fashion. At each level of spatial expansion, a separate image-to-image spatial adversarial expansion network (conditional GAN) is recursively trained based on the context image generated by the previous GAN or CGAN. Unlike other coarse-to-fine generative methods that constrain the generative process either by multi-scale resolution or by hierarchical features, SpatialGAN decomposes the image space into multiple subspaces and gradually resolves uncertainties in the local-to-whole generative process. SpatialGAN greatly stabilizes and speeds up the training, which allows us to produce images of high quality. Based on the Inception Score and Fréchet Inception Distance, we demonstrate that the quality of images generated by SpatialGAN on several typical datasets is better than that of images generated by GANs without cascading and comparable with that of state-of-the-art methods with cascading.

Abstract:
Generally, adaptive bitrates for variable Internet bandwidths can be obtained through multi-pass coding. Referenceless prediction-based methods show practical benefits compared with multi-pass coding because they avoid excessive computational resource consumption, especially in low-latency circumstances. However, most of them fail to predict precisely due to the complex inner structure of modern codecs. Therefore, to improve the fidelity of prediction, we propose a referenceless prediction-based R-QP modeling (PmR-QP) method to estimate bitrate by leveraging a deep learning algorithm with only one-pass coding. It refines the global rate-control paradigm in modern codecs in terms of flexibility and applicability with as few adjustments as possible. By exploring the potential of bitstream and pixel features obtained from the prerequisite one-pass coding, it can meet the expectation of bitrate estimation in terms of precision. To be more specific, we first describe the R-QP relationship curve as a robust quadratic R-QP modeling function derived from the Cauchy-based distribution. Second, we simplify the modeling function by fixing one operational point of the relationship curve received from the coding process. Third, we learn the model parameters from bitstream and pixel features, which we name hybrid referenceless features, comprising texture information, hierarchical coding structure, and selected modes in intra-prediction. Extensive experiments demonstrate the proposed method significantly decreases the proportion of samples' bitrate estimation error within 10% by 24.60% on average over the state-of-the-art.

Abstract:
As reported by respected evaluation campaigns focusing on both automated and interactive video search approaches, deep learning has started to dominate the video retrieval area. However, the results are still not satisfactory for many types of search tasks focusing on high recall. To report on this challenging problem, we present two orthogonal task-based performance studies centered around the state-of-the-art W2VV++ query representation learning model for video retrieval. First, an ablation study is presented to investigate which components of the model are effective in two types of benchmark tasks focusing on high recall. Second, interactive search scenarios from the Video Browser Showdown are analyzed for two winning prototype systems implementing a selected variant of the model and providing additional querying and visualization components. The analysis of collected logs demonstrates that even with the state-of-the-art text search video retrieval model, it is still beneficial to integrate users into the search process for task types where high recall is essential.

Abstract:
Existing scene text detection methods achieve state-of-the-art performance by designing elaborate anchors or complex post-processing. Nonetheless, most methods still face the dilemma of detecting adjacent texts as one instance and long text with large character spacing as multiple fragments. To tackle these problems, we propose CRNet, an anchor-free scene text detector that leverages a Center-aware Representation to achieve accurate arbitrary-shaped scene text detection. Firstly, we propose a center-aware location algorithm to explicitly learn center regions and center points of text instances, which is able to separate adjacent text instances effectively. Then, a multi-scale context extraction module capable of adaptively extracting local context, long-range dependencies and global context is designed to effectively perceive long text with large character spacing. Finally, a low-level feature enhancement block is introduced to enhance the geometric information of text. Extensive experiments conducted on several benchmarks including SCUT-CTW1500, Total-Text, ICDAR2015, ICDAR2017 MLT, and MSRA-TD500 demonstrate the effectiveness of our method. Specifically, without any anchor and complicated post-processing, our CRNet achieves 84.2% and 85.1% on CTW1500 and MSRA-TD500 in F-measure, outperforming all state-of-the-art anchor-based and anchor-free methods.

Abstract:
Normal integration is a key step in dense 3D reconstruction methods such as shape-from-shading and photometric stereo. However, normal integration cannot be guaranteed between spatially unconnected normal maps, which can ultimately cause a shape deformation in surface-from-normals (SfN). For the first time, this paper presents an efficient approach to address the fundamental problem of surface reconstruction from unconnected normal maps (denoted as "SfN+") using discrete geometry. We first design a normal piece pairing metric to measure the virtually pairing quality between two unconnected normal fragments, which is used as a new constraint for the boundary vertexes during mesh deformation. We then adopt a normal connecting significance indicator to adjust the influence of virtually connected vertexes, which further improves the overall shape deformation. Finally, we model the shape reconstruction of unconnected normal maps as a light-weight energy optimization framework by jointly considering the relaxation of connecting constraints and overall reconstruction error. Experiments show that the proposed SfN+ achieves a robust and efficient performance on dense 3D surface reconstruction.

Abstract:
Recently, multimodal dialogue systems have attracted increasing attention in several domains such as retail, travel, etc. In spite of the promising performance of pioneer works, existing studies usually focus on utterance-level semantic representations with hierarchical structures, which ignore the context-aware dependencies of multimodal semantic elements, i.e., words and images. Moreover, when integrating the visual content, they only consider images of the current turn, leaving out those of previous turns as well as their ordinal information. To address these issues, we propose a Multimodal diAlogue system with semanTic Elements, MATE for short. Specifically, we unfold the multimodal inputs and devise a Multimodal Element-level Encoder to obtain the semantic representation at the element level. Besides, we take into consideration all images that might be relevant to the current turn and inject the sequential characteristics of images through position encoding. Finally, we conduct comprehensive experiments on a public multimodal dialogue dataset in the retail domain, and improve the BLEU-4 score by 9.49 and the NIST score by 1.8469 compared with state-of-the-art methods.

Abstract:
Refractive errors, such as myopia and astigmatism, can lead to severe visual impairment if not detected and corrected in time. Traditional methods of refractive error diagnosis rely on well-trained optometrists operating expensive and non-portable devices, constraining the vision screening process. Advances in smartphone cameras have enabled novel low-cost ubiquitous vision screening to detect refractive error or ametropia through eye image processing, based on the principle of photorefraction. However, contemporary smartphone-based methods rely heavily on hand-crafted features and a sufficient amount of well-labeled data. To address these challenges, this paper exploits active learning methods with a set of Convolutional Neural Network features encoding information about human eyes from a pre-trained gaze estimation model. This enables more effective training of refractive error detection models with less labeled data. Our experimental results demonstrate the encouraging effectiveness of our active learning approach. The new set of features is able to attain screening accuracy of more than 80% with mean absolute error less than 0.66, meeting the expectation of optometrists for 0.5 to 1. The proposed active learning also requires significantly fewer training samples of 18% in achieving satisfactory performance.

Abstract:
This paper presents an intelligent price suggestion system for online second-hand listings based on their uploaded images and text descriptions. The goal of price prediction is to help sellers set effective and reasonable prices for their second-hand items with the images and text descriptions uploaded to the online platforms. Specifically, we design a multi-modal price suggestion system which takes as input the extracted visual and textual features along with some statistical item features collected from the second-hand item shopping platform to determine whether the image and text of an uploaded second-hand item are qualified for reasonable price suggestion with a binary classification model, and provide price suggestions for second-hand items with qualified images and text descriptions with a regression model. To satisfy different demands, two different constraints are added into the joint training of the classification model and the regression model. Moreover, a customized loss function is designed for optimizing the regression model to provide price suggestions for second-hand items, which can not only maximize the gain of the sellers but also facilitate the online transaction. We also derive a set of metrics to better evaluate the proposed price suggestion system. Extensive experiments on a large real-world dataset demonstrate the effectiveness of the proposed multi-modal price suggestion system.

Abstract:
Both images and music can convey rich semantics and are widely used to induce specific emotions. Matching images and music with similar emotions might help to make emotion perceptions more vivid and stronger. Existing emotion-based image and music matching methods either employ limited categorical emotion states which cannot well reflect the complexity and subtlety of emotions, or train the matching model using an impractical multi-stage pipeline. In this paper, we study end-to-end matching between image and music based on emotions in the continuous valence-arousal (VA) space. First, we construct a large-scale dataset, termed Image-Music-Emotion-Matching-Net (IMEMNet), with over 140K image-music pairs. Second, we propose cross-modal deep continuous metric learning (CDCML) to learn a shared latent embedding space which preserves the cross-modal similarity relationship in the continuous matching space. Finally, we refine the embedding space by further preserving the single-modal emotion relationship in the VA spaces of both images and music. The metric learning in the embedding space and task regression in the label space are jointly optimized for both cross-modal matching and single-modal VA prediction. The extensive experiments conducted on IMEMNet demonstrate the superiority of CDCML for emotion-based image and music matching as compared to the state-of-the-art approaches.

Abstract:
In the era of big data, few-shot learning has recently received much attention in multimedia analysis and computer vision due to its appealing ability of learning from scarce labeled data. However, it has been largely underdeveloped in the video domain, which is even more challenging due to the huge spatial-temporal variability of video data. In this paper, we address few-shot video classification by learning an ensemble of SlowFast networks augmented with memory units. Specifically, we introduce a family of few-shot learners based on SlowFast networks which are used to extract informative features at multiple rates, and we incorporate a memory unit into each network to enable encoding and retrieving crucial information instantly. Furthermore, we propose a choice controller network to leverage the diversity of few-shot learners by learning to adaptively assign a confidence score to each SlowFast memory network, leading to a strong classifier for enhanced prediction. Experimental results on two widely-adopted video datasets demonstrate the effectiveness of the proposed method, as well as its superior performance over the state-of-the-art approaches.

Abstract:
Many real-world applications today like video surveillance and urban governance need to address the recognition of masked faces, where content replacement by diverse masks often brings in incomplete appearance and ambiguous representation, leading to a sharp drop in accuracy. Inspired by recent progress on amodal perception, we propose to migrate the mechanism of amodal completion for the task of masked face recognition with an end-to-end de-occlusion distillation framework, which consists of two modules. The de-occlusion module applies a generative adversarial network to perform face completion, which recovers the content under the mask and eliminates appearance ambiguity. The distillation module takes a pre-trained general face recognition model as the teacher and transfers its knowledge to train a student for completed faces using massive online synthesized face pairs. Especially, the teacher knowledge is represented with structural relations among instances in multiple orders, which serves as a posterior regularization to enable the adaptation. In this way, the knowledge can be fully distilled and transferred to identify masked faces. Experiments on synthetic and realistic datasets show the efficacy of the proposed approach.

Abstract:
The advancement of artificial intelligence and wearable computing triggers the radical innovation of cognitive applications. In this work, we propose VIMES, an augmented reality-based memory assistance system that helps recall declarative memory, such as whom the user meets and what they chat about. Through a collaborative method with 20 participants, we design VIMES, a system that runs on smartglasses, takes the first-person audio and video as input, and extracts personal profiles and event information to display on the embedded display or a smartphone. We perform an extensive evaluation with 50 participants to show the effectiveness of VIMES for memory recall. VIMES outperforms (90% memory accuracy) other traditional methods such as self-recall (34%) while offering the best memory experience (Vividness, Coherence, and Visual Perspective all score over 4/5). The user study results show that most participants find VIMES useful (3.75/5) and easy to use (3.46/5).

Abstract:
Distance metric learning (DML) is critical in many multimedia application tasks. However, it is hard to learn a satisfactory distance metric given only a few labeled samples for each task. In this paper, we propose a novel semi-supervised online multi-task DML method termed SOMTML, which enables the models describing different tasks to help each other during the metric learning procedure and thus improve their respective performance. Besides, unlabeled data are leveraged to further alleviate the data deficiency issue in different tasks by designing a novel regularization term, which also allows some prior information to be incorporated. More importantly, an efficient algorithm is developed to update the metrics of all tasks adaptively. The proposed SOMTML is experimentally validated in two popular visual analytic-based applications: handwritten digit recognition and face retrieval. We compare the proposed method with competitive single-task and multi-task metric learning approaches. Extensive experimental results demonstrate the effectiveness and efficiency of the proposed SOMTML.
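
Distance metric learning is commonly formulated as learning a positive semi-definite matrix M that defines a Mahalanobis-style distance d_M(x, y) = sqrt((x - y)^T M (x - y)). The abstract does not state how SOMTML parameterizes its metrics, so the sketch below only illustrates this generic form; the function `mahalanobis_distance` and the example matrices are hypothetical.

```python
import math


def mahalanobis_distance(x, y, metric):
    """Return sqrt((x - y)^T M (x - y)) for a square metric matrix M (nested lists)."""
    diff = [a - b for a, b in zip(x, y)]
    # Compute M @ diff, then the inner product diff . (M @ diff).
    projected = [sum(m_ij * d_j for m_ij, d_j in zip(row, diff)) for row in metric]
    squared = sum(d_i * p_i for d_i, p_i in zip(diff, projected))
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


x, y = [1.0, 2.0], [4.0, 6.0]
identity = [[1.0, 0.0], [0.0, 1.0]]    # reduces to the Euclidean distance
stretched = [[4.0, 0.0], [0.0, 1.0]]   # weights the first feature more heavily

euclidean = mahalanobis_distance(x, y, identity)
weighted = mahalanobis_distance(x, y, stretched)
```

With the identity matrix the distance is the ordinary Euclidean one; learning M amounts to choosing which feature directions should count more when comparing samples.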

Abstract:
Nowadays, millions of users use community question answering (CQA) systems to share valuable knowledge. An essential function of CQA systems is the accurate matching of answers w.r.t a given question. Recent research exhibits the superior advantages of graph neural networks (GNNs) on modeling content semantics for CQA matching. However, existing GNN-based approaches are insufficient to deal with the multi-modal and redundant properties of CQA systems. In this paper, we propose a multi-modal attentive graph pooling approach (MMAGP) to model the multi-modal content of questions and answers with GNNs in a unified framework, which explores the multi-modal and redundant properties of CQA systems. Our model converts each question/answer into a multi-modal content graph, which can preserve the relational information within multi-modal content. Specifically, to exploit the visual information, we propose an unsupervised meta-path link prediction approach to extract labels from visual content and model them into the multi-modal graph. An attentive graph pooling network is proposed to select vertices in the multi-modal content graph that are significant for the matching adaptively, and generate a pooled graph via aggregating context information for selected vertices. An interaction pooling network is designed to infer the final matching score based on the interactions between the pooled graphs of the input question and answer. Experimental results on two real-world datasets demonstrate the superior performance of MMAGP compared with other state-of-the-art CQA matching models.

Abstract:
Face-based authentication systems are among the most commonly used biometric systems, because of the ease of capturing face images at a distance and in a non-intrusive way. These systems are, however, susceptible to various presentation attacks, including printed faces, artificial masks, and makeup attacks. In this paper, we propose a novel solution to address makeup attacks, which are the hardest to detect in such systems because makeup can substantially alter the facial features of a person, including making them appear older/younger by adding/hiding wrinkles, modifying the shape of eyebrows, beard, and moustache, and changing the color of lips and cheeks. In our solution, we design a generative adversarial network for removing the makeup from face images while retaining their essential facial features and then compare the face images before and after removing makeup. We collect a large dataset of various types of makeup, especially malicious makeup that can be used to break into remote unattended security systems. This dataset is quite different from existing makeup datasets that mostly focus on cosmetic aspects. We conduct an extensive experimental study to evaluate our method and compare it against the state-of-the-art using standard objective metrics commonly used in biometric systems as well as subjective metrics collected through a user study. Our results show that the proposed solution produces high accuracy and substantially outperforms the closest works in the literature.

Abstract:
Recorded cataract surgery videos play a prominent role in training and investigating the surgery, and enhancing the surgical outcomes. Due to storage limitations in hospitals, however, the recorded cataract surgeries are deleted after a short time and this precious source of information cannot be fully utilized. Lowering the quality to reduce the required storage space is not advisable since the degraded visual quality results in the loss of relevant information that limits the usage of these videos. To address this problem, we propose a relevance-based compression technique consisting of two modules: (i) relevance detection, which uses neural networks for semantic segmentation and classification of the videos to detect relevant spatio-temporal information, and (ii) content-adaptive compression, which restricts the amount of distortion applied to the relevant content while allocating less bitrate to irrelevant content. The proposed relevance-based compression framework is implemented considering five scenarios based on the definition of relevant information from the target audience's perspective. Experimental results demonstrate the capability of the proposed approach in relevance detection. We further show that the proposed approach can achieve high compression efficiency by abstracting substantial redundant information while retaining the high quality of the relevant content.

Abstract:
Virtual Reality (VR) and Augmented Reality (AR) technologies have become popular in recent years. Encoding and transmitting the omni-directional or 360° video is critical and challenging for those applications. The 360° video requires much higher bandwidth than the traditional planar video. A premium quality 360° video with 120 frames per second (fps) and 24K resolution can easily consume bandwidth in the range of Gigabits-per-second \cite{1}. On the other hand, at any given time, a user only watches a small portion of the 360° scope within her Field-of-View (FoV). An effective way to reduce the bandwidth requirement of 360° video is through FoV-adaptive streaming, which codes and delivers the predicted FoV region at higher quality, and discards or codes at lower quality the remaining regions. Such a strategy has been quite extensively studied for video-on-demand \cite{fov_adapt_2,fov_adapt_3,1,tile_based_3,qian2016optimizing} and live video streaming applications \cite{live_1,live_2,live_3,sun2020flocking}. Interactive applications, such as conferencing, gaming, and remote collaboration, can also benefit from 360° video by creating an immersive environment for participants to interact with each other \cite{interactive_gamming,vr_conferencing,lee2015outatime}. However, realtime coding and streaming of 360° video with extremely low latency, required for interactive applications, has not been sufficiently addressed. This work focuses on developing low-latency and FoV-adaptive coding and streaming strategies for interactive 360° video streaming. We assume the sender and the receiver are connected by a network path with dynamically varying throughput without short-latency guarantee. The sender is either the video source, or a proxy server relaying the source video. The receiver is either the end user device that directly renders the video, or a local edge server that renders the video and transmits it to the end user \cite{Hou2017}.
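
The idea of FoV-adaptive streaming described above can be sketched as a per-tile quality decision. The example below is a simplified illustration under stated assumptions, not the coding strategy developed in this work: the frame is split into equal-width yaw tiles, pitch is ignored, and a tile is coded at high quality when its centre lies within half the FoV width of the predicted viewing direction. The names `angular_gap` and `assign_tile_quality` are hypothetical.

```python
def angular_gap(a_deg, b_deg):
    """Smallest absolute difference between two yaw angles, in degrees (handles wrap-around)."""
    diff = abs(a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def assign_tile_quality(num_tiles, fov_center_deg, fov_width_deg):
    """Label each vertical tile of a 360-degree frame 'high' or 'low'.

    Assumptions (hypothetical, for illustration): the frame is split into
    equal-width yaw tiles, and a tile is 'high' when its centre lies within
    half the FoV width of the predicted viewing direction.
    """
    tile_width = 360.0 / num_tiles
    qualities = []
    for index in range(num_tiles):
        tile_center = (index + 0.5) * tile_width
        inside = angular_gap(tile_center, fov_center_deg) <= fov_width_deg / 2.0
        qualities.append("high" if inside else "low")
    return qualities


# Eight 45-degree tiles, predicted FoV centred at yaw 350 with a 100-degree width.
plan = assign_tile_quality(num_tiles=8, fov_center_deg=350.0, fov_width_deg=100.0)
```

Because yaw wraps around at 360°, the first and last tiles are both selected here, which is the case a naive range check on tile indices would miss.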

Abstract:
Compactly representing the visual signals is of fundamental importance in various image/video-centered applications. Although numerous approaches were developed for improving the image and video coding performance by removing the redundancies within visual signals, much less work has been dedicated to the transformation of the visual signals to another well-established modality for better representation capability. In this paper, we propose a new scheme for visual signal representation that leverages the philosophy of transferable modality. In particular, the deep learning model, which characterizes and absorbs the statistics of the input scene with online training, could be efficiently represented in the sense of rate-utility optimization to serve as the enhancement layer in the bitstream. As such, the overall performance can be further guaranteed by optimizing the new modality incorporated. The proposed framework is implemented on the state-of-the-art video coding standard (i.e., versatile video coding), and significantly better representation capability has been observed based on extensive evaluations.

Abstract:
Referring Expression Grounding (REG) aims at localizing a particular object in an image according to a language expression. Recent REG methods have achieved promising performance, but most of them are constrained to limited object categories due to the scale of current REG datasets. In this paper, we explore REG in a new scenario, where the REG model can ground novel objects outside the REG training data. With this motivation, we propose a Concept-Context Disentangled network (CCD) which transfers concepts from auxiliary classification data with new categories while inheriting context from REG data to ground new objects. Specifically, we design a subject encoder to learn a cross-modal common semantic space, which can bridge the semantic and domain gap between auxiliary classification data and REG data. This common space guarantees that CCD can transfer and recognize novel categories. Further, we learn the correspondence between image proposals and referring expressions based on location and relationship. Benefiting from the disentangled structure, the context is relatively independent of the subject, so it can be better inherited from the REG training data. Finally, a language attention is learned to adaptively assign different importance to subject and context for grounding target objects. Experiments on four REG datasets show our method outperforms the compared approach on the new-category test datasets.

Abstract:
Representation learning of medical Knowledge Graph (KG) is an important task and forms the fundamental process for intelligent medical applications such as disease diagnosis and healthcare question answering. Therefore, many embedding models have been proposed to learn vector representations for entities and relations, but they ignore three important properties of medical KG: multi-modal, unbalanced and heterogeneous. Entities in the medical KG can carry unstructured multi-modal content, such as image and text. At the same time, the knowledge graph consists of multiple types of entities and relations, and each entity has a varying number of neighbors. In this paper, we propose a Multi-modal Multi-Relational Feature Aggregation Network (MMRFAN) for medical knowledge representation learning. To deal with the multi-modal content of the entity, we propose an adversarial feature learning model to map the textual and image information of the entity into the same vector space and learn the multi-modal common representation. To better capture the complex structure and rich semantics, we design a sampling mechanism and aggregate the neighbors with intra- and inter-relation attention. We evaluate our model on three knowledge graphs, including FB15k-237, IMDb and Symptoms-in-Chinese, with link prediction and node classification tasks. Experimental results show that our approach outperforms state-of-the-art methods.

Abstract:
Recognizing visual categories from semantic descriptions is a promising way to extend the capability of a visual classifier beyond the concepts represented in the training data (i.e. seen categories). This problem is addressed by (generalized) zero-shot learning methods (GZSL), which leverage semantic descriptions that connect them to seen categories (e.g. label embedding, attributes). Conventional GZSL are designed mostly for object recognition. In this paper we focus on zero-shot scene recognition, a more challenging setting with hundreds of categories where their differences can be subtle and often localized in certain objects or regions. Conventional GZSL representations are not rich enough to capture these local discriminative differences. Addressing these limitations, we propose a feature generation framework with two novel components: 1) multiple sources of semantic information (i.e. attributes, word embeddings and descriptions), 2) region descriptions that can enhance scene discrimination. To generate synthetic visual features we propose a two-step generative approach, where local descriptions are sampled and used as conditions to generate visual features. The generated features are then aggregated and used together with real features to train a joint classifier. In order to evaluate the proposed method, we introduce a new dataset for zero-shot scene recognition with multi-semantic annotations. Experimental results on the proposed dataset and SUN Attribute dataset illustrate the effectiveness of the proposed method.

Abstract:
Video summarization aims to select representative frames to retain high-level information, which is usually solved by predicting the segment-wise importance score via a softmax function. However, the softmax function struggles to retain high-rank representations for complex visual or sequential information, which is known as the Softmax Bottleneck problem. In this paper, we propose a novel framework named Dual Mixture Attention (DMASum) model with Meta Learning for video summarization that tackles the softmax bottleneck problem. Its Mixture of Attention layer (MoA) effectively increases the model capacity by applying self-query attention twice, which can capture second-order changes in addition to the initial query-key attention, and a novel Single Frame Meta Learning rule is then introduced to achieve better generalization to small datasets with limited training sources. Furthermore, DMASum exploits both visual and sequential attention, connecting local key-frame and global attention in an accumulative way. We adopt the new evaluation protocol on two public datasets, SumMe and TVSum. Both qualitative and quantitative experiments show significant improvements over the state-of-the-art methods.
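
The Softmax Bottleneck mentioned above can be made concrete with a small numerical check. When logits are computed as H W^T with hidden dimension d, the logit matrix has rank at most d, and the row-wise log-softmax only subtracts one normalising constant per row, so the log-probability matrix has rank at most d + 1 no matter how many rows or columns it has. The sketch below illustrates this limitation only; it is not the DMASum model, and the matrices `H` and `W` are random, hypothetical stand-ins.

```python
import math
import random


def matrix_rank(matrix, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    rows = [row[:] for row in matrix]
    rank, n_rows, n_cols = 0, len(rows), len(rows[0])
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(rows[r][col]))
        if abs(rows[pivot][col]) < tol:
            continue  # no usable pivot in this column
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, n_rows):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank


def log_softmax(logits):
    """Log-softmax of one row, shifted by the row maximum for numerical stability."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(v - shift) for v in logits))
    return [v - log_norm for v in logits]


random.seed(0)
hidden_dim, num_contexts, num_classes = 2, 8, 8
H = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_contexts)]
W = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(num_classes)]

# Logits are H @ W^T, so their rank is at most hidden_dim.
logits = [[sum(h * w for h, w in zip(h_row, w_row)) for w_row in W] for h_row in H]
log_probs = [log_softmax(row) for row in logits]

# Subtracting one normaliser per row adds at most one to the rank.
rank_of_log_probs = matrix_rank(log_probs)
```

Here an 8 x 8 matrix of log-probabilities cannot exceed rank 3, which is the sense in which a single softmax over low-dimensional features limits what distributions the model can express.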

Abstract:
Video moment retrieval aims to localize the target moment in a video according to the given sentence. The weakly-supervised setting only provides video-level sentence annotations during training. Most existing weakly-supervised methods apply a MIL-based framework to develop inter-sample confrontment, but ignore the intra-sample confrontment between moments with semantically similar contents. Thus, these methods fail to distinguish the target moment from plausible negative moments. In this paper, we propose a novel Regularized Two-Branch Proposal Network to simultaneously consider the inter-sample and intra-sample confrontments. Concretely, we first devise a language-aware filter to generate an enhanced video stream and a suppressed video stream. We then design the sharable two-branch proposal module to generate positive proposals from the enhanced stream and plausible negative proposals from the suppressed one for sufficient confrontment. Further, we apply the proposal regularization to stabilize the training process and improve model performance. Extensive experiments show the effectiveness of our method. Our code has been released.

Abstract:
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query. To tackle this task, designing an effective model to extract grounding information from both visual and textual modalities is crucial. However, most previous attempts in this field only focus on unidirectional interactions from video to query, which emphasize which words to listen to and attend to sentence information via vanilla soft attention, while clues from query-by-video interactions implying where to look are not taken into consideration. In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction. Specifically, in the iterative attention module, each word in the query is first enhanced by attending to each frame in the video through fine-grained attention, then the video iteratively attends to the integrated query. Finally, both video and query information is utilized to provide a robust cross-modal representation for further moment localization. In addition, to better predict the target segment, we propose a content-oriented localization strategy instead of applying recent anchor-based localization. We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA. FIAN significantly outperforms the state-of-the-art approaches.
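
The two attention directions described above can be sketched with plain scaled dot-product attention. This is only a minimal illustration of query words attending to frames and frames then attending to the enriched query; it assumes keys double as values and omits the fine-grained and iterative components of FIAN, whose details are not given in the abstract. The helper `attend` and the toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys):
    """Each query vector becomes an attention-weighted average of the key vectors.

    Assumption: plain scaled dot-product attention with keys doubling as values.
    """
    dim = len(keys[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        attended.append([sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)])
    return attended


# Hypothetical toy features: 2 query words and 3 video frames, each 2-dimensional.
words = [[1.0, 0.0], [0.0, 1.0]]
frames = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

video_aware_words = attend(words, frames)               # words attend to frames
query_aware_frames = attend(frames, video_aware_words)  # frames attend to the enriched query
```

Running the two passes in sequence gives every frame a representation conditioned on words that have already looked at the video, which is the bilateral interaction the abstract contrasts with one-directional attention.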

Abstract:
Automatic generation of font and text design in the wild is a challenging task since font and text in real world exhibit various visual effects. In this paper, we propose a novel model, JointFontGAN, to derive fonts, including both geometric structures and shape contents in correctness and consistency with very few font samples available. Specifically, we design an end-to-end deep learning based approach for font generation through the new multi-stream extended conditional generative adversarial network (XcGAN) models, which jointly learn and generate both font skeleton and glyph representations simultaneously. It can adapt to the geometric variability and content scalability at the neural network level. Then, we apply it, along with the developed efficient and effective one-stage model, to text generations in letters and sentences / paragraphs with both standard and artistic / handwriting styles. The extensive experiments and comparisons demonstrate that our approach outperforms the state-of-the-art methods on the collected datasets including 20K fonts (letters and punctuations) with different styles.

Abstract:
Machine learning fairness concerns biases towards certain protected or sensitive groups of people when addressing the target tasks. This paper studies the debiasing problem in the context of image classification tasks. Our data analysis on facial attribute recognition demonstrates (1) the attribution of model bias to imbalanced training data distribution and (2) the potential of adversarial examples in balancing data distribution. We are thus motivated to employ adversarial examples to augment the training data for visual debiasing. Specifically, to ensure the adversarial generalization as well as cross-task transferability, we propose to couple the operations of target task classifier training, bias task classifier training, and adversarial example generation. The generated adversarial examples supplement the target task training dataset via balancing the distribution over bias variables in an online fashion. Results on simulated and real-world debiasing experiments demonstrate the effectiveness of the proposed solution in simultaneously improving model accuracy and fairness. A preliminary experiment on few-shot learning further shows the potential of adversarial attack-based pseudo sample generation as an alternative solution to make up for the lack of training data.

Abstract:
"Draw portraits by music", an interactive work of art. Compared with music visualization and image style conversion, it's AI's imitation of human synaesthetic. New portraits gradually appear on the screen and are synchronized with music in real-time. Users select music and images as the main interactive contents, the parameters of the music are used as the dynamic expression of human emotions, and the new pixel generation process of the image is regarded as the result of emotions affecting humans.

Abstract:
"Keep Running" is a collection of human and machine generated paintings using a generative adversarial network technology. The horse artworks are produced during the lockdown period in the Middle East due to the Covid-19. Many recent AI artworks are either generated in photo-realistic style, or abstract style with distorted faces, fragmented figures and a combination of unknown objects. Besides the cultural and historic symbols that horses represent in this region, what's unique with our work is showing the possibility of using AI to create horse paintings with distinguishable features and forms, while still rendering different aesthetic and even sentimental expressions in the horse paintings. Our first artwork is a series of storytelling-like paintings of an evolving horse figure in motion with changing backgrounds. Another one is a set of different horse portrait paintings that are presented in a grid with each of them evolved and generated stylishly from the same yet repeated machine processes. Our AI artworks are not just artistic and meaningful, but also paying a salute to the early works of machine-assisted art by Eadweard Muybridge and Andy Warhol, for their influences to the art world today.

Abstract:
The traditional medical diagnosis methods of ADHD mainly rely on scale evaluation and interview observation. The diagnosis conclusion is subjective and extremely dependent on the doctor's experience level. There is an urgent need to improve diagnosis efficiency and improve the diagnosis standard through other technical means in the clinical process. We have designed and developed the ADHD intelligent auxiliary diagnosis system with software and hardware cooperation. The system performs a set of functional test tasks, uses a camera module to capture multimodal information such as facial expressions, eye movements, limb movements, language expressions and reaction abilities of children during task completion, and uses computer vision technology to automatically extract measurable characteristics. Finally, deep learning technology is used to detect children's specific behaviors in the video, which is complementary to the existing doctor's diagnosis basis. This system was deployed in the Department of Psychology of Children's Hospital of Zhejiang University in July 2019 and has been used in actual clinical diagnosis to date. It has completed the testing and evaluation of hundreds of ADHD children.

Abstract:
Existing Automatic Speech Recognition (ASR) systems usually generate the N-best hypotheses list first, and then rescore them with the language model score and the acoustic model score to find the best one. This procedure is essentially analogous to the working mechanism of modern Information Retrieval (IR) systems, which retrieve a relatively large number of relevant candidates first, re-rank them, and output the top-N list. Exploiting their commonality, this demonstration proposes a novel system named GoldenRetriever that marries IR with ASR. GoldenRetriever transforms the problem of N-best hypotheses rescoring into a Learning-to-Rescore (L2RS) problem and utilizes a wide range of features beyond the language model score and the acoustic model score. In this demonstration, the audience can experience the great potential of marrying IR with ASR for the first time. GoldenRetriever should inspire more research on transferring the state-of-the-art IR techniques to ASR.
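
The conventional rescoring step described at the start of this abstract can be sketched as follows. This is only a minimal illustration of the baseline, not GoldenRetriever's L2RS model; the `Hypothesis` class, the `lm_weight` parameter, and all score values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    am_score: float  # acoustic model log-score (made-up value)
    lm_score: float  # language model log-score (made-up value)


def rescore(nbest, lm_weight=0.5):
    """Sort hypotheses by am_score + lm_weight * lm_score, best first."""
    return sorted(
        nbest,
        key=lambda h: h.am_score + lm_weight * h.lm_score,
        reverse=True,
    )


nbest = [
    Hypothesis("recognize speech", am_score=-12.0, lm_score=-4.0),
    Hypothesis("wreck a nice beach", am_score=-11.5, lm_score=-9.0),
]

# Combined scores: -14.0 for the first hypothesis, -16.0 for the second.
best = rescore(nbest)[0]
print(best.text)  # recognize speech
```

A learning-to-rescore approach would presumably replace the fixed weighted sum with a learned ranking function over a wider feature set, as the abstract indicates.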

Abstract:
Videos become prevalent for storytellers to inspire viewers' interests. To further enhance narrations, visualizations are integrated into videos to present data-driven insights. However, manually crafting such data-driven videos is difficult and time-consuming. Thus, we present SmartShots, a system that facilitates the automatic integration of in-video visualizations. Specifically, we propose a computational framework that integrates non-verbal video clips, images, a melody, and a data table to create a video with data visualizations embedded. The system automatically translates the multi-media material into shots and then combines the shots into a compelling video. In addition, we develop a set of post-editing interactions to incorporate users' design knowledge and help them re-edit the automatically-generated videos.

Abstract:
To further enhance the immersion perception of remote interaction, avatars can be involved by harnessing Head Mounted Display (HMD) based Augmented Reality (AR). In our demonstration, we present an avatar-based remote interaction system, AvatarMeeting, enabling users to meet with remote peers through interactive personalized avatars just like face to face. Specifically, we propose a novel framework including a consumer-grade set-up, a complete transmission scheme and a processing pipeline, which consists of prescan modeling, pose detection and action reconstruction. In addition, an angle-based reconstruction approach is introduced to enable the AR avatars to smoothly perform the same actions as each remote person does in real time while keeping a good avatar shape.

Abstract:
The social media prediction task aims at predicting the popularity of content, which includes social multimedia data such as photos, videos, and news. The task can not only help make better decisions for recommendation, but also reveal the public attention from evolutionary social systems. In this paper, we propose a novel approach named curriculum learning for wide multimedia-based transformer with graph target detection (CL-WMTG). The curriculum learning is designed for the transformer to improve the efficiency of model convergence. The mechanism of the wide multimedia-based transformer is to make the model capable of learning cross information from text, pictures and other features (e.g., categories, location). Moreover, the graph target detection part can extract different features in the picture with a pretrained model and reconstruct the features with a homogeneous graph network. We achieved third place in the SMP Challenge 2020.

Abstract:
This paper explores a simple and efficient baseline for multi-class and multiple-object tracking on the VidOR dataset. The task is to build a robust object tracker that not only localizes objects with bounding boxes in every video frame but also links the bounding boxes that indicate the same object entity into a trajectory. The task's challenges are the low resolution and imbalance of the data and the disappearance of objects for a long time. Based on these characteristics, we design a robust detection model, propose a new deep metric learning method, and explore some useful tracking algorithms to help complete the video object detection task.

Abstract:
The dynamic feature extracted by the 3D convolutional network and the static feature extracted by CNN have been shown to be beneficial for video captioning. We adaptively fuse these two kinds of features in the X-Linear Attention Network Video and propose the XlanV model for video captioning. However, we notice that the dynamic feature is not compatible with vision-language pre-training techniques when the frame length distribution and average pixel difference differ between the training and test videos. Consequently, we directly train the XlanV model on the MSR-VTT dataset without pre-training on the GIF dataset in this challenge. The proposed XlanV model reaches 1st place in the pre-training for video captioning challenge, which shows that substantially exploiting the dynamic feature is more effective than vision-language pre-training in this challenge.

Abstract:
Video-based human pose estimation in crowded scenes is a challenging problem due to occlusion, motion blur, scale variation, viewpoint change, etc. Prior approaches often fail to deal with this problem because of (1) insufficient use of temporal information and (2) a lack of training data in crowded scenes. In this paper, we focus on improving human pose estimation in videos of crowded scenes from the perspectives of exploiting temporal context and collecting new data. In particular, we first follow the top-down strategy to detect persons and perform single-person pose estimation for each frame. Then, we refine the frame-based pose estimation with temporal contexts derived from optical flow. Specifically, for one frame, we propagate the historical poses forward from the previous frames and the future poses backward from the subsequent frames to the current frame, leading to stable and accurate human pose estimation in videos. In addition, we mine new data of scenes similar to the HIE dataset from the Internet to improve the diversity of the training set. In this way, our model achieves the best performance on 7 out of 13 videos and 56.33 average wAP on the test dataset of the HIE challenge.

Abstract:
Anomaly detection in the city scenario is a fundamental computer vision task and plays a critical role in city management and public safety. Although it has attracted intense attention in recent years, it remains a very challenging problem due to the complexity of the city environment, the serious imbalance between normal and abnormal samples, and the ambiguity of the concept of abnormal behavior. In this paper, we propose a modularized framework to perform general and specific anomaly detection. A video segment extraction module is first employed to obtain the candidate video segments. Then an anomaly classification network is introduced to predict the abnormal score for each category. A category-sensitive abnormal filter is concatenated after the classification model to filter the abnormal event from the candidate video clips. It is helpful to alleviate the impact of the imbalance of abnormal categories in the test phase and obtain more accurate localization results. The experimental results reveal that our framework obtains a 66.41 MF1 in the test set of the CitySCENE Challenge 2020, which ranks first in the specific anomaly detection task.

Abstract:
Detecting and recognizing human action in videos with crowded scenes is a challenging problem due to the complex environment and the diversity of events. Prior works often fail to deal with this problem in two aspects: (1) they do not utilize scene information; (2) they lack training data in crowded and complex scenes. In this paper, we focus on improving spatio-temporal action recognition by fully utilizing scene information and collecting new data. A top-down strategy is used to overcome the limitations. Specifically, we adopt a strong human detector to detect the spatial location of each person in each frame. We then apply action recognition models to learn the spatio-temporal information from video frames on both the HIE dataset and new data with diverse scenes from the internet, which can improve the generalization ability of our model. Besides, the scene information is extracted by a semantic segmentation model to assist the process. As a result, our method achieved an average 26.05 wf_mAP (ranking 1st place in the ACM MM grand challenge 2020: Human in Events).

Abstract:
The third ACM International Workshop on Multimedia Content Analysis in Sports (ACM MMSports'20) is part of the ACM International Conference on Multimedia 2020 (ACM Multimedia 2020). Exceptionally, due to the corona pandemic, the workshop is held virtually. The goal of this workshop is to bring together researchers and practitioners from academia and industry to address challenges and report progress in mining, analyzing, understanding and visualizing the multimedia/multimodal data in sports. The combination of sports and modern technology offers a novel and intriguing field of research with promising approaches for visual broadcast augmentation, understanding, statistical analysis and evaluation, and sensor fusion. There is a lack of research communities focusing on the fusion of multiple modalities. We are helping to close this research gap with this workshop series on multimedia content analysis in sports.

Abstract:
The widespread emergence and deployment of inexpensive sensors has resulted in the generation of enormous amounts of digital data in today's world. While this has expanded the possibilities of solving real world problems using computational learning frameworks, selecting the salient data samples from such huge collections of data has proved to be a significant and practical challenge. Further, to train a reliable classification model, it is important to have a large quantity of labeled training data. Manual annotation of large amounts of data is an expensive process in terms of time, labor and human expertise. This has set the stage for research in the field of active learning. Active learning algorithms automatically select the salient and exemplar instances from large quantities of unlabeled data and thereby tremendously reduce human annotation effort in training an effective classifier. It can be applied across all existing classification / regression methods and with any kind of data, thus making it a very generalizable approach. The success of active learning in several applications (such as image retrieval, image recognition) has resulted in the extension of the framework to problem settings beyond regular classification / regression. Active learning concepts have been extended to newer problem settings (such as feature selection, video summarization, matrix completion) and have also been combined with other learning paradigms such as deep learning and transfer learning. This tutorial will seek to present a comprehensive overview of active learning with a focus on multimedia computing applications, including historical perspectives, theoretical analysis and novel paradigms. The novelty of this tutorial lies in its focus on the emerging trends, algorithms and applications of active learning. It will aim at introducing concepts and open perspectives that motivate further work in this domain, ranging from fundamentals to applications and systems.
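
The selection step that active learning automates can be illustrated with entropy-based uncertainty sampling, one common textbook strategy. This is a minimal sketch and not necessarily a method covered in the tutorial; the function names, the pool of predicted probabilities, and the annotation budget are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_for_annotation(pool_probs, budget):
    """Return indices of the `budget` most uncertain unlabeled samples."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


# Made-up classifier outputs for three unlabeled samples (two classes).
pool_probs = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]

picked = select_for_annotation(pool_probs, budget=2)
print(picked)  # [1, 2]
```

Only the selected samples would be sent to a human annotator; the classifier is then retrained and the selection repeats.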

Abstract:
With the emergence of new 360-degree cameras, ambisonic microphones, and VR/AR display devices, more diverse multi-modal content has become available, and with it the demand for the capability of streaming 360-degree videos to enhance users' 360-multimedia experience on mobile devices such as mobile phones and head-mounted displays. The big issue for the mobile 360-multimedia delivery systems is the huge resource demand on the underlying networks and devices to deliver 360-multimedia content with high quality of experience. In this talk, we will discuss the research challenges of 360-degree video delivery systems such as the large bandwidth, low latency, users' disorientation, and cyber-sickness, and opportunities to solve these challenges including rate adaptation algorithms of tiles videos, view prediction algorithms, content navigation, enhancement of DASH streaming for 360-videos, and control of Quality of Experience (QoE) [1]. We will briefly dive into more details of the concept of navigation graphs for 360-degree videos and present the opportunity of navigation graphs to organize 360-video content that can help in viewing navigation, caching and improvements of QoE [2]. We will show how navigation graphs are serving as models for viewing behaviors in the temporal and spatial domains, and can assist with view predictions, bandwidth, and latency control. Our experimental results are encouraging [3] and support the intuition that if we can encapsulate viewing patterns of 360-degree videos into navigation graphs at multiple levels of contextual details, we will be able to stream "need-to-see" 360-content to wireless HMD devices in a timely manner within bandwidth-constrained environments, and enhance viewing quality experience of 360-degree videos in augmented reality applications.

Abstract:
Counting people automatically through computer vision technology is a challenging task. Recently, convolutional neural network (CNN) based methods have made significant progress. Nonetheless, large scale variations of instances caused by, for example, perspective effects remain unsolved. Moreover, it is problematic to estimate scales with only point annotations. In this paper, we propose a scale-aware probabilistic model to handle this problem. Unlike previous methods that generate a single density map where instances of various scales are processed indiscriminately, we propose a density pyramid network (DPN), where each pyramid level handles instances within a particular scale range. Furthermore, we propose a scale distribution estimator (SDE) to learn scales of people from input data, under the weak supervision of point annotations. Finally, we adopt an instance-level probabilistic scale-aware model (IPSM) to guide the multi-scale training of DPN explicitly. Qualitative and quantitative experimental results demonstrate the effectiveness of the proposed method, which achieves competitive results on four widely used benchmarks.

Abstract:
Video streaming commonly uses Dynamic Adaptive Streaming over HTTP (DASH) to deliver good Quality of Experience (QoE) to users. Videos used in DASH are predominantly encoded by single-layered video coding such as H.264/AVC. In comparison, multi-layered video coding such as H.264/SVC provides more flexibility for upgrading the quality of buffered video segments and has the potential to further improve QoE. However, there are two challenges for using SVC in DASH: (i) the complexity in designing ABR algorithms; and (ii) the negative impact of SVC's coding overhead. In this work, we propose a deep reinforcement learning method called Grad for designing ABR algorithms that take advantage of the quality upgrade mechanism of SVC. Additionally, we quantify the impact of coding overhead on the achievable QoE of SVC in DASH, and propose jump-enabled hybrid coding (HYBJ) to mitigate the impact. Through emulation, we demonstrate that Grad-HYBJ, an ABR algorithm for HYBJ learned by Grad, outperforms the best performing state-of-the-art ABR algorithm by 17% in QoE.

Abstract:
Motion-blurred images are the result of light accumulation over the period of camera exposure time, during which the camera and objects in the scene are in relative motion to each other. The inverse process of extracting an image sequence from a single motion-blurred image is an ill-posed vision problem. One key challenge is that the motions across frames are subtle, which makes it difficult for the generating networks to capture them, and thus the recovered sequences lack motion details. In order to alleviate this problem, we propose a detail-aware network with three consecutive stages to improve the reconstruction quality by addressing specific aspects in the recovery process. The detail-aware network first models the dynamics using a cycle flow loss, resolving the temporal ambiguity of the reconstruction in the first stage. Then, a GramNet is proposed in the second stage to refine subtle motion between continuous frames using Gram matrices as motion representation. Finally, we introduce a HeptaGAN in the third stage to bridge the continuous and discrete nature of exposure time and recovered frames, respectively, in order to maintain rich detail. Experiments show that the proposed detail-aware networks produce sharp image sequences with rich details and subtle motion, outperforming the state-of-the-art methods.

Abstract:
Given the massive market of advertising and the sharply increasing online multimedia content (such as videos), it is now fashionable to promote advertisements (ads) together with the multimedia content. However, manually finding relevant ads to match the provided content is labor-intensive, and hence some automatic advertising techniques have been developed. Since ads are usually hard to understand from their visual appearance alone due to the visual metaphors they contain, other modalities, such as the contained texts, should be exploited for understanding. To further improve user experience, it is necessary to understand both the ads' topic and sentiment. This motivates us to develop a novel deep multimodal multitask framework that integrates multiple modalities to achieve effective topic and sentiment prediction simultaneously for ads understanding. In particular, in our framework termed DeepM^2Ad, we first extract multimodal information from ads and learn high-level and comparable representations. The visual metaphor of the ad is decoded in an unsupervised manner. The obtained representations are then fed into the proposed hierarchical multimodal attention modules to learn task-specific representations for final prediction. A multitask loss function is also designed to jointly train both the topic and sentiment prediction models in an end-to-end manner, where bottom-layer parameters are shared to alleviate over-fitting. We conduct extensive experiments on a large-scale advertisement dataset and achieve state-of-the-art performance for both prediction tasks. The obtained results could be utilized as a benchmark for ads understanding.

Abstract:
We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed the Modality Dissonance Score (MDS). We hypothesize that manipulation of either modality will lead to disharmony between the two modalities, e.g., loss of lip-sync, unnatural facial and lip movements, etc. MDS is computed as the mean aggregate of dissimilarity scores between audio and visual segments in a video. Discriminative features are learnt for the audio and visual channels in a chunk-wise manner, employing the cross-entropy loss for individual modalities, and a contrastive loss that models inter-modality similarity. Extensive experiments on the DFDC and DeepFake-TIMIT datasets show that our approach outperforms the state-of-the-art by up to 7%. We also demonstrate temporal forgery localization, and show how our technique identifies the manipulated video segments.
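
The MDS aggregation described above can be sketched as a mean over per-chunk dissimilarities. This is a minimal illustration that assumes Euclidean distance as the dissimilarity measure and uses made-up two-dimensional features in place of learned embeddings; the function names are hypothetical, and the abstract does not specify these details.

```python
import math


def chunk_dissimilarity(audio_feat, visual_feat):
    """Euclidean distance between one audio and one visual chunk feature."""
    return math.dist(audio_feat, visual_feat)


def modality_dissonance_score(audio_chunks, visual_chunks):
    """Return (mean dissimilarity over chunks, per-chunk dissimilarities)."""
    if not audio_chunks or len(audio_chunks) != len(visual_chunks):
        raise ValueError("need the same non-zero number of chunks per modality")
    scores = [
        chunk_dissimilarity(a, v) for a, v in zip(audio_chunks, visual_chunks)
    ]
    return sum(scores) / len(scores), scores


# Made-up features: the second chunk pair disagrees, the first matches.
audio_chunks = [[0.0, 0.0], [1.0, 0.0]]
visual_chunks = [[0.0, 0.0], [1.0, 2.0]]

mds, per_chunk = modality_dissonance_score(audio_chunks, visual_chunks)
print(mds)        # 1.0
print(per_chunk)  # [0.0, 2.0]
```

Under these assumptions, a video-level decision would compare `mds` against a threshold, while the per-chunk scores point to which segments look manipulated.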

Abstract:
Recently, developing temporally consistent video-based processing techniques has drawn increasing attention due to the poor extendability of existing image-based processing algorithms (e.g., filtering, enhancement, colorization, etc.). Applying these image-based algorithms independently to each video frame typically leads to temporal flickering due to the global instability of these algorithms. In this paper, we treat enforcing temporal consistency in a video as a temporal denoising problem that removes the flickering effect in given unstable pre-processed frames. Specifically, we propose a novel model termed Temporal Denoising Mask Synthesis Network (TDMS-Net) that jointly predicts the motion mask, soft optical flow and the refining mask to synthesize the temporally consistent frames. The temporal consistency is learned from the original video and the learned temporal features are applied to reprocess the output frames that are agnostic (blind) to specific image-based processing algorithms. Experimental results on two datasets for 16 different applications demonstrate that the proposed TDMS-Net significantly outperforms two state-of-the-art blind temporal consistency approaches.

Abstract:
An important application of affective image annotation is affective image content analysis, which aims to automatically understand the emotion evoked in viewers by image contents. The so-called subjective perception issue, i.e., different viewers may have different emotional responses to the same image, makes it difficult to link image features with the expected perceived emotion. Due to the ability to learn features, recent deep learning technologies have opened a new window on affective image content analysis, which has led to a growing demand for affective image annotation technologies to build large reliable training datasets. This paper proposes a novel affective image annotation technique, AffectI, for efficiently collecting diverse and reliable emotional labels with the estimated emotion distribution for images based on the concept of Game With a Purpose (GWAP). AffectI features three novel mechanisms: a selection mechanism for ensuring that all emotion words are fairly evaluated for collecting diverse and reliable labels; an estimation mechanism for estimating the emotion distribution by aggregating partial pairwise comparisons of the emotion words for collecting the labels effectively and efficiently; and an incentive mechanism that shows the comparison between the current player and her opponents as well as all past players to promote the interest of players and also contributes to reliability and diversity. Our experimental results demonstrate that AffectI is superior to existing methods in terms of being able to collect more diverse and reliable labels. The advantage of using GWAP for reducing the frustration of evaluators was also confirmed through subjective evaluation.

Abstract:
In cloud and edge networks, federated learning involves training statistical models over decentralized data, where servers aggregate models through intermediate updates trained by clients. By utilizing private and local data, it improves the quality of personalized services and reduces users' privacy concerns. However, federated learning still leaks multimedia features through trained intermediate updates and thereby is not privacy-preserving for multimedia. Existing techniques adopted from the security community attempt to avoid multimedia feature leakage in federated learning but still cannot address the privacy issues. In this paper, we propose a privacy-preserving solution that avoids multimedia privacy leakage in federated learning. Firstly, we devise a novel encryption scheme called Non-Informative Transformation (NIT) for federated aggregation to eliminate residual multimedia features in intermediate updates. Based on the scheme, we then propose the Just-Learn-over-Ciphertext (JLoC) mechanism for federated learning, which includes three stages in each model iteration. The Encrypt stage encrypts intermediate updates at clients and makes their distribution non-informative. The Aggregate stage performs model aggregation without decryption at servers. Specifically, this stage just computes over ciphertext, and its aggregation output also remains non-informative. The Decrypt stage converts the non-informative aggregation outputs into usable parameters for the next iteration at clients. Moreover, we implement a prototype and conduct experiments to evaluate its privacy and performance on real devices. The experimental results demonstrate that our methods can defend against potential attacks for multimedia privacy leakage without accuracy loss in commercial off-the-shelf products.

Abstract:
Visible thermal person re-identification (VT-REID) is an important and challenging task in that 1) weak lighting environments are inevitably encountered in real-world settings and 2) the inter-modality discrepancy is serious. Most existing methods either aim at reducing the cross-modality gap in pixel- and feature-level or optimizing cross-modality network by metric learning techniques. However, few works have jointly considered these two aspects and studied their mutual benefits. In this paper, we design a novel framework to jointly bridge the modality gap in pixel- and feature-level without additional parameters, as well as reduce the inter- and intra-modalities variations by a center-guided metric learning constraint. Specifically, we introduce the Class-aware Modality Mix (CMM) to generate internal information of the two modalities for reducing the modality gap in pixel-level. In addition, we exploit the KL-divergence to further align modality distributions on feature-level. On the other hand, we propose an efficient Center-guided Metric Learning (CML) method for decreasing the discrepancy within the inter- and intra-modalities, by enforcing constraints on class centers and instances. Extensive experiments on two datasets show the mutual advantage of the proposed components and demonstrate the superiority of our method over the state of the art.

Abstract:
This paper proposes a new evaluation approach for video summarization algorithms. We start by studying the currently established evaluation protocol; this protocol, defined over the ground-truth annotations of the SumMe and TVSum datasets, quantifies the agreement between the user-defined and the automatically-created summaries with F-Score, and reports the average performance on a few different training/testing splits of the used dataset. We evaluate five publicly-available summarization algorithms under a large-scale experimental setting with 50 randomly-created data splits. We show that the results reported in the papers are not always congruent with their performance on the large-scale experiment, and that the F-Score cannot be used for comparing algorithms evaluated on different splits. We also show that the above shortcomings of the established evaluation protocol are due to the significantly varying levels of difficulty among the utilized splits, that affect the outcomes of the evaluations. Further analysis of these findings indicates a noticeable performance correlation among all algorithms and a random summarizer. To mitigate these shortcomings we propose an evaluation protocol that makes estimates about the difficulty of each used data split and utilizes this information during the evaluation process. Experiments involving different evaluation settings demonstrate the increased representativeness of performance results when using the proposed evaluation approach, and the increased reliability of comparisons when the examined methods have been evaluated on different data splits.
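
The F-Score agreement measure used by the established protocol can be sketched for a single pair of summaries. This is a minimal illustration that treats each summary as a set of selected frame indices; the frame indices and the per-split scores are made up, and the handling of multiple user annotations per video is left out.

```python
from statistics import mean


def f_score(auto_frames, user_frames):
    """F-Score between an automatic and a user-defined frame selection."""
    auto, user = set(auto_frames), set(user_frames)
    overlap = len(auto & user)
    if overlap == 0:
        return 0.0
    precision = overlap / len(auto)
    recall = overlap / len(user)
    return 2 * precision * recall / (precision + recall)


auto_summary = [0, 1, 2, 3]
user_summary = [2, 3, 4, 5]

# Two of four frames overlap, so precision = recall = 0.5.
score = f_score(auto_summary, user_summary)
print(score)  # 0.5

# Hypothetical scores of one algorithm on three different data splits.
split_scores = [0.40, 0.50, 0.60]
average = mean(split_scores)
spread = max(split_scores) - min(split_scores)
```

The `spread` value hints at the issue raised in the abstract: when scores vary this much across splits, two algorithms evaluated on different splits cannot be compared by their average F-Score alone.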

Abstract:
Anxiety is the most common mental problem, affecting nearly 300 million individuals worldwide. The situation has become even worse recently. In clinical practice, music therapy has been used for more than forty years because of its effectiveness and few side effects in emotion regulation. This paper proposes a novel style transfer model to generate therapeutic music according to the user's preference. It is widely recognized that favorite music greatly increases the engagement of the user and hence results in much better curative effects. But in general, users can provide only one or several favorite songs, which are insufficient for the customization of therapeutic music. To address this difficulty, a new domain adaptation algorithm that transfers the learning result for music genre classification to music personalization is designed. Targeting the joint minimization of the loss functions, three convolutional neural networks are utilized to generate the therapeutic music with only one labelled favorite song. The experiment on anxiety sufferers shows that the customized therapeutic music achieves better and more stable performance in anxiety reduction.

Abstract:
In this paper, we propose a cross-modal network architecture search (NAS) algorithm for VQA, termed k-Armed Bandit based NAS (KAB-NAS). KAB-NAS regards the design of each layer as a k-armed bandit problem and updates the preference of each candidate via numerous samplings in a single-shot search framework. To establish an effective search space, we further propose a new architecture termed Automatic Graph Attention Network (AGAN), and extend the popular self-attention layer with three graph structures, denoted as dense-graph, co-graph and separate-graph. These graph layers are used to form the direction of information propagation in the graph network, and their optimal combinations are searched by KAB-NAS. To evaluate KAB-NAS and AGAN, we conduct extensive experiments on two VQA benchmark datasets, i.e., VQA2.0 and GQA, and also test AGAN with the popular BERT-style pre-training. The experimental results show that with the help of KAB-NAS, AGAN can achieve state-of-the-art performance on both benchmark datasets with far fewer parameters and computations.

Abstract:
Visual dialogue is a challenging task that requires extracting implicit information from both visual (image) and textual (dialogue history) contexts. Classical approaches pay more attention to the integration of the current question, vision knowledge and text knowledge, while neglecting the heterogeneous semantic gaps between the cross-modal information. Meanwhile, the concatenation operation has become the de facto standard for cross-modal information fusion, although it has a limited ability in information retrieval. In this paper, we propose a novel Knowledge-Bridge Graph Network (KBGN) model that uses a graph to bridge the cross-modal semantic relations between vision and text knowledge at fine granularity, and retrieves the required knowledge via an adaptive information selection mode. Moreover, the reasoning clues for visual dialogue can be clearly drawn from intra-modal entities and inter-modal bridges. Experimental results on the VisDial v1.0 and VisDial-Q datasets demonstrate that our model outperforms existing models with state-of-the-art results.

Abstract:
Advances in deep neural networks have considerably improved the art of animating a still image without operating in the 3D domain. However, prior methods can only animate small images (typically no larger than 512x512) due to memory limitations, difficulty of training and a lack of high-resolution (HD) training datasets, which significantly reduces their potential for applications in movie production and interactive systems. Motivated by the idea that HD images can be generated by adding high-frequency residuals to low-resolution results produced by a neural network, we propose a novel framework known as Animating Through Warping (ATW) to enable efficient animation of HD images.

Abstract:
Retinal images have been widely used by clinicians for early diagnosis of ocular diseases. However, the quality of retinal images is often clinically unsatisfactory due to eye lesions and an imperfect imaging process. One of the most challenging quality degradation issues in retinal images is non-uniform illumination, which obscures pathological information and further impairs both diagnosis by ophthalmologists and computer-aided analysis. To address this issue, we propose a non-uniform illumination removal network for retinal images, called NuI-Go, which consists of three Recursive Non-local Encoder-Decoder Residual Blocks (NEDRBs) for enhancing the degraded retinal images in a progressive manner. Each NEDRB contains a feature encoder module that captures the hierarchical feature representations, a non-local context module that models the context information, and a feature decoder module that recovers the details and spatial dimension. Additionally, the symmetric skip-connections between the encoder module and the decoder module provide long-range information compensation and reuse. Extensive experiments demonstrate that the proposed method can effectively remove the non-uniform illumination on retinal images while preserving the image details and color well. We further demonstrate the advantages of the proposed method for improving the accuracy of retinal vessel segmentation.

Abstract:
An effective video classification method by means of a small number of samples is urgently needed. The deficiency of samples could be alleviated by generating samples through generative adversarial networks (GANs). However, the generation of videos in a typical category remains underexplored because the complex actions and the changeable viewpoints are difficult to simulate. Thus, applying GANs to perform video augmentation is difficult. In this study, we propose a generative data augmentation method for video classification using dynamic images. The dynamic image compresses the motion information of a video into a still image, removing the interference factors such as the background. Thus, utilizing the GANs to augment dynamic images can keep the categorical motion information and save memory compared with generating videos. To deal with the uneven quality of generated images, we propose a self-paced selection method to automatically select high-quality generated samples for training. These selected dynamic images are used to enhance the features, attain regularization, and finally achieve video augmentation. Our method is verified on two benchmark datasets, namely, HMDB51 and UCF101. Experimental results show that the method remarkably improves the accuracy of video classification under the circumstance of sample insufficiency and sample imbalance.

Abstract:
Video style transfer is a challenging task that requires not only stylizing video frames but also preserving temporal consistency among them. Many existing methods resort to optical flow for maintaining the temporal consistency in stylized videos. However, optical flow is sensitive to occlusions and rapid motions, and its training processing speed is quite slow, which makes it less practical in real-world applications. In this paper, we propose a novel fast method that explores both global and local temporal consistency for video style transfer without estimating optical flow. To preserve the temporal consistency of the entire video (i.e., global consistency), we use the structural similarity index instead of optical flow and propose a self-similarity loss to ensure the temporal structure similarity between the stylized video and the source video. Furthermore, to enhance the coherence between adjacent frames (i.e., local consistency), a self-attention mechanism is designed to attend to the previous stylized frame when synthesizing the current frame. Extensive experiments demonstrate that our method generally achieves better visual results and runs faster than the state-of-the-art methods, which validates the superiority of simultaneously preserving global and local temporal consistency for video style transfer.

Abstract:
Facial skin texture synthesis is a fundamental problem in high-quality facial image generation and enhancement. The key is how to effectively synthesize plausible textured noise for the faces. With the development of CNNs and GANs, most works cast the problem as an image-to-image translation problem. However, these methods lack an explicit mechanism to simulate the facial noise pattern, so the generated images contain obvious artifacts. To this end, we propose a new facial noise generation method. Specifically, we utilize the property of blue noise and the Gabor filter to implicitly guide the asymmetrical sampling for the face region as a guidance map, where non-uniform point sampling is conducted. Thus we propose a novel Blue-Noise Gabor Module to produce a spatially variant noisy image. Our proposed two-branch framework combines facial identity enhancement with texture detail generation to jointly produce a high-quality facial image. Experimental results demonstrate the superiority of our method compared with the state-of-the-art, which enables the generation of high-quality facial texture based on a 2D image only, without the involvement of any 3D models.

Abstract:
The deficiency of labeled training data is one of the bottlenecks in 3D hand pose estimation from monocular RGB images. Synthetic datasets have a large number of images with precise annotations, but their obvious difference with real-world datasets limits the generalization ability. Few efforts have been made to bridge the gap between the two domains in terms of their large differences. In this paper, we propose a domain adaptation method called Adaptive Wasserstein Hourglass for weakly-supervised 3D hand pose estimation to close the large gap between synthetic and real-world datasets flexibly. Adaptive Wasserstein Hourglass utilizes a feature similarity metric to identify the differences and explore the common features (e.g., hand structure) of the two datasets. Common features are drawn close adaptively during the training, whereas domain-specific features retain the differences. Learning common features helps the network in focusing on pose-related information, whereas maintaining domain-specific features reduces the optimization difficulty when closing the big gap between two domains. Extensive evaluations on two benchmark datasets demonstrate that our method succeeds in distinguishing different features and achieves optimal results when compared with state-of-the-art 3D pose estimation approaches and domain adaptation methods.

Abstract:
Gestures and fingertips are becoming more and more important mediums for human-computer interaction (HCI). Therefore, algorithms for gesture recognition and fingertip detection have been extensively investigated. However, open problems remain in how to balance speed and accuracy, and how to deal with complex interaction environments. To address these problems, this paper proposes an attention-based dual-branch network that can efficiently fulfill both fingertip detection and gesture recognition tasks. In order to deal with complex interaction environments, we combine both channel-wise attention and spatial-wise attention in the fingertip detection model. Extensive experiments demonstrate that our novel model is both effective and efficient. In the experiments, our proposed model achieves an average fingertip detection error of around 2.8 pixels in 640×480 video frames, and the average recognition accuracy among eight gestures reaches 99%. Moreover, the average forward time is about 8 ms. Due to the light-weight design, this model can also achieve high-efficiency performance on CPU. In addition, we design a virtual key system based on our proposed model, which allows users to complete the "clicking" operation naturally in a virtual environment. Our proposed system performs well with a single normal RGB camera without any pre-processing (e.g., image segmentation or contour extraction), which can significantly reduce the complexity of the interaction system.

Abstract:
Facial micro-expression (ME) recognition has attracted much attention recently. However, because MEs are spontaneous, subtle and transient, recognizing MEs is a challenging task. In this paper, first, we use transfer learning to apply learning-based video motion magnification to magnify MEs and extract the shape information, aiming to solve the problem of the low muscle movement intensity of MEs. Then, we design a novel graph-temporal convolutional network (Graph-TCN) to extract the features of the local muscle movements of MEs. First, we define a graph structure based on the facial landmarks. Second, the Graph-TCN processes the graph structure in dual channels with a TCN block. One channel is for node feature extraction, and the other is for edge feature extraction. Last, the edges and nodes are fused for classification. The Graph-TCN can automatically train the graph representation to distinguish MEs without using a hand-crafted graph representation. To the best of our knowledge, we are the first to use the learning-based video motion magnification method to extract the features of shape representations from the intermediate layer while magnifying MEs. Furthermore, we are also the first to use deep learning to automatically train the graph representation for MEs.

Abstract:
Masked face recognition (MFR) aims to match a masked face with its corresponding full face, an important task especially during the global outbreak of COVID-19. However, most existing face recognition models generalize poorly in this case, and it is hard to train a robust MFR model for two main reasons: 1) the absence of large-scale training data as well as ground-truth testing data, and 2) the presence of large intra-class variation between masked faces and full faces. To address the first challenge, this paper first contributes a new dataset denoted as MFSR, which consists of two parts. The first part contains 9,742 masked face images with mask region segmentation annotation. The second part contains 11,615 images of 1,004 identities, and each identity has masked and full face images with various orientations, lighting conditions and mask types. However, it is still not enough for training MFR models with deep learning. To obtain sufficient training data, based on the MFSR, we introduce a novel Identity Aware Mask GAN (IAMGAN) with a segmentation-guided multi-level identity preserving module to generate synthetic masked face images from full face images. In addition, to tackle the second challenge, a Domain Constrained Ranking (DCR) loss is proposed by adopting a center-based cross-domain ranking strategy. For each identity, two centers are designed, which correspond to the full face images and the masked face images respectively. The DCR loss forces the features of masked faces to move closer to their corresponding full face center, and vice versa. Experimental results on the MFSR dataset demonstrate the effectiveness of the proposed approaches.

Abstract:
Most existing RGB-D salient object detection (SOD) methods directly extract and fuse raw features from RGB and depth backbones. Such methods can be easily restricted by low-quality depth maps and redundant cross-modal features. To effectively capture multi-scale cross-modal fusion features, this paper proposes a novel Multi-stage and Multi-Scale Fusion Network (MMNet), which consists of a cross-modal multi-stage fusion module (CMFM) and a bi-directional multi-scale decoder (BMD). Similar to the mechanism of visual color stage doctrine in human visual system, the proposed CMFM aims to explore the useful and important feature representations in feature response stage, and effectively integrate them into available cross-modal fusion features in adversarial combination stage. Moreover, the proposed BMD learns the combination of cross-modal fusion features from multiple levels to capture both local and global information of salient objects and further reasonably boost the performance of the proposed method. Comprehensive experiments demonstrate that the proposed method can achieve consistently superior performance over the other 14 state-of-the-art methods on six popular RGB-D datasets when evaluated by 8 different metrics.

Abstract:
The objective of action quality assessment is to score sports videos. However, most existing works focus only on video dynamic information (i.e., motion information) but ignore the specific postures that an athlete is performing in a video, which is important for action assessment in long videos. In this work, we present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos. To learn more discriminative representations for videos, we not only learn the video dynamic information but also focus on the static postures of the detected athletes in specific frames, which represent the action quality at certain moments, along with the help of the proposed hybrid dynamic-static architecture. Moreover, we leverage a context-aware attention module consisting of a temporal instance-wise graph convolutional network unit and an attention unit for both streams to extract more robust stream features, where the former is for exploring the relations between instances and the latter for assigning a proper weight to each instance. Finally, we combine the features of the two streams to regress the final video score, supervised by ground-truth scores given by experts. Additionally, we have collected and annotated the new Rhythmic Gymnastics dataset, which contains videos of four different types of gymnastics routines, for evaluation of action quality assessment in long videos. Extensive experimental results validate the efficacy of our proposed method, which outperforms related approaches.

Abstract:
Due to the huge volume of point cloud data, storing or transmitting it is currently difficult and expensive in autonomous driving. Drawing on the High Efficiency Video Coding (HEVC) framework, we propose an advanced coding scheme for large-scale LiDAR point cloud sequences, in which several techniques have been developed to remove the spatial and temporal redundancy. The proposed strategy consists mainly of intra-coding and inter-coding. For intra-coding, we utilize a cluster-based prediction method to remove the spatial redundancy. For inter-coding, a predictive recurrent network is designed, which is capable of generating future frames according to the previously encoded frames. By calculating the residual error between the predicted and real point cloud data, the temporal redundancy can be removed. Finally, the residual data is quantized and encoded by lossless coding schemes. Experiments are conducted on the KITTI data set with four different scenes to verify the effectiveness and efficiency of the proposed method. Our approach can deal with multiple types of point cloud data, from simple to more complex, and yields better performance in terms of compression ratio compared with octree, Google Draco, MPEG TMC13 and other recently proposed methods.
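
The inter-coding step described above (predict, take the residual, quantize) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' codec: it treats a frame as a flat list of range values, uses a uniform quantizer with a hypothetical step size `step`, and leaves out the lossless entropy coding stage. The helper names are invented for this example.

```python
def quantize_residual(real, predicted, step):
    """Encoder side: integer symbols for the prediction residual."""
    return [round((r - p) / step) for r, p in zip(real, predicted)]


def reconstruct(predicted, symbols, step):
    """Decoder side: prediction plus the de-quantized residual."""
    return [p + q * step for p, q in zip(predicted, symbols)]


# Hypothetical range values (metres) for three LiDAR points.
real_frame = [10.03, 9.98, 12.40]
predicted_frame = [10.0, 10.0, 12.0]  # e.g. output of a frame predictor
step = 0.1                            # assumed quantization step

symbols = quantize_residual(real_frame, predicted_frame, step)
decoded = reconstruct(predicted_frame, symbols, step)

# With uniform quantization, the error per value is at most step / 2.
max_error = max(abs(a - b) for a, b in zip(real_frame, decoded))
print(symbols, max_error)
```

The better the prediction, the more symbols collapse to zero, which is what makes the subsequent lossless coding stage effective.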

Abstract:
We present a learning-based method for distinguishing real from deepfake multimedia content. To maximize information for learning, we extract and analyze the similarity between the audio and visual modalities within the same video. Additionally, we extract and compare affective cues corresponding to perceived emotion from the two modalities within a video to infer whether the input video is "real" or "fake". We propose a deep learning network, inspired by the Siamese network architecture and the triplet loss. To validate our model, we report the AUC metric on two large-scale deepfake detection datasets, the DeepFake-TIMIT (DF-TIMIT) dataset and DFDC. We compare our approach with several state-of-the-art deepfake detection methods and report per-video AUC of 84.4% on the DFDC and 96.6% on the DF-TIMIT datasets, respectively. To the best of our knowledge, ours is the first approach that simultaneously exploits audio and video modalities and also perceived emotions from the two modalities for deepfake detection.
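
For readers unfamiliar with the triplet loss mentioned above, the following is a minimal sketch of its standard form with Euclidean distance. The exact loss, embeddings and margin used in the paper are not given in the abstract, so the vectors and the `margin` value here are purely illustrative assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: pull the positive closer to the anchor than
    the negative by at least `margin`; zero once that is satisfied."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


# Hypothetical 2-D embeddings: e.g. a visual embedding as anchor, the
# audio embedding of a real video as positive, of a fake one as negative.
anchor = [0.0, 0.0]
positive = [0.0, 1.0]
negative = [3.0, 4.0]
print(triplet_loss(anchor, positive, negative))
```

In this example the negative is already far enough from the anchor, so the loss is zero; swapping `positive` and `negative` yields a positive penalty.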

Abstract:
Multimedia stimulation of brain activities has not only become an emerging field of intensive research, but has also led to important progress in electroencephalogram (EEG) emotion classification based on brain activities. However, how to make full use of different EEG features and the discriminative local patterns among the features for different emotions is challenging. Existing models ignore the complementarity among the spatial-spectral-temporal features and discriminative local patterns in all features, which limits the classification ability of the models to a certain extent. In this paper, we propose a novel spatial-spectral-temporal based attention 3D dense network, named SST-EmotionNet, for EEG emotion recognition. The main advantage of the SST-EmotionNet is the simultaneous integration of spatial-spectral-temporal features in a unified network framework. Meanwhile, a 3D attention mechanism is designed to adaptively explore discriminative local patterns. Extensive experiments on two real-world datasets demonstrate that the SST-EmotionNet outperforms the state-of-the-art baselines.

Abstract:
Although facial expression recognition has improved in recent years, it is still very challenging to recognize expressions from occluded facial images in the wild. Due to the lack of large-scale facial expression datasets with diversity of the type and position of occlusions, it is very difficult to learn robust occluded expression classifier directly from limited occluded images. Considering facial images without occlusions usually provide more information for facial expression recognition compared to occluded facial images, we propose a step-wise learning strategy for occluded facial expression recognition that utilizes unpaired non-occluded images as guidance in the feature and label space. Specifically, we first measure the complexity of non-occluded data using distribution density in a feature space and split data into three subsets. In this way, the occluded expression classifier can be guided by basic samples first, and subsequently leverage more meaningful and discriminative samples. Complementary adversarial learning techniques are applied in the global-level and local-level feature space throughout, forcing the distribution of the occluded features to be close to the distribution of the non-occluded features. We also take the variability of the different images' transferability into account via adaptive classification loss. Loss inequality regularization is imposed in the label space to calibrate the output values of the occluded network. Experimental results show that our method improves performance on both synthesized occluded databases and realistic occluded databases.

Abstract:
The dependence among emotions is crucial for improving emotion tagging. In this paper, we propose a novel emotion tagging method that thoroughly explores emotion relations at both the feature and label levels. Specifically, a graph convolutional network is introduced to inject local dependence among emotions into the model at the feature level, while an adversarial learning strategy is applied to constrain the joint distribution of multiple emotions at the label level. In addition, a new balanced loss function that mitigates the adverse effects of intra-class and inter-class imbalance is introduced to deal with the imbalance of emotion labels. Experimental results on several benchmark databases demonstrate the superiority of the proposed method compared to state-of-the-art works.

Abstract:
Fonts carry strong emotional and social signals, and can affect user engagement in significant ways. Hence, selecting the right font is a very important step in the design of a multimodal artifact with text. Currently, font exploration is frequently carried out via associated social tags. Users are expected to browse through thousands of fonts tagged with certain concepts to find the one that works best for their use case. In this study, we propose a new multimodal font discovery method in which users provide a reference font together with the changes they wish to obtain in order to get closer to their ideal font. This allows for efficient and goal-driven navigation of the font space, and discovery of fonts that would otherwise likely be missed. We achieve this by learning cross-modal vector representations that connect fonts and query words.

Abstract:
With the development of deep learning technologies, attribute recognition and person re-identification (re-ID) have attracted extensive attention and achieved continuous improvement via executing computing-intensive deep neural networks in cloud datacenters. However, the datacenter deployment cannot meet the real-time requirement of attribute recognition and person re-ID, due to the prohibitive delay of backhaul networks and large data transmissions from cameras to datacenters. A feasible solution thus is to employ mobile edge clouds (MEC) within the proximity of cameras and enable distributed inference.

Abstract:
By pushing computing functionalities to network edges, backhaul network bandwidth is saved and various latency requirements are met, providing support for diverse computation-intensive and delay-sensitive multimedia services. Due to the limited capabilities of edge nodes, it is very important to decide which services should be provided locally. This paper investigates the cloud-edge service offloading problem. Unlike prior works, which only give the proportion of computation offloading under a computing capacity constraint, we also take the storage space into account and determine the computing status of each service. We formulate the problem as a Markov decision process whose goal is to maximize the long-term average reduction of delay. The problem is hard to solve with traditional methods because of the extremely large action space and the lack of information about transition probabilities. Instead, this paper proposes an innovative deep reinforcement learning method to solve it. The proposed multi-update reinforcement learning algorithm introduces a novel exploration strategy and update method, which dramatically reduce the size of the action space. Extensive simulation-based testing shows that the proposed algorithm converges quickly and improves the system performance more than three alternative solutions do.

Abstract:
Visually-aware food recommendation recommends food items based on their visual features. Existing methods typically use the pre-extracted visual features from food classification models, which mainly encode the visual content with limited semantic information, such as the classes and ingredients. Therefore, such features may not cover the personalized visual preferences of users, termed collaborative information, e.g. users may attend to different colors and textures of food based on their preferred ingredients and cooking methods. To address this problem, this paper presents a heterogeneous multi-task learning framework, termed privileged-channel infused network (PiNet). It learns the visual features that contain both the semantic and collaborative information by training the image encoder to simultaneously fulfill the ingredient prediction and food recommendation tasks. However, the heterogeneity between the two tasks may lead to different visual information in need and different directions in model parameter optimization. To handle these challenges, PiNet first employs a dual-gating module (DGM) to enable the encoding and passing of different visual information from the image encoder to individual tasks. Secondly, PiNet adopts a two-phase training strategy and two prior knowledge incorporation methods to ensure an effective model training. Experimental results from two real-world datasets show that the visual features generated by PiNet better attend to the informative image regions, yielding superior performance.

Abstract:
Music genres are useful for indexing, organizing, searching, and recommending songs and albums. Therefore, the automatic classification of music genres is an essential part of almost all kinds of music applications. Recent works focus on exploiting text, audio, or multi-modal information for genre classification, without considering the influence of the artists' and listeners' preference. However, intuitively, artists have their composing preferences, and listeners also have their music tastes. Both of them provide helpful hints to the music genre from different views, which are crucial to improve classification performance.

Abstract:
Volumetric video (VV) streaming has drawn increasing interest recently with the rapid advancements in consumer VR/AR devices and the relevant multimedia and graphics research. While the resource and performance challenges in volumetric video streaming have been actively investigated by the multimedia community, the potential security and privacy concerns with this new type of multimedia have not been studied. For the first time, we identify an effective threat model that extracts 3D face models from volumetric videos and compromises face ID-based authentication. To defend against such an attack, we develop a novel volumetric video security mechanism, namely VVSec, which makes benign use of adversarial perturbations to obfuscate the security- and privacy-sensitive 3D face models. Such obfuscation ensures that the 3D models cannot be exploited to bypass deep learning-based face authentication. Meanwhile, the injected perturbations are not perceivable by the end-users, maintaining the original quality of experience in volumetric video streaming. We evaluate VVSec using two datasets, including a set of frames extracted from an empirical volumetric video and a public RGB-D face image dataset. Our evaluation results demonstrate the effectiveness of both the proposed attack and defense mechanisms in volumetric video streaming.

Abstract:
Previous studies of 360-degree video streaming for virtual reality allowed users to move their heads freely, while their position remained fixed at the camera's location. One approach to overcoming this limitation is to transmit multiview video to provide six degrees of freedom (6DoF). However, implementing a 6DoF streaming system is challenging because streaming multiple high-quality videos requires several decoders and high bandwidth. Therefore, this paper proposes a viewport-dependent high-efficiency video coding (HEVC)-compliant tiled streaming system on the test model for immersive video (TMIV), the MPEG-Immersive multiview compression reference software. This paper proposes a 6DoF viewport tile selector (VTS) for multiple 360-degree video tiled streaming. Furthermore, this paper introduces a viewport-dependent multiple-tile extractor. The proposed system detects the user's head movement, selects the tile sets that correspond to the user's viewport, extracts tile bitstreams, and generates a single bitstream. The extracted bitstream is transmitted and decoded to render the user's viewport. The proposed viewport-dependent streaming method can reduce the decoding time as well as the bandwidth. Experimental results demonstrated a 12.04% Bjøntegaard delta rate (BD-rate) saving for the luma peak signal-to-noise ratio (PSNR) compared to the TMIV anchor without tiled encoding and a 55.51% decoding time saving compared to the TMIV anchor with the existing tiled streaming method.

Abstract:
When watching videos, the occurrence of a visual event is often accompanied by an audio event, e.g., the voice of lip motion, the music of playing instruments. There is an underlying correlation between audio and visual events, which can be utilized as free supervised information to train a neural network by solving the pretext task of audio-visual synchronization. In this paper, we propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos in the wild, and further benefit downstream tasks. Specifically, we explore three different co-attention modules to focus on discriminative visual regions correlated to the sounds and introduce the interactions between them. Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods. To further evaluate the generalizability and transferability of our approach, we apply the pre-trained model on two downstream tasks, i.e., sound source localization and action recognition. Extensive experiments demonstrate that our model provides competitive results with other self-supervised methods, and also indicate that our approach can tackle the challenging scenes which contain multiple sound sources.

Abstract:
Query-based moment localization is a new task that localizes the best matched segment in an untrimmed video according to a given sentence query. In this localization task, one should pay more attention to thoroughly mining visual and linguistic information. To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph. Specifically, the joint graph consists of a Cross-Modal interaction Graph (CMG) and a Self-Modal relation Graph (SMG), where frames and words are represented as nodes, and the relations between cross- and self-modal node pairs are described by an attention mechanism. Through parametric message passing, CMG highlights relevant instances across video and sentence, and then SMG models the pairwise relation inside each modality for frame (word) correlating. With multiple layers of such a joint graph, our CSMGAN is able to effectively capture high-order interactions between two modalities, thus enabling more precise localization. Besides, to better comprehend the contextual details in the query, we develop a hierarchical sentence encoder to enhance the query understanding. Extensive experiments on four public datasets demonstrate the effectiveness of our proposed model, and CSMGAN significantly outperforms the state-of-the-art methods.

Abstract:
The recent generative model-driven Generalized Zero-shot Learning (GZSL) techniques overcome the prevailing issue of the model bias towards the seen classes by synthesizing the visual samples of the unseen classes through leveraging the corresponding semantic prototypes. Although such approaches significantly improve the GZSL performance due to data augmentation, they violate the principal assumption of GZSL regarding the unavailability of semantic information of unseen classes during training. In this work, we propose to use a generative model (GAN) for synthesizing the visual proxy samples while strictly adhering to the standard assumptions of the GZSL. The aforementioned proxy samples are generated by exploring the early training regime of the GAN. We hypothesize that such proxy samples can effectively be used to characterize the average entropy of the label distribution of the samples from the unseen classes. Further, we train a classifier on the visual samples from the seen classes and proxy samples using entropy separation criterion such that an average entropy of the label distribution is low and high, respectively, for the visual samples from the seen classes and the proxy samples. Such entropy separation criterion generalizes well during testing where the samples from the unseen classes exhibit higher entropy than the entropy of the samples from the seen classes. Subsequently, low and high entropy samples are classified using supervised learning and ZSL rather than GZSL. We show the superiority of the proposed method by experimenting on AWA1, CUB, HMDB51, and UCF101 datasets.
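
The entropy separation idea above can be illustrated with a minimal sketch. It only shows how the entropy of a predicted label distribution could route a sample to a "seen" or "unseen" branch; the function names and the threshold value are hypothetical and are not taken from the paper.

```python
import math


def label_entropy(probs):
    """Shannon entropy (in nats) of a predicted label distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route_by_entropy(probs, threshold):
    """Return 'seen' for low-entropy predictions and 'unseen' otherwise.

    The threshold is a hypothetical hyperparameter, not a value from the paper.
    """
    return "seen" if label_entropy(probs) < threshold else "unseen"


# A confident prediction has low entropy; a uniform one has maximal entropy.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(route_by_entropy(confident, threshold=0.7))
print(route_by_entropy(uncertain, threshold=0.7))
```

In this toy setting, the low-entropy sample would go to a supervised classifier over seen classes and the high-entropy sample to a ZSL classifier, mirroring the split described in the abstract.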

Abstract:
OCR-based image captioning is the task of automatically describing images based on reading and understanding written text contained in images. Compared to conventional image captioning, this task is more challenging, especially when the image contains multiple text tokens and visual objects. The difficulties originate from how to make full use of the knowledge contained in the textual entities to facilitate sentence generation and how to predict a text token based on the limited information provided by the image. Such problems are not yet fully investigated in existing research. In this paper, we present a novel design - Multimodal Attention Captioner with OCR Spatial Relationship (dubbed as MMA-SR) architecture, which manages information from different modalities with a multimodal attention network and explores spatial relationships between text tokens for OCR-based image captioning. Specifically, the representations of text tokens and objects are fed into a three-layer LSTM captioner. Different attention scores for text tokens and objects are exploited through the multimodal attention network. Based on the attended features and the LSTM states, words are selected from the common vocabulary or from the image text by incorporating the learned spatial relationships between text tokens. Extensive experiments conducted on the TextCaps dataset verify the effectiveness of the proposed MMA-SR method. More remarkably, our MMA-SR increases CIDEr-D score from 93.7% to 98.0%.

Abstract:
Video captioning is a fundamental task for visual understanding. Previous works employ end-to-end networks to learn from low-level vision features and generate descriptive captions, but they struggle to recognize fine-grained objects and lack an understanding of crucial semantic concepts. According to DPC [19], these concepts are generally present in the narrative transcripts of instructional videos. Incorporating the transcript together with the video can improve captioning performance. However, DPC directly concatenates the embedding of the transcript with video features, which cannot fuse language and vision features effectively and leads to temporal misalignment between transcript and video. This motivates us to 1) learn the semantic concepts explicitly and 2) design a temporal alignment mechanism to better align the video and transcript for the captioning task. In this paper, we start with an encoder-decoder backbone using transformer models. Firstly, we design a semantic concept prediction module as an auxiliary task to train the encoder in a supervised way. Then, we develop an attention-based cross-modality temporal alignment method that combines the sequential video frames and transcript sentences. Finally, we adopt a copy mechanism to enable the decoder (generation) module to copy important concepts directly from the source transcript. The extensive experimental results demonstrate the effectiveness of our model, which achieves state-of-the-art results on the YouCookII dataset.

Abstract:
Social media popularity estimation refers to predicting a post's popularity using multimodal contents. The prediction performance relies heavily on the feature extraction part, and fully leveraging multimodal heterogeneous data is a great challenge in practical settings. Although remarkable progress has been made, most previous attempts are restricted by the inherently limited information of the single modality they employ. Inspired by the recent success of multimodal learning, we propose a novel multimodal deep learning framework for the popularity prediction task, which aims to leverage the complementary knowledge from different modalities. Moreover, an attention mechanism is introduced in our framework, with the goal of assigning large weights to specified modalities during the training and inference phases. To empirically investigate the effectiveness and robustness of the proposed approach, we conduct extensive experiments on the 2020 SMP challenge. The obtained results show that the proposed framework outperforms related approaches.

Abstract:
This paper presents our solution to the ACM MM challenge: Large-scale Human-centric Video Analysis in Complex Events [13]; specifically, here we focus on Track3: Crowd Pose Tracking in Complex Events. Remarkable progress has been made in multi-pose training in recent years. However, how to track human poses in crowded and complex environments has not been well addressed. We decompose the problem into several subproblems. First, we use a multi-object tracking method to assign a human ID to each bounding box generated by the detection model. After that, a pose is generated for each bounding box with an ID. Finally, optical flow is used to take advantage of the temporal information in the videos and generate the final pose tracking result.

Abstract:
This paper tackles the challenging problem of multi-person articulated tracking in crowded scenes. We propose a simple yet effective top-down crowd pose tracking algorithm. The proposed method applies Cascade-RCNN for human detection and HRNet for pose estimation. Then IOU tracking and pose distance tracking are applied successively for pose tracking. We conduct extensive ablation studies on the recently released HiEve crowd pose tracking benchmark. Our final model achieves 56.98 Multi-Object Tracking Accuracy (MOTA) without model ensembling on the HiEve test set. Our team SimpleTrack won the 3rd place in the ACM MM'2020 HiEve Challenge.
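
As a rough illustration of the IOU tracking step mentioned above, the following sketch greedily associates existing tracks with new detections by box overlap. It is an assumption-laden toy, not the authors' implementation: the box format `(x1, y1, x2, y2)`, the function names, and the `min_iou` threshold are all hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_iou_match(tracks, detections, min_iou=0.3):
    """Pair track IDs with detection indices, highest IoU first.

    tracks: dict mapping integer track ID -> last known box.
    detections: list of boxes in the current frame.
    min_iou: hypothetical threshold below which no match is made.
    """
    candidates = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, tid, j in candidates:
        if score < min_iou:
            break  # remaining candidates overlap even less
        if tid in used_tracks or j in used_dets:
            continue
        matches[tid] = j
        used_tracks.add(tid)
        used_dets.add(j)
    return matches


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 0, 11, 10)]
print(greedy_iou_match(tracks, detections))
```

Detections left unmatched here would be the natural candidates for a second association stage, such as the pose distance tracking the abstract describes.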

Abstract:
Recent progress in few-shot segmentation usually aims at performing novel object segmentation using a few annotated examples as guidance. In this work, we advance this few-shot segmentation paradigm towards a more challenging yet general scenario, i.e., Generalized Few-shot Scene Parsing (GFSP). In this task, we take a fully annotated image as guidance to segment all pixels in a query image. Our mission is to study a generalizable and robust segmentation network from the meta-learning perspective so that both seen and unseen categories can be correctly recognized. Different from previous practices, this task performs segmentation on a joint label space consisting of both previously seen and novel categories. Moreover, pixels from these multiple categories need to be simultaneously taken into account, which has not been well explored before. Accordingly, we present Meta Parsing Networks (MPNet) to better exploit the guidance information in the support set. Our MPNet contains two basic modules, i.e., the Adaptive Deep Metric Learning (ADML) module and the Contrastive Inter-class Distraction (CID) module. Specifically, the ADML takes the annotated pixels from the support image as the guidance and adaptively produces high-quality prototypes for learning a deep comparison metric. In addition, MPNet further introduces the CID module, which learns to enlarge the feature discrepancy of different categories in the embedding space, leading the MPNet to generate more discriminative feature embeddings. We conduct experiments on two newly constructed benchmarks, i.e., GFSP-Cityscapes and GFSP-Pascal-Context. Extensive ablation studies demonstrate the effectiveness and generalization ability of our MPNet.

Abstract:
Learning on 3D scene-based point clouds has received extensive attention owing to its promising applications in many fields, and well-annotated and multisource datasets can catalyze the development of those data-driven approaches. To facilitate research in this area, we present a richly-annotated 3D point cloud dataset for multiple outdoor scene understanding tasks and also an effective learning framework for its hierarchical segmentation task. The dataset was generated via the photogrammetric processing of unmanned aerial vehicle (UAV) images of the National University of Singapore (NUS) campus, and has been point-wisely annotated with both hierarchical and instance-based labels. Based on it, we formulate a hierarchical learning problem for 3D point cloud segmentation and propose a measurement evaluating consistency across various hierarchies. To solve this problem, a two-stage method including multi-task (MT) learning and hierarchical ensemble (HE) with consistency consideration is proposed. Experimental results demonstrate the superiority of the proposed method and potential advantages of our hierarchical annotations. In addition, we benchmark results of semantic and instance segmentation, which are accessible online at https://3d.dataset.site with the dataset and all source codes.

Abstract:
Videos have data in multiple modalities, e.g., audio, video, text (captions). Understanding and modeling the interaction between different modalities is key for video analysis tasks like categorization, object detection, activity recognition, etc. However, data modalities are not always correlated --- so, learning when modalities are correlated and using that to guide the influence of one modality on the other is crucial. Another salient feature of videos is the coherence between successive frames due to continuity of video and audio, a property that we refer to as temporal coherence. We show how using non-linear guided cross-modal signals and temporal coherence can improve the performance of multi-modal machine learning (ML) models for video analysis tasks like categorization. Our experiments on the large-scale YouTube-8M dataset show how our approach significantly outperforms state-of-the-art multi-modal ML models for video categorization. The model trained on the YouTube-8M dataset also showed good performance on an internal dataset of video segments from actual Samsung TV Plus channels without retraining or fine-tuning, showing the generalization capabilities of our model.

Abstract:
Through much exploration in the past decade, emotion analysis in conversations has mainly been conducted in the textual scenario. Nowadays, with the popularization of speech and video communication, academia and industry have gradually become aware of the need for the multimodal scenario. Therefore, emotion detection in conversations is becoming an increasingly hot topic not only in the natural language processing (NLP) community but also in the multimodal analysis community. Although previous studies normally argue that the emotion of the current utterance in a conversation is much influenced by the content of historical utterances, their speakers and emotions, they model the influence derived from the history to the current utterance at the same granularity (intra-modal influence). Intuitively, the clues for emotion detection may not exist in the history of the same modality as the current utterance, but in the history of other modalities (inter-modal influence). Besides, previous studies normally model the information propagation as following the conversation flow. Intuitively, bidirectional modeling of information propagation in conversations provides rich clues for emotion detection. Therefore, this paper proposes a bidirectional dynamic dual influence network for real-time emotion detection in conversations, which can simultaneously model both intra- and inter-modal influence with bidirectional information propagation for the current utterance and its historical utterances. Detailed experiments demonstrate that our approach substantially advances the state of the art.

Abstract:
The non-Euclidean geometry characteristic poses a challenge to the saliency prediction for 360-degree images. Since spherical data cannot be projected onto a single plane without distortion, existing saliency prediction methods based on traditional CNNs are inefficient. In this paper, we propose a saliency prediction framework for 360-degree images based on graph convolutional networks (SalGCN), which directly applies to the spherical graph signals. Specifically, we adopt the Geodesic ICOsahedral Pixelation (GICOPix) to construct a spherical graph signal from a spherical image in equirectangular projection (ERP) format. We then propose a graph saliency prediction network to directly extract the spherical features and generate the spherical graph saliency map, where we design an unpooling method suitable for spherical graph signals based on linear interpolation. The network training process is realized by modeling the node regression problem of the input and output spherical graph signals, where we further design a Kullback-Leibler (KL) divergence loss with sparse consistency to make the sparseness of the saliency map closer to the ground truth. Eventually, to obtain the ERP format saliency map for evaluation, we further propose a spherical crown-based (SCB) interpolation method to convert the output spherical graph saliency map into a saliency map in ERP format. Experiments show that our SalGCN can achieve comparable or even better saliency prediction performance both subjectively and objectively, with a much lower computation complexity.
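
For readers unfamiliar with the KL divergence used as a saliency loss above, here is a minimal sketch of the plain KL term between a ground-truth and a predicted saliency map, each flattened to a list of non-negative node values. It deliberately omits the sparse consistency component, whose exact form is not given in the abstract; the function name and the `eps` constant are assumptions.

```python
import math


def kl_divergence(target, pred, eps=1e-8):
    """KL(target || pred) after normalizing both maps to sum to 1.

    target, pred: non-negative saliency values, one per graph node.
    eps: small constant (an assumption) to avoid division by zero.
    """
    total_t = sum(target)
    total_p = sum(pred)
    kl = 0.0
    for t, p in zip(target, pred):
        t_n = t / total_t
        p_n = p / total_p
        if t_n > 0:  # terms with zero target mass contribute nothing
            kl += t_n * math.log(t_n / (p_n + eps))
    return kl


ground_truth = [0.0, 0.0, 1.0]
uniform_pred = [1.0, 1.0, 1.0]
print(kl_divergence(ground_truth, ground_truth))
print(kl_divergence(ground_truth, uniform_pred))
```

The loss is near zero when the prediction matches the ground truth and grows as predicted saliency mass is spread over nodes the ground truth leaves empty.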

Abstract:
Interpretability has become an essential topic as deep learning is widely applied in professional fields (e.g., medical image processing) where a high level of accountability is required. Existing explanation methods mainly focus on computing the importance of low-level pixels or segments, rather than high-level concepts. Concepts are of paramount importance for humans to understand and make decisions, especially for fine-grained tasks. In this paper, we focus on the real application problem of classifying infectious keratitis and propose a visual concept mining (VCM) method to explain the fine-grained infectious keratitis images. Based on our discovered explainable visual concepts, we further propose a visual concept enhanced framework for infectious keratitis classification. Extensive empirical experiments demonstrate that (i) our discovered visual concepts are highly coherent with the physicians' understanding and interpretation, and (ii) our visual concept enhanced model achieves significant improvement on the performance of infectious keratitis classification.

Abstract:
2D image-based 3D shape retrieval (2D-to-3D) investigates the problem of matching the relevant 3D shapes from a gallery dataset given a query image. Recently, adversarial training and environmental style transfer learning have been successfully applied to this task and achieved state-of-the-art performance. However, two problems remain. First, previous works only concentrate on the connection between the label and representation, paying less attention to the unique visual characteristics of each instance. Second, the confused features or the transformed images can only fool the discriminator but cannot guarantee semantic consistency. In other words, features of a 2D desk may be mapped near the features of a 3D chair. In this paper, we propose a novel semantic consistency guided instance feature alignment network (SC-IFA) to address these limitations. SC-IFA mainly consists of two parts, instance visual feature extraction and cross-domain instance feature adaptation. For the first module, unlike previous methods, which merely employ a 2D CNN to extract the feature, we additionally maximize the mutual information between the input and feature to enhance the capability of feature representation for each instance. For the second module, we first introduce the margin disparity discrepancy model to mix up the cross-domain features in an adversarial training way. Then, we design two feature translators to transform the feature from one domain to another domain, and impose the translation loss and correlation loss on the transformed features to preserve the semantic consistency. Extensive experimental results on two benchmarks, MI3DOR and MI3DOR-2, verify that SC-IFA is superior to the state-of-the-art methods.

Abstract:
Multi-target multi-camera tracking (MTMCT), i.e., tracking multiple targets across multiple cameras, is a crucial technique for smart city applications. In this paper, we propose an effective and reliable MTMCT framework for vehicles, which consists of a traffic-aware single camera tracking (TSCT) algorithm, a trajectory-based camera link model (CLM) for vehicle re-identification (ReID), and a hierarchical clustering algorithm to obtain the cross camera vehicle trajectories. First, the TSCT, which jointly considers vehicle appearance, geometric features, and some common traffic scenarios, is proposed to track the vehicles in each camera separately. Second, the trajectory-based CLM is adopted to facilitate the relationship between each pair of adjacently connected cameras and add spatio-temporal constraints for the subsequent vehicle ReID with temporal attention. Third, the hierarchical clustering algorithm is used to merge the vehicle trajectories among all the cameras to obtain the final MTMCT results. Our proposed MTMCT is evaluated on the CityFlow dataset and achieves a new state-of-the-art performance with IDF1 of 74.93%.

Abstract:
A great number of deep learning based models have been recently proposed for automatic music composition. Among these models, the Transformer stands out as a prominent approach for generating expressive classical piano performance with a coherent structure of up to one minute. The model is powerful in that it learns abstractions of data on its own, without much human-imposed domain knowledge or constraints. In contrast with this general approach, this paper shows that Transformers can do even better for music modeling when we improve the way a musical score is converted into the data fed to a Transformer model. In particular, we seek to impose a metrical structure in the input data, so that Transformers can be more easily aware of the beat-bar-phrase hierarchical structure in music. The new data representation maintains the flexibility of local tempo changes, and provides handles to control the rhythmic and harmonic structure of music. With this approach, we build a Pop Music Transformer that composes Pop piano music with better rhythmic structure than existing Transformer models.

Abstract:
3D Semantic-Instance Segmentation (SIS) is a newly emerging research direction that aims to understand the visual information of a 3D scene on both the semantic and instance levels. The main difficulty lies in how to reconcile the conflict between mutual aid and the sub-optimal problem. Previous methods usually address the mutual aid between instances and semantics by direct feature fusion or hand-crafted constraints to share the common knowledge of the two tasks. However, they neglect the abundant common knowledge of feature context in the feature space. Moreover, direct feature fusion can raise the sub-optimal problem, since a false prediction of an instance object can interfere with the prediction of the semantic segmentation and vice versa. To address the above two issues, we propose a novel network of feature context fusion for the SIS task, named CF-SIS. The idea is to associatively learn semantic and instance segmentation of 3D point clouds by context fusion with attention in the feature space. Our main contributions are two context fusion modules. First, we propose a novel inter-task context fusion module to take full advantage of mutual aid and relieve the sub-optimal problem. It extracts the context in feature space from one task with attention, and selectively fuses the context into the other task using a gate fusion mechanism. Then, in order to enhance the mutual aid effect, the intra-task context fusion module is designed to further integrate the fused context, by selectively merging similar features through the self-attention mechanism. We conduct experiments on the S3DIS and ShapeNet datasets and show that CF-SIS outperforms the state-of-the-art methods on the semantic and instance segmentation tasks.

Abstract:
State-of-the-art fully-supervised methods for temporal action localization from untrimmed videos have achieved impressive results. Yet, results remain unsatisfactory for weakly-supervised temporal action localization, where only video-level action labels are given without timestamp annotations of when the actions occur. The main reason is that weakly-supervised networks only focus on the highly discriminative frames, but there are some ambiguous frames in both background and action classes. The ambiguous frames in the background class are very similar to real actions, so they may be treated as target actions and result in false positives. On the other hand, the ambiguous frames in the action class, which possibly contain action instances, are prone to becoming false negatives of the weakly-supervised networks and result in a coarse localization. To solve these problems, we introduce a novel weakly-supervised Action Completeness Modeling with Background Aware Networks (ACM-BANets). Our Background Aware Network (BANet) contains a weight-sharing two-branch architecture, with an action guided Background aware Temporal Attention Module (B-TAM) and an asymmetrical training strategy, to suppress both highly discriminative and ambiguous background frames and thereby remove the false positives. Our action completeness modeling contains multiple BANets, and the BANets are forced to discover different but complementary action instances to completely localize the action instances in both highly discriminative and ambiguous action frames. In the i-th iteration, the i-th BANet discovers the discriminative features, which are then erased from the feature map. The partially-erased feature map is fed into the (i+1)-th BANet of the next iteration to force this BANet to discover discriminative features different from the i-th BANet. Evaluated on two challenging untrimmed video datasets, THUMOS14 and ActivityNet1.3, our approach outperforms all the current weakly-supervised methods for temporal action localization.
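
The iterative erase-and-rediscover loop described above can be sketched in a heavily simplified form. In this toy version, fixed per-frame scores stand in for the output of each BANet, which in the paper is a trained network; the function name, `num_stages`, and `keep_ratio` are hypothetical and chosen only for illustration.

```python
def iterative_erase(scores, num_stages=3, keep_ratio=0.5):
    """Toy erase-and-rediscover over per-frame scores.

    At each stage, frames scoring at least keep_ratio times the current
    maximum are marked as discovered and then erased (set to zero), so the
    next stage must look at different frames. Returns one list of frame
    indices per stage.
    """
    remaining = list(scores)
    discovered = []
    for _ in range(num_stages):
        peak = max(remaining)
        if peak <= 0:
            break  # nothing left to discover
        stage = [i for i, s in enumerate(remaining) if s >= keep_ratio * peak]
        discovered.append(stage)
        for i in stage:
            remaining[i] = 0.0  # erase so later stages cannot reuse these frames
    return discovered


frame_scores = [0.1, 0.9, 0.8, 0.3, 0.35, 0.02]
print(iterative_erase(frame_scores, num_stages=2))
```

The first stage picks the most discriminative frames and the second stage picks weaker, previously overshadowed ones, which is the intuition behind localizing complete action instances.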

Abstract:
Human pose estimation has been widely studied with much focus on supervised learning requiring sufficient annotations. However, in real applications, a pretrained pose estimation model usually needs to be adapted to a novel domain with no labels or sparse labels. Such domain adaptation for 2D pose estimation has not been explored. The main reason is that a pose, by nature, has a typical topological structure and needs fine-grained features at local keypoints, whereas existing adaptation methods do not consider the topological structure of the object of interest and align whole images coarsely. Therefore, we propose a novel domain adaptation method for multi-person pose estimation to conduct human-level topological structure alignment and fine-grained feature alignment. Our method consists of three modules: Cross-Attentive Feature Alignment (CAFA), Intra-domain Structure Adaptation (ISA) and Inter-domain Human-Topology Alignment (IHTA). The CAFA adopts a bidirectional spatial attention module (BSAM) that focuses on fine-grained local feature correlation between two humans to adaptively aggregate consistent features for adaptation. We adopt ISA only in semi-supervised domain adaptation (SSDA) to exploit the corresponding keypoint semantic relationship for reducing the intra-domain bias. Most importantly, we propose an IHTA to learn a more domain-invariant human topological representation for reducing the inter-domain discrepancy. We model the human topological structure via a graph convolution network (GCN); passing messages on this graph allows high-order relations to be considered. This structure-preserving alignment based on GCN is beneficial to the inference of occluded or extreme poses. Extensive experiments are conducted on two popular benchmarks and results demonstrate the competency of our method compared with existing supervised approaches.

Abstract:
Compared to a single fixed camera, multiple moving cameras, e.g., those worn by people, can better capture the human interactive and group activities in a scene, by providing multiple, flexible and possibly complementary views of the involved people. In this setting the actual promotion of activity detection is highly dependent on the effective correlation and collaborative analysis of multiple videos taken by different wearable cameras, which is highly challenging given the time-varying view differences across different cameras and mutual occlusion of people in each video. By focusing on two wearable cameras and the interactive activities that involve only two people, in this paper we develop a new approach that can simultaneously: (i) identify the same persons across the two videos, (ii) detect the interactive activities of interest, including their occurrence intervals and involved people, and (iii) recognize the category of each interactive activity. Specifically, we represent each video by a graph, with detected persons as nodes, and propose a unified Graph Neural Network (GNN) based framework to jointly solve the above three problems. A graph matching network is developed for identifying the same persons across the two videos and a graph inference network is then used for detecting the human interactions. We also build a new video dataset, which provides a benchmark for this study, and conduct extensive experiments to validate the effectiveness and superiority of the proposed method.

Abstract:
Current work of facial landmark tracking usually requires large amounts of fully annotated facial videos to train a landmark tracker. To relieve the burden of manual annotations, we propose a novel facial landmark tracking method that makes full use of unlabeled facial videos by exploiting both self-supervised and semi-supervised learning mechanisms. First, self-supervised learning is adopted for representation learning from unlabeled facial videos. Specifically, a facial video and its shuffled version are fed into a feature encoder and a classifier. The feature encoder is used to learn visual representations, and the classifier distinguishes the input videos as the original or the shuffled ones. The feature encoder and the classifier are trained jointly. Through self-supervised learning, the spatial and temporal patterns of a facial video are captured at representation level. After that, the facial landmark tracker, consisting of the pre-trained feature encoder and a regressor, is trained semi-supervisedly. The consistencies among the tracking results of the original, the inverse and the disturbed facial sequences are exploited as the constraints on the unlabeled facial videos, and the supervised loss is adopted for the labeled videos. Through semi-supervised end-to-end training, the tracker captures sequential patterns inherent in facial videos despite small amount of manual annotations. Experiments on two benchmark datasets show that the proposed framework outperforms state-of-the-art semi-supervised facial landmark tracking methods, and also achieves advanced performance compared to fully supervised facial landmark tracking methods.

Abstract:
Person re-identification (ReID) has recently received extensive research interests due to its diverse applications in multimedia analysis and computer vision. However, the majority of existing works focus on improving matching accuracy, while ignoring matching efficiency. In this work, we present a novel binary representation learning framework for efficient person ReID, namely Deep Local Binary Coding (DLBC). Different from existing deep binary ReID approaches, DLBC attempts to learn discriminative binary codes by explicitly interacting with local visual details. Specifically, DLBC first extracts a set of local features from spatially salient regions of pedestrian images. Subsequently, DLBC formulates a new binary-local semantic mutual information (BSMI) maximization term, based on which a self-lifting (SL) block is built to further exploit the semantic importance of local features. The BSMI term together with the SL block simultaneously enhances the dependency of binary codes on selected local features as well as their robustness to cross-view visual inconsistency. In addition, an efficient optimizing method is developed to train the proposed deep models with orthogonal and binary constraints. Extensive experiments reveal that DLBC significantly minimizes the accuracy gap between binary ReID methods and the state-of-the-art real-valued ones, whilst remarkably reducing query time and memory cost.

Abstract:
Significant progress has been made in semantic segmentation by deep neural networks, most of which concentrate on discriminative representation learning. However, model performances suffer from deterioration when the training process is optimized without awareness of data imperfections (e.g., data imbalance and label noise). In contrast to previous works, we present a novel model-agnostic training optimization algorithm which has two prominent components: Domain Division and Domain Generalization. Rather than sampling all pixels uniformly, an uncertainty-based Domain Division method is proposed to deal with data imbalance, which dynamically decomposes the pixels into meta-train and meta-test domains according to whether they lie near the classification boundary. The meta-train domain corresponds to highly-uncertain but more informative pixels and determines the current main update direction. Furthermore, to alleviate the degradation caused by label noise, we propose a Domain Generalization technique with a meta-optimization objective which ensures that update on the meta-train domain should generalize to the meta-test domain. Comprehensive experimental results on three public benchmarks across multi-modalities show that the proposed optimization algorithm is superior to other segmentation optimization methods and significantly outperforms conventional methods without introducing additional model parameters.

Abstract:
Music-to-visual style transfer is a challenging yet important cross-modal learning problem in the practice of creativity. Its major difference from the traditional image style transfer problem is that the style information is provided by music rather than images. Assuming that musical features can be properly mapped to visual contents through semantic links between the two domains, we solve the music-to-visual style transfer problem in two steps: music visualization and style transfer. The music visualization network utilizes an encoder-generator architecture with a conditional generative adversarial network to generate image-based music representations from music data. This network is integrated with an image style transfer method to accomplish the style transfer process. Experiments are conducted on WikiArt-IMSLP, a newly compiled dataset including Western music recordings and paintings listed by decades. By utilizing such a label to learn the semantic connection between paintings and music, we demonstrate that the proposed framework can generate diverse image style representations from a music piece, and these representations can unveil certain art forms of the same era. Subjective testing results also emphasize the role of the era label in improving the perceptual quality on the compatibility between music and visual content.

Abstract:
The past decade has witnessed the explosive growth of faces in video multimedia systems, e.g., videoconferencing and live shows. However, these videos are normally compressed at low bit-rates because of their high bandwidth demands, leading to heavy quality degradation in face regions. This paper addresses the problem of face quality enhancement in compressed videos. Specifically, we establish a compressed face video (CFV) database, which includes 87,607 faces in 113 raw video sequences and their corresponding 904 compressed sequences. We find that the faces in compressed videos exhibit tremendous scale variation and quality fluctuation. Motivated by scalable video coding, we propose a multi-scale recurrent scalable network (MRS-Net) to enhance the quality of multi-scale faces in compressed videos. The MRS-Net comprises one base level and two refined enhancement levels, corresponding to the quality enhancement of small-, medium- and large-scale faces, respectively. In the multi-level architecture of our MRS-Net, small-/medium-scale face quality enhancement serves as the basis for facilitating the quality enhancement of medium-/large-scale faces. Finally, experimental results show that our MRS-Net method is effective in enhancing the quality of multi-scale faces in compressed videos, significantly outperforming other state-of-the-art methods.

Abstract:
Zero-shot learning (ZSL) is commonly used to address the very pervasive problem of predicting unseen classes in fine-grained image classification and other tasks. One family of solutions is to learn synthesised unseen visual samples produced by generative models from auxiliary semantic information, such as natural language descriptions. However, for most of these models, performance suffers from noise in the form of irrelevant image backgrounds. Further, most methods do not allocate a calculated weight to each semantic patch. Yet, in the real world, the discriminative power of features can be quantified and directly leveraged to improve accuracy and reduce computational complexity. To address these issues, we propose a novel framework called multi-patch generative adversarial nets (MPGAN) that synthesises local patch features and labels unseen classes with a novel weighted voting strategy. The process begins by generating discriminative visual features from noisy text descriptions for a set of predefined local patches using multiple specialist generative models. The features synthesised from each patch for unseen classes are then used to construct an ensemble of diverse supervised classifiers, each corresponding to one local patch. A voting strategy averages the probability distributions output from the classifiers and, given that some patches are more discriminative than others, a discrimination-based attention mechanism helps to weight each patch accordingly. Extensive experiments show that MPGAN has significantly greater accuracy than state-of-the-art methods.
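
The weighted voting step described above can be sketched as a weighted average of per-patch class distributions. The weights here are hand-picked placeholders; in MPGAN they would come from the discrimination-based attention mechanism, whose exact form the abstract does not give.

```python
from typing import List, Sequence


def weighted_vote(
    patch_probs: Sequence[Sequence[float]], patch_weights: Sequence[float]
) -> List[float]:
    """Fuse per-patch class distributions with normalized patch weights."""
    total_weight = sum(patch_weights)
    num_classes = len(patch_probs[0])
    fused = [0.0] * num_classes
    for probs, weight in zip(patch_probs, patch_weights):
        for class_index in range(num_classes):
            fused[class_index] += (weight / total_weight) * probs[class_index]
    return fused


# Three hypothetical patch classifiers voting over three unseen classes.
patch_probs = [
    [0.6, 0.3, 0.1],  # classifier for patch 0
    [0.2, 0.5, 0.3],  # classifier for patch 1
    [0.1, 0.7, 0.2],  # classifier for patch 2
]
patch_weights = [2.0, 1.0, 1.0]  # assumed attention scores, not learned here
fused = weighted_vote(patch_probs, patch_weights)
predicted_class = max(range(len(fused)), key=fused.__getitem__)
```

With equal weights this reduces to the plain averaging the abstract mentions; unequal weights let more discriminative patches dominate the vote.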

Abstract:
In recent years, the development of devices for acquisition and rendering of 3D contents has facilitated the diffusion of immersive virtual reality experiences. In particular, the point cloud representation has emerged as a popular format for volumetric photorealistic reconstructions of dynamic real-world objects, due to its simplicity and versatility. To optimize the delivery of the large amount of data needed to provide these experiences, adaptive streaming over HTTP is a promising solution. In order to ensure the best quality of experience within the bandwidth constraints, adaptive streaming is combined with tiling to optimize the quality of what is being visualized by the user at a given moment; as such, it has been successfully used in the past for omnidirectional contents. However, its adoption in the point cloud streaming scenario has only been studied to optimize multi-object delivery. In this paper, we present a low-complexity tiling approach to perform adaptive streaming of point cloud content. Tiles are defined by segmenting each point cloud object into several parts, which are then independently encoded. In order to evaluate the approach, we first collect real navigation paths, obtained through a user study in 6 degrees of freedom with 26 participants. The variation in movements and interaction behaviour among users indicates that a user-centered adaptive delivery could lead to appreciable gains in terms of perceived quality. Evaluation of the performance of the proposed tiling approach against state-of-the-art solutions for point cloud compression, performed on the collected navigation paths, confirms that considerable gains can be obtained by exploiting user-adaptive streaming, achieving bitrate gains up to 57% with respect to a non-adaptive approach with the same codec. Moreover, we demonstrate that the selection of navigation data has an impact on the relative objective scores.

Abstract:
In the fashion industry where social media has a growing presence, it is increasingly important to find the emergence of people's new tastes in the early stage based on the photos posted there. However, the amount of photos posted on fashion social media is so large that it is almost impossible for people to examine them manually. Also, previous studies on image analysis in social media focus only on individual items for trend detection. Therefore, in this research, we propose a novel framework for capturing changes in people's tastes in terms of coordination rather than individual items. In the framework, we apply Emerging Topic Detection (ETD) to multiple meta-data of images automatically extracted by deep learning. In ETD, new topics which did not exist previously are detected by comparing multiple time windows. To better capture the nature of fashion topics, we employ a clustering method MULIC as a topic detection method, which is density-based, centroid-based, and designed for categorical data. Our experiments with real-world data, in terms of method stability, qualitative evaluation of the output, and experts review, confirmed that the Emerging Topics were properly captured.

Abstract:
Zero-shot image segmentation refers to the task of segmenting pixels from a specific unseen semantic class. Previous methods mainly rely on historic segmentation tasks, such as using semantic embedding or word embedding of class names to infer a new segmentation model. In this work we describe Cap2Seg, a novel solution for zero-shot image segmentation that harnesses accompanying image captions to infer spatial and semantic context for the zero-shot image segmentation task. As our main insight, image captions often implicitly entail the occurrence of a new class in an image and its most-confident spatial distribution. We define a contextual entailment question (CEQ) that tailors BERT-like text models. Specifically, the proposed network for inferring unseen classes consists of three branches (global / local / semi-global), which infer labels of unseen classes at the image level, image-stripe level or pixel level, respectively. Comprehensive experiments and ablation studies are conducted on two image benchmarks, COCO-stuff and Pascal VOC. All results clearly demonstrate the effectiveness of the proposed Cap2Seg, including on a set of the hardest unseen classes (i.e., image captions do not literally contain the class names and direct matching for inference fails).

Abstract:
Image captioning has attracted extensive research interest in recent years. Due to the great disparities between vision and language, an important goal of image captioning is to link the information in the visual domain to the textual domain. However, many approaches conduct this process only in the decoder, making it hard to understand the images and generate captions effectively. In this paper, we propose to bridge the gap between the vision and language domains in the encoder, by enriching visual information with textual concepts, to achieve deep image understanding. To this end, we propose to explore textual-enriched image features. Specifically, we introduce two modules, namely the Textual Distilling Module and the Textual Association Module. The former distills relevant textual concepts from image features, while the latter further associates extracted concepts according to their semantics. In this manner, we acquire textual-enriched image features, which provide clear textual representations of images under no explicit supervision. The proposed approach can be used as a plugin and easily embedded into a wide range of existing image captioning systems. We conduct extensive experiments on two benchmark image captioning datasets, i.e., MSCOCO and Flickr30k. The experimental results and analysis show that, by incorporating the proposed approach, all baseline models receive consistent improvements over all metrics, with the most significant improvements up to 10% and 9% in terms of the task-specific metrics CIDEr and SPICE, respectively. The results demonstrate that our approach is effective and generalizes well to a wide range of models for image captioning.

Abstract:
Participation on social media platforms has many benefits but also poses substantial threats. Users often face an unintended loss of privacy, are bombarded with mis-/disinformation, or are trapped in filter bubbles due to over-personalized content. These threats are further exacerbated by the rise of hidden AI-driven algorithms working behind the scenes to shape users' thoughts, attitudes, and behaviour. We investigate how multimedia researchers can help tackle these problems to level the playing field for social media users. We perform a comprehensive survey of algorithmic threats on social media and use it as a lens to set a challenging but important research agenda for effective and real-time user nudging. We further implement a conceptual prototype and evaluate it with experts to supplement our research agenda. This paper calls for solutions that combat the algorithmic threats on social media by utilizing machine learning and multimedia content analysis techniques but in a transparent manner and for the benefit of the users.

Abstract:
This companion paper supports the experimental replication of paper "Outfit Compatibility Prediction and Diagnosis with Multi-Layered Comparison Network", which is presented at ACM Multimedia 2019. We provide the software package for replicating the implementation of Multi-Layered Comparison Network (MCN), as well as the Polyvore-T dataset and baseline methods compared in the original paper. This paper contains the guides to reproduce the experiment results including outfit compatibility prediction, outfit diagnosis and automatic outfit revision.

Abstract:
This work presents a so-called Smart Site Survey (SSS) system that provides an efficient, web-based platform for virtual inspection of remote sites with absolute 3D metrics. Traditional manual surveying requires sending surveyors and specialised measuring tools to the targeted scene, which takes time, requires significant human resources, and is often subject to human error. The proposed system provides an automated site survey tool. Sample indoor scenes including offices, storage rooms, and laboratories are used for testing purposes, and highly precise virtual scenes are restored, with a measurement accuracy of 1%, i.e., an error of ±1.5 cm over a 150 cm length. This is comparable or superior to existing works or commercial products.

Abstract:
Anomaly detection in surveillance videos, as a special case of video-based action recognition, has been of increasing interest in the multimedia community and public security. Action recognition in videos faces some challenges, such as cluttered backgrounds and varying illumination conditions. Beyond these difficulties, detecting anomalies in surveillance videos has several unique problems to be solved. For example, the lack of sufficient training samples is one of the main challenges for detecting anomalies in surveillance videos. In this paper, we propose to utilize transfer learning to leverage the good results from action recognition for anomaly detection in surveillance videos. More specifically, we explore some techniques based on action recognition models from the following aspects: training samples, temporal modules for action recognition, and network backbones. We draw some conclusions. First, more training samples from surveillance videos lead to higher classification accuracy. Second, stronger temporal modules designed for recognizing actions and deeper networks do not achieve better results. This conclusion is reasonable since deeper networks tend to overfit, especially on a small-scale training set. Besides, to distinguish the hard examples from normal activities, we separately train a neural network to classify the hard category and normal events. Then we fuse the binary network and the previous network to generate the final prediction for general anomaly detection. On the benchmarks of CitySCENE, our framework achieves promising performance and obtains the first prize for general anomaly detection and the second prize for specific anomaly detection.

Abstract:
Beauty and personal care product retrieval has attracted more and more attention due to its wide application value. However, due to the diversity of data and the complexity of image backgrounds, this task is very challenging. In this paper, we propose a multi-feature fusion method based on salient object detection to improve retrieval performance. The key of our method is to extract the foreground objects of the query set by using the salient object detection network, so as to eliminate the background interference. Then the foreground target images and dataset are put into the multi-classification networks to extract multiple fusion features for retrieval. We use the Perfect-500K dataset for experiments, and the results show that our method is effective. Our method ranked 2nd in the Grand Challenge of AI Meets Beauty in ACM Multimedia 2020 with a MAP score of 0.43729. We released our code on GitHub: github.com/R-M-Yan/ACMMM2020AIMeetBeauty.

Abstract:
The application of beauty and personal-care product retrieval is evident in our daily life, and it has attracted increasing research interest during the last decade. However, the retrieval task suffers from different image variations and complicated backgrounds. Recent works have demonstrated that the Generalized-attention Regional Maximal Activation of Convolutions (GRMAC) descriptor can provide state-of-the-art performance for the retrieval task. However, the GRMAC descriptor is limited by its reliance on features from a single layer. Features from a single layer are not robust enough to scale variations, shape deformation, and heavy occlusion. In this paper, we propose a novel descriptor, named Multi-Scale Generalized Attention-Based Regional Maximum Activation of Convolutions (MS-GRMAC). This method introduces a multi-scale generalized attention mechanism to reduce the influence of scale variations and thus can boost the performance of the retrieval task. To empirically investigate the effectiveness of the proposed approach, we conduct extensive experiments on the dataset containing more than half a million personal-care products (Perfect-500K) and obtain satisfactory results without ensemble.

Abstract:
Nowadays, people spend dramatically more time watching videos on different devices. Advanced hardware and network technology make it possible to meet users' increasing demands on the viewing experience. Thus, enhancing the Quality of Experience of end-users in advanced multimedia is the ultimate goal of service providers, as good services would attract more consumers. Quality assessment is thus important. The first workshop on "Quality of Experience (QoE) in visual multimedia applications" (QoEVMA'20) focuses on the QoE assessment of any visual multimedia applications both subjectively and objectively. The topics include 1) QoE assessment on different visual multimedia applications, including VoD for movies, dramas, variety shows, UGC on social networks, live streaming videos for gaming/shopping/social, etc. 2) QoE assessment for different video formats in multimedia services, including 2D, stereoscopic 3D, High Dynamic Range (HDR), Augmented Reality (AR), Virtual Reality (VR), 360, Free-Viewpoint Video (FVV), etc. 3) Key performance indicators (KPI) analysis for QoE. This summary gives a brief overview of the workshop, which took place on October 16, 2020 in Seattle (U.S.), as a half-day workshop.

Abstract:
This tutorial provides an actionable perspective on the experimental design for machine learning experiments on multimedia data. The tutorial consists of lectures and hands-on exercises. The lectures provide an engineering introduction to machine learning design. By understanding the information flow and quantities in the scientific process, machine learners can be designed to be more efficient and their limits can be easier understood. The thought framework presented is derived from the traditional experimental sciences which require published results to be self-contained with regards to reproducibility. In the practical exercises, we will work on calculating and measuring quantities like Memory Equivalent Capacity or generalization ratio for different machine learners and data sets and discuss how these quantities relate to reproducible experimental design.

Abstract:
In the past few years, Cloud Drive Apps have aroused increasing interest from end-users and enterprise customers. During this period, numerous artificial intelligence based features were introduced, such as functions enabling users to intelligently organize, search, share, edit and recreate content with their images and videos. In this talk, I will introduce our latest work related to highly-efficient image understanding, which aims to enable various novel methods (such as neural architecture search [1,2] and advanced training techniques [3,4]) to be practiced in Cloud Drive App use cases. I will discuss use-cases such as image search through free-text query, focusing on difficult real-world problems and suggested solutions. I will also demonstrate the usefulness of the proposed techniques when applied to public competitions.

Abstract:
Webly supervised learning has recently become attractive for its efficiency in data expansion without expensive human labeling. However, adopting search queries or hashtags as web labels of images for training brings massive noise that degrades the performance of DNNs. In particular, due to the semantic confusion of query words, the images retrieved by one query may contain a large number of images belonging to other concepts. For example, searching 'tiger cat' on Flickr will return a dominating number of tiger images rather than cat images. These realistic noisy samples usually have clear visual semantic clusters in the visual space that mislead DNNs from learning accurate semantic labels. To correct real-world noisy labels, expensive human annotations seem indispensable. Fortunately, we find that metadata can provide extra knowledge to discover clean web labels in a labor-free fashion, making it feasible to automatically provide correct semantic guidance among the massive label-noisy web data. In this paper, we propose an automatic label corrector, VSGraph-LC, based on the visual-semantic graph. VSGraph-LC starts from anchor selection referring to the semantic similarity between metadata and correct label concepts, and then propagates correct labels from anchors on a visual graph using a graph neural network (GNN). Experiments on the realistic webly supervised learning datasets Webvision-1000 and NUS-81-Web show the effectiveness and robustness of VSGraph-LC. Moreover, VSGraph-LC reveals its advantage on the open-set validation set.

Abstract:
Learning from music to visual storytelling of shots is an interesting and emerging task. It produces a coherent visual story in the form of a shot type sequence, which not only expands the storytelling potential for a song but also facilitates the automatic concert video mashup process and storyboard generation. In this study, we present a deep interactive learning (DIL) mechanism for building a compact yet accurate sequence-to-sequence model to accomplish the task. Different from the one-way transfer between a pre-trained teacher network (or ensemble network) and a student network in knowledge distillation (KD), the proposed method enables collaborative learning between an ensemble teacher network and a student network. Namely, the student network also teaches. Specifically, our method first learns a teacher network that is composed of several assistant networks to generate a shot type sequence and produce the soft target (shot types) distribution accordingly through KD. It then constructs the student network that learns from both the ground truth label (hard target) and the soft target distribution to alleviate the difficulty of optimization and improve generalization capability. As the student network gradually advances, it in turn feeds back knowledge to the assistant networks, thereby improving the teacher network in each iteration. Owing to such interactive designs, the DIL mechanism bridges the gap between the teacher and student networks and improves the capability of both networks. Objective and subjective experimental results demonstrate that both the teacher and student networks can generate more attractive shot sequences from music, thereby enhancing the viewing and listening experience.
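
The student objective described above, learning from both the hard target and the teacher's soft target distribution, can be sketched with a generic knowledge-distillation loss. The mixing weight `alpha`, the `temperature`, and the function names are assumptions for illustration; the abstract does not specify the DIL loss, and the feedback step from student to assistant networks is not modeled here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target: Sequence[float], predicted: Sequence[float]) -> float:
    """Cross-entropy H(target, predicted) in nats."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)


def student_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    label: int,
    alpha: float = 0.5,        # assumed weight between hard and soft terms
    temperature: float = 2.0,  # assumed softening temperature
) -> float:
    """Weighted sum of hard-label loss and soft-target (teacher) loss."""
    hard_target = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_term = cross_entropy(hard_target, softmax(student_logits))
    soft_term = cross_entropy(
        softmax(teacher_logits, temperature), softmax(student_logits, temperature)
    )
    return alpha * hard_term + (1.0 - alpha) * soft_term


# Toy logits over three hypothetical shot types; the true shot type is index 0.
teacher_logits = [2.0, 0.5, -1.0]
loss_agreeing = student_loss([2.0, 0.5, -1.0], teacher_logits, label=0)
loss_disagreeing = student_loss([-1.0, 0.5, 2.0], teacher_logits, label=0)
```

A student that agrees with both the label and the teacher receives the lower loss, which is the behaviour the soft target is meant to encourage.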

Abstract:
Food recognition has received more and more attention in the multimedia community for its various real-world applications, such as diet management and self-service restaurants. A large-scale ontology of food images is urgently needed for developing advanced large-scale food recognition algorithms, as well as for providing the benchmark dataset for such algorithms. To encourage further progress in food recognition, we introduce the dataset ISIA Food-500 with 500 categories from the list in Wikipedia and 399,726 images, a more comprehensive food dataset that surpasses existing popular benchmark datasets in category coverage and data volume. Furthermore, we propose a stacked global-local attention network, which consists of two sub-networks for food recognition. One sub-network first utilizes hybrid spatial-channel attention to extract more discriminative features, and then aggregates these multi-scale discriminative features from multiple layers into a global-level representation (e.g., texture and shape information about food). The other one generates attentional regions (e.g., ingredient-relevant regions) from different regions via cascaded spatial transformers, and further aggregates these multi-scale regional features from different layers into a local-level representation. These two types of features are finally fused into a comprehensive representation for food recognition. Extensive experiments on ISIA Food-500 and two other popular benchmark datasets demonstrate the effectiveness of our proposed method, which can thus be considered a strong baseline. The dataset, code and models can be found at http://123.57.42.89/FoodComputing-Dataset/ISIA-Food500.html.

Abstract:
As a vital topic in media content interpretation, video anomaly detection (VAD) has made fruitful progress via deep neural networks (DNNs). However, existing methods usually follow a reconstruction or frame prediction routine. They suffer from two gaps: (1) They cannot localize video activities in a manner that is both precise and comprehensive. (2) They lack sufficient abilities to utilize high-level semantics and temporal context information. Inspired by the frequently used cloze test in language study, we propose a brand-new VAD solution named Video Event Completion (VEC) to bridge the gaps above: First, we propose a novel pipeline to achieve both precise and comprehensive enclosure of video activities. Appearance and motion are exploited as mutually complementary cues to localize regions of interest (RoIs). A normalized spatio-temporal cube (STC) is built from each RoI as a video event, which lays the foundation of VEC and serves as a basic processing unit. Second, we encourage the DNN to capture high-level semantics by solving a visual cloze test. To build such a visual cloze test, a certain patch of the STC is erased to yield an incomplete event (IE). The DNN learns to restore the original video event from the IE by inferring the missing patch. Third, to incorporate richer motion dynamics, another DNN is trained to infer the erased patches' optical flow. Finally, two ensemble strategies using different types of IE and modalities are proposed to boost VAD performance, so as to fully exploit the temporal context and modality information for VAD. VEC can consistently outperform state-of-the-art methods by a notable margin (typically 1.5%-5% AUROC) on commonly used VAD benchmarks. Our codes and results can be verified at github.com/yuguangnudt/VEC_VAD
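
The visual cloze test can be pictured with a small data-preparation sketch: one patch of a spatio-temporal cube is taken out as the completion target and the rest forms the incomplete event. Representing the cube as a list of 2D patches, dropping the erased patch rather than masking it, and the function name are all assumptions made for illustration; the DNN that restores the patch is not shown.

```python
from typing import List, Tuple

Patch = List[List[float]]  # one H x W patch of pixel values
Cube = List[Patch]         # T temporally ordered patches from one RoI


def make_incomplete_event(cube: Cube, erased_index: int) -> Tuple[Cube, Patch]:
    """Return (incomplete_event, target_patch) for a visual cloze test.

    Assumption: the erased patch is dropped from the cube; zero-masking it
    in place would be another reasonable reading of 'erased'.
    """
    if not 0 <= erased_index < len(cube):
        raise IndexError("erased_index is outside the cube")
    target = [row[:] for row in cube[erased_index]]
    incomplete = [
        [row[:] for row in patch]
        for t, patch in enumerate(cube)
        if t != erased_index
    ]
    return incomplete, target


# Toy cube with three 2 x 2 patches (made-up pixel values).
cube = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6], [0.7, 0.8]],
    [[0.9, 1.0], [1.1, 1.2]],
]
incomplete_event, target_patch = make_incomplete_event(cube, erased_index=1)
```

Erasing different positions yields different types of incomplete event, which is one way to read the ensemble strategy over IE types mentioned in the abstract.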

Abstract:
Learning subtle discriminative features plays a significant role in fine-grained image classification. Existing methods usually extract the distinguishable parts through the attention module for classification. Although these learned distinguishable parts contain valuable features that are beneficial for classification, some irrelevant features are also preserved, which may prevent the model from making a correct classification, especially for fine-grained tasks due to the similarities between classes. How to keep the discriminative features while removing confusable features from the distinguishable parts is an interesting yet challenging task. In this paper, we introduce a novel classification approach, named Logical-based Feature Extraction Model (LAFE for short), to address this issue. The main advantage of LAFE lies in the fact that it can explicitly add the significance of discriminative features and subtract the confusable features. Specifically, LAFE utilizes the region attention modules and channel attention modules to extract discriminative features and confusable features, respectively. Based on this, two novel loss functions are designed to automatically induce attention over these features for fine-grained image classification. Our approach demonstrates its robustness, efficiency, and state-of-the-art performance on three benchmark datasets.

Abstract:
The standard paradigm of video super-resolution (SR) is to generate the spatial-temporal coherent high-resolution (HR) sequence from the corresponding low-resolution (LR) version which has already been decoded from the bitstream. However, a highly practical while relatively under-studied way is enabling the built-in SR functionality in the decoder, in the sense that almost all videos are compactly represented. In this paper, we systematically investigate the SR of compressed LR videos by leveraging the interactivity between decoding prior and deep prior. By fully exploiting the compact video stream information, the proposed bitstream prior embedded SR framework achieves compressed video SR and quality enhancement simultaneously in a single feed-forward process. More specifically, we propose a motion vector guided multi-scale local attention module that explicitly exploits the temporal dependency and suppresses coding artifacts with substantially economized computational complexity. Moreover, a scale-wise deep residual-in-residual network is learned to reconstruct the SR frames from the multi-scale fused features. To facilitate the research of compressed video SR, we also build a large-scale dataset with compressed videos of diverse content, including ready-made diversified kinds of side information extracted from the bitstream. Both quantitative and qualitative evaluations show that our model achieves superior performance for compressed video SR, and offers competitive performance compared to the sequential combinations of the state-of-the-art methods for compressed video artifacts removal and SR.

Abstract:
Temporal grounding of natural language in untrimmed videos is a fundamental yet challenging multimedia task facilitating cross-media visual content retrieval. We focus on the weakly supervised setting of this task, which has access only to coarse video-level language description annotations without temporal boundaries; this setting is more consistent with reality, as such weak labels are more readily available in practice. In this paper, we propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning (RL) to guide the process of progressively refining the temporal boundary. To the best of our knowledge, we offer the first attempt to extend RL to the temporal localization task with weak supervision. As it is non-trivial to obtain a straightforward reward function in the absence of pairwise granular boundary-query annotations, a cross-modal alignment evaluator is crafted to measure the alignment degree of each segment-query pair and provide tailor-designed rewards. This refinement scheme completely abandons the traditional sliding-window-based solution pattern and contributes to acquiring more efficient, boundary-flexible and content-aware grounding results. Extensive experiments on two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR outperforms the state-of-the-art weakly-supervised method and even beats some competitive fully-supervised ones.

Abstract:
One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the State-Of-The-Art (SOTA) models for this task tend to be exceedingly sophisticated and over-parameterized, and the resulting low efficiency in model training and inference has obstructed development in the field, especially for large-scale action datasets. In this work, we propose an efficient but strong baseline based on Graph Convolutional Network (GCN), where three main improvements are aggregated, i.e., early fused Multiple Input Branches (MIB), Residual GCN (ResGCN) with bottleneck structure and a Part-wise Attention (PartAtt) block. Firstly, an MIB is designed to enrich informative skeleton features and retain compact representations at an early fusion stage. Then, inspired by the success of the ResNet architecture in Convolutional Neural Networks (CNNs), a ResGCN module is introduced in GCN to alleviate computational costs and reduce learning difficulties in model training while maintaining the model accuracy. Finally, a PartAtt block is proposed to discover the most essential body parts over a whole action sequence and obtain more explainable representations for different skeleton action sequences. Extensive experiments on two large-scale datasets, i.e., NTU RGB+D 60 and 120, validate that the proposed baseline slightly outperforms other SOTA models while requiring far fewer parameters during training and inference, e.g., up to 34 times fewer than DGNN, which is one of the best SOTA methods.

Abstract:
In this paper, we propose a new method for human parsing, which effectively maintains high-resolution representations and leverages body edge details to improve the performance. First, we propose a hybrid resolution network (HyRN) for human parsing and body edge detection. In our HyRN, we adopt deconvolution operation and auxiliary supervision to increase the discrimination ability of features from each scale. Second, considering the close relationship between human parsing and body edge detection, we propose a dual-task cascaded framework (DTCF), which implicitly integrates parsing and edge features to progressively refine the parsing results. Third, we develop an edge guided region mutual information loss, which uses the edge detection results to explicitly maintain the high order consistency between parsing prediction and ground truth around body edge pixels. When evaluated on standard benchmarks, our proposed HyRN achieves competitive accuracy compared with state-of-the-art human parsing methods. Moreover, our DTCF further improves the performance and outperforms the established baseline approach by 3.42 points w.r.t. mIoU on the LIP dataset.

Abstract:
Live video interactive commenting, a.k.a. danmaku, is an emerging social feature on online video sites, which involves rich multimodal information interaction among viewers. In order to support various related research, we build a large scale video interactive comments dataset called VideoIC, which consists of 4951 videos spanning 557 hours and 5 million comments. Videos are collected from popular categories on the 'Bilibili' video streaming website. Compared to other existing danmaku datasets, our VideoIC contains richer and denser comment information, with 1077 comments per video on average. High comment density and diverse video types make VideoIC a challenging corpus for various research such as automatic video comment generation. We also propose a novel model based on multimodal multitask learning for comment generation (MML-CG), which integrates multiple modalities to achieve effective comment generation and temporal relation prediction. A multitask loss function is designed to train both tasks jointly in an end-to-end manner. We conduct extensive experiments on both VideoIC and Livebot datasets. The results prove the effectiveness of our model and reveal some features of danmaku.

Abstract:
Visual data collected from Unmanned Aerial Vehicles (UAVs) has opened a new frontier of computer vision that requires automated analysis of aerial images/videos. However, the existing UAV datasets primarily focus on object detection. An object detector does not differentiate between moving and non-moving objects. Given a real-time UAV video stream, how can we both localize and classify the moving objects, i.e., perform moving object recognition (MOR)? MOR is one of the essential tasks to support various UAV vision-based applications including aerial surveillance, search and rescue, event recognition, and urban and rural scene understanding. To the best of our knowledge, no labeled dataset is available for MOR evaluation in UAV videos. Therefore, in this paper, we introduce MOR-UAV, a large-scale video dataset for MOR in aerial videos. We achieve this by labeling axis-aligned bounding boxes for moving objects, which requires less computational resources than producing pixel-level estimates. We annotate 89,783 moving object instances collected from 30 UAV videos, consisting of 10,948 frames in various scenarios such as weather conditions, occlusion, changing flying altitude and multiple camera views. We assigned the labels for two categories of vehicles (car and heavy vehicle). Furthermore, we propose a deep unified framework, MOR-UAVNet, for MOR in UAV videos. Since this is a first attempt at MOR in UAV videos, we present 16 baseline results based on the proposed framework over the MOR-UAV dataset through quantitative and qualitative experiments. We also analyze the motion-salient regions in the network through multiple layer visualizations. MOR-UAVNet works online at inference as it requires only a few past frames. Moreover, it does not require predefined target initialization from the user. Experiments also demonstrate that the MOR-UAV dataset is quite challenging.

Abstract:
Recently, generative adversarial networks (GAN) have been widely used to solve image-to-image translation problems such as edges to photos, labels to scenes, and colorizing grayscale images. However, how to recover details of smoothed images is still unexplored. Naively training a GAN like pix2pix yields unsatisfactory results because it ignores two main characteristics of this problem: spatial variability and spatial correlation. In this work, we propose DeSmoothGAN to exploit both characteristics specifically. Spatial variability means that the details of different areas of smoothed images are distinct and should be recovered differently. Therefore, we propose to perform spatial feature-wise transformation to recover individual areas differently. Spatial correlation means that the details of different areas are related to each other. Thus, we propose to apply full attention to consider the relations between them. The proposed method generates satisfying results on several real-world datasets. We have conducted quantitative experiments including smooth consistency and image similarity to demonstrate the effectiveness of DeSmoothGAN. Furthermore, ablation studies are performed to illustrate the usefulness of our proposed feature-wise transformation and full attention.

Abstract:
In the big data era, with the increasing amount of multi-media data, approximate nearest neighbor (ANN) search has been an important but challenging problem. As a widely applied large-scale ANN search method, hashing has made great progress, and achieved sub-linear search time with low memory space. However, the advances in hashing are based on the availability of large and representative datasets, which often contain sensitive information. Typically, the privacy of this individually sensitive information is compromised. In this paper, we tackle this valuable yet challenging problem and formulate a task termed as private hashing, which takes into account both searching performance and privacy protection. Specifically, we propose a novel noise mechanism, i.e., Random Flipping, and two private hashing algorithms, i.e., PHashing and PITQ, with the refined analysis within the framework of differential privacy, since differential privacy is a well-established technique to measure the privacy leakage of an algorithm. Random Flipping targets binary scenarios and leverages the "Imperceptible Lying" idea to guarantee ε-differential privacy by flipping each datum of the binary matrix (noise addition). To preserve ε-differential privacy, PHashing perturbs and adds noise to the hash codes learned by non-private hashing algorithms using Random Flipping. However, the noise addition for privacy in PHashing will cause severe performance drops. To alleviate this problem, PITQ leverages the power of alternative learning to distribute the noise generated by Random Flipping into each iteration while preserving ε-differential privacy. Furthermore, to empirically evaluate our algorithms, we conduct comprehensive experiments on the image search task and demonstrate that the proposed algorithms achieve equal performance compared with non-private hashing methods.
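
As an illustration of how flipping binary entries can yield a differential privacy guarantee, the sketch below uses classic randomized response: each bit is flipped independently with probability 1 / (1 + e^ε), so the ratio between keeping and flipping a bit is e^ε. This is an assumption made for illustration only; the abstract does not state the flipping probability used by Random Flipping, and the names `flip_probability` and `random_flip` are hypothetical.

```python
import math
import random


def flip_probability(epsilon: float) -> float:
    """Flip probability of classic randomized response for one bit.

    Assumption: this is the textbook mechanism, not necessarily the
    exact probability used by the paper's Random Flipping.
    """
    return 1.0 / (1.0 + math.exp(epsilon))


def random_flip(codes, epsilon, rng=None):
    """Return a noisy copy of a binary matrix (list of lists of 0/1).

    Each entry is flipped independently with flip_probability(epsilon).
    """
    rng = rng or random.Random()
    p = flip_probability(epsilon)
    return [[1 - bit if rng.random() < p else bit for bit in row]
            for row in codes]


if __name__ == "__main__":
    hash_codes = [[0, 1, 1, 0], [1, 0, 0, 1]]
    print(random_flip(hash_codes, epsilon=1.0, rng=random.Random(0)))
```

A smaller ε pushes the flip probability toward 0.5 (stronger privacy, noisier codes), while a larger ε leaves the codes almost unchanged, which is consistent with the performance drop the abstract attributes to noise addition.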

Abstract:
Facial action unit (AU) intensity is an index to describe all visually discernible facial movements. Most existing methods learn an intensity estimator from limited AU data and lack the ability to generalize beyond the dataset. In this paper, we present a framework to predict the facial parameters (including identity parameters and AU parameters) based on a bone-driven face model (BDFM) under different views. The proposed framework consists of a feature extractor, a generator, and a facial parameter regressor. The regressor can fit the physically meaningful parameters of the BDFM from a single face image with the help of the generator, which maps the facial parameters to the game-face images as a differentiable renderer. Besides, identity loss, loopback loss, and adversarial loss can improve the regression results. Quantitative evaluations are performed on two public databases, BP4D and DISFA, demonstrating that the proposed method can achieve comparable or better performance than the state-of-the-art methods. Moreover, the qualitative results also demonstrate the validity of our method in the wild.

Abstract:
Assessing an individual's personality traits has important implications in psychology, sociology, and economics. Conventional personality measurement methods are questionnaire-based, which is time-consuming and labor-intensive. With the pervasive deployment of mobile communication applications, smartphone usage data has been found to relate to people's social-behavioral and psychological aspects. In this paper, we propose a deep learning approach to infer people's Big Five personality traits based on smartphone data. Specifically, we collect smartphone usage snapshots with an Android App, and extract features from the collected data. We propose a multi-view multi-task learning approach with a deep neural network model to fuse the extracted features and learn the Big Five personality traits jointly. Extensive experiments based on the real-world smartphone data collected from university volunteers show that the proposed approach significantly outperforms the state-of-the-art algorithms in personality prediction.

Abstract:
Affective media videos have been used as stimuli to investigate an individual's affective-physio responses. In this study, we aim to develop a network learning strategy for robust cross-corpus emotion recognition using physiological features jointly with affective video content. Specifically, we present a novel framework of Visual Semantic Graph Learning Convolutional Network (VGLCN) for individual emotional state recognition using physiology on transfer learning tasks. The video stimulus content is integrated into a learnable graph structure to weight the importance of physiology on the two emotion dimensions, valence and arousal. Furthermore, we evaluate our proposed framework on two public emotion databases with a rigorous cross validation method, and our model achieves the best unweighted average recall (UAR), namely 67.9% and 56.9% for arousal and 79.8% and 70.4% for valence, respectively, in the cross-dataset recognition experiments. Further analyses reveal that 1) VGLCN is especially effective on the binary valence transfer task, 2) the physiological features (ECG, EDA) are very informative for emotion recognition and 3) the affective media videos are an important constraint to include in the framework to stabilize performance.
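
For readers unfamiliar with the metric reported above, unweighted average recall is the mean of the per-class recalls, so every class counts equally regardless of how many samples it has. The following is a minimal sketch of that standard definition; the function name and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    hits = defaultdict(int)    # correctly predicted samples per true class
    totals = defaultdict(int)  # number of samples per true class
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == pred:
            hits[truth] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


if __name__ == "__main__":
    # A classifier that always predicts class 0 on imbalanced data:
    # accuracy is 0.8, but UAR is only (1.0 + 0.0) / 2 = 0.5.
    print(unweighted_average_recall([0, 0, 0, 0, 1], [0, 0, 0, 0, 0]))
```

The toy case shows why UAR is often preferred over accuracy when the binary valence or arousal classes are imbalanced.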

Abstract:
The crux of homography estimation is that the homography is characterized by the geometric correspondences between two related images rather than appearance features, which differs from typical image recognition tasks. Existing methods either decompose the task of homography estimation into several individual sub-problems and optimize them sequentially, or attempt to tackle it in an end-to-end manner by delegating the whole task to deep convolutional neural networks (CNNs). However, it is quite arduous for CNNs to learn the mapping function from appearance features of related images to the homography directly. In this paper, we propose to parse the geometric correspondences between related images explicitly to bridge the gap between deep appearance features and the homography. Furthermore, we propose a coarse-to-fine estimation framework to capture different scales of homography transformations and thus predict the homography in a stepwise-refining manner. Additionally, we propose a pyramidal supervision scheme to leverage an important prior concerning the homography estimation. Extensive experiments on two large-scale datasets demonstrate that our model advances the state-of-the-art performance significantly.
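
To make concrete what "characterized by geometric correspondences" means, the sketch below applies a 3x3 homography to a 2D point using homogeneous coordinates. It only illustrates the standard definition of a homography; it is not the paper's estimation network, and the function name `apply_homography` is hypothetical.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography H (nested lists)."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to the image plane.
    return (xh / w, yh / w)


if __name__ == "__main__":
    translation = [[1, 0, 5], [0, 1, -1], [0, 0, 1]]
    print(apply_homography(translation, (2, 3)))  # (7.0, 2.0)
```

Because of the final division, H is defined only up to a non-zero scale factor, which leaves eight degrees of freedom; four point correspondences in general position are therefore enough to determine it.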

Abstract:
Due to the wide range of different natural temporal and spatial distortions appearing in user generated video content, blind assessment of natural video quality is a challenging research problem. In this study, we combine the hand-crafted statistical temporal features used in a state-of-the-art video quality model and spatial features obtained from a convolutional neural network trained for image quality assessment via transfer learning. Experimental results on two recently published natural video quality databases show that the proposed model can predict subjective video quality more accurately than the publicly available video quality models representing the state-of-the-art. The proposed model is also competitive in terms of computational complexity.

Abstract:
Online micro-video recommender systems aim to address the information explosion of micro-videos and make personalized recommendations for users. However, the existing methods still have some limitations in learning representative user interests, since the multi-scale time effects, user interest group modeling, and false positive interactions are not taken into consideration. In view of this, we propose an end-to-end Multi-scale Time-aware user Interest modeling Network (MTIN). In particular, we first present an interest group routing algorithm to generate fine-grained user interest groups based on the user's interaction sequence. Afterwards, to explore multi-scale time effects on user interests, we design a time-aware mask network and distill multiple temporal information by several parallel temporal masks. Then, an interest mask network is introduced to aggregate fine-grained interest groups and generate the final user interest representation. Finally, in the prediction unit, the user representation and micro-video candidates are fed into a deep neural network (DNN) for predictions. To demonstrate the effectiveness of our method, we conduct experiments on two publicly available datasets, and the experimental results demonstrate that our proposed model achieves substantial gains over the state-of-the-art methods.

Abstract:
Vehicle re-identification (re-ID) aims to retrieve images of the same vehicle across multiple cameras. It has attracted wide attention in the field of computer vision owing to the deployment of surveillance systems. However, some unfavorable factors restrict the retrieval accuracy of re-ID; minor inter-class difference and orientation variation are two main issues. In this study, we propose a multi-branch network based on common field of view (CFVMNet) to address these issues. In the proposed method, we extract and fuse the global and local detail features using four branches and the Batch DropBlock (BDB) strategy to accentuate inter-class difference. We also consider some other attributes (i.e., color, type, and model) in the feature extraction process to make the final features more recognizable. For the issue of orientation variation, which can lead to large intra-class difference, we learn two different metrics according to whether two vehicle images share a common field of view, which enables the proposed CFVMNet to focus on different regions. Extensive experiments on two public datasets, VeRi-776 and VehicleID, show that the proposed method outperforms the state-of-the-art approaches to vehicle re-ID.

Abstract:
The use of immersive virtual environments (IVEs) for educational purposes has increased in recent years, but the mechanisms through which they contribute to learning are still unclear. Popular explanations for the learning benefits brought by IVEs come from motivation, presence and embodied perspectives, either as individual benefits or through mediation effects on each other. This paper describes an experiment designed to interrogate these approaches, and provides evidence that embodied controls and presence encourage learning in immersive virtual environments, but for distinct, non-interacting reasons, which are also not explained by motivational benefits.

Abstract:
6DoF object pose estimation is essential for many real-world applications. Although great progress has been made, challenges still remain in estimating 6D pose for occluded objects. Current RGB-D approaches predict 6DoF pose directly, which is sensitive to occlusion in cluttered scenes. In this work, we propose DCNet, an end-to-end framework for estimating 6DoF object poses. DCNet first converts pixels in the image plane to point clouds in the camera coordinate system and then establishes dense correspondences between the camera coordinate system and the object coordinate system. Based on these two systems, we fuse 2D appearance and 3D geometric features by pixel-wise concatenation to construct dense correspondences, from which the pose is calculated through the least-squares fitting algorithm. Dense correspondences guarantee enough point pairs for a robust 6DoF pose estimation, even if the occlusion is heavy. Experimental results demonstrate that DCNet outperforms the state-of-the-art methods on LINEMOD, Occlusion LINEMOD and YCB-Video datasets, especially in terms of the robustness to occlusion scenes.
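
The first step described above, converting image pixels with depth into points in the camera coordinate system, is commonly done with the pinhole camera model. The sketch below shows that standard back-projection; the intrinsics `fx`, `fy`, `cx`, `cy` and both function names are assumptions for illustration, since the abstract does not specify the camera model DCNet uses.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into camera coordinates.

    Assumes an ideal pinhole camera with focal lengths (fx, fy) and
    principal point (cx, cy), with no lens distortion.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def depth_map_to_points(depth_map, fx, fy, cx, cy):
    """Convert a depth map (rows of depths) into a list of 3D points.

    Pixels with non-positive depth are treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth_map):
        for u, z in enumerate(row):
            if z > 0:
                points.append(backproject(u, v, z, fx, fy, cx, cy))
    return points


if __name__ == "__main__":
    # A pixel at the principal point lies on the optical axis.
    print(backproject(320, 240, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

Each resulting camera-frame point can then be paired with a predicted object-frame point, giving the dense 3D-3D correspondences from which a least-squares rigid fit recovers the pose.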

Abstract:
Social Media Popularity (SMP) prediction focuses on predicting the social impact of a given post from a specific user in social media, which is crucial for online advertising, social recommendation, and demand prediction. In this paper, we present HyFea, our winning solution to the Social Media Prediction (SMP) Challenge for multimedia grand challenge of ACM Multimedia 2020. To address the multi-modality and personality issues of this challenge, HyFea carefully considers multiple feature types and adopts a tree-based ensembling method, i.e., CatBoost, which is shown to perform well in prediction. Specifically, HyFea involves the features related to Image, Category, Space-Time, User Profile, Tag, and Others. We conduct several experiments on the Social Media Prediction Dataset (SMPD), verifying the positive contributions of each type of features.

Abstract:
Popularity prediction of social posts is one of the most critical issues for social media analysis and understanding. In this paper, we discover a more dominant feature representation of text information, as well as propose a single ensemble learning model to obtain the popularity scores, for the social media prediction challenge. Most social media prediction techniques focus on predicting the popularity score of social posts based on a single model, such as deep learning-based or ensemble learning-based approaches. However, it is well-known that the model stacking strategy is a more effective way to boost the performance on various regression tasks. In this paper, we also show that the model stacking can be modeled as a simple recurrent neural network problem with comparable performance on predicting popularity scores. Firstly, a single strong baseline is proposed based on the deep neural network with a prediction branch. Then, the partial feature maps of the last layer of our strong baseline are used to establish a new branch with an isolated predictor. It is easy to obtain multiple predictions by repeating the above two steps. These preliminary predicted scores then form the input of the recurrent unit to learn the final predicted scores; we call this the Recurrent Stacking Model (RSM). Our experiments show that the proposed ensemble learning approach outperforms other state-of-the-art methods. Furthermore, the proposed RSM also shows superiority over our ensemble learning approach, verifying that the model stacking problem can be transformed into the training problem of a recurrent neural network.

Abstract:
Multi-object tracking systems often consist of a combination of a detector, a short term linker, a re-identification feature extractor and a solver that takes the output from these separate components and makes a final prediction. In contrast, this work aims to unify all these in a single tracking system. Towards this, we propose Siamese Track-RCNN, a two stage detect-and-track framework which consists of three functional branches: (1) the detection branch localizes object instances; (2) the Siamese-based track branch estimates the object motion and (3) the object re-identification branch re-activates the previously terminated tracks when they re-emerge. We apply this design to the Human in Events dataset.

Abstract:
The series of FAT/FAccT events aim at bringing together researchers and practitioners interested in fairness, accountability, transparency and ethics of computational methods. The FATE/MM workshop focuses on addressing these issues in the Multimedia field. Multimedia computing technologies operate today at an unprecedented scale, with a growing community of scientists interested in multimedia models, tools and applications. Such continued growth has great implications not only for the scientific community, but also for the society as a whole. Typical risks of large-scale computational models include model bias and algorithmic discrimination. These risks become particularly prominent in the multimedia field, which historically has been focusing on user-centered technologies. To ensure a healthy and constructive development of the best multimedia technologies, this workshop offers a space to discuss how to develop ethical, fair, unbiased, representative, and transparent multimedia models, bringing together researchers from different areas to present computational solutions to these issues.

Abstract:
SUMAC 2020 is the second edition of the workshop on Structuring and Understanding of Multimedia heritAge Contents. It is held in Seattle, USA on October 12th, 2020 and is co-located with the 28th ACM International Conference on Multimedia; this year, due to the sanitary crisis, it is organized virtually. Its objective is to present and discuss the latest and most significant trends and challenges in the analysis, structuring and understanding of multimedia contents dedicated to the valorization of heritage, with the emphasis on the unlocking of and access to the big data of the past. A representative scope of Computer Science methodologies dedicated to the processing of multimedia heritage contents and their exploitation is covered by the works presented, with the ambition of advancing and raising awareness about this fully developing research field.

Abstract:
The world is welcoming the new normal: the coronavirus pandemic has significantly changed the way people live, work, communicate and learn. Almost everyone now is wearing a face mask when they go in public. People are working from home, some taking care of children at the same time. Bars and restaurants are limited to carry-out and delivery only. Meetings and conferences go online. Schools are closed and educators are instead holding video conference classes regularly. All these become the new normal as our ways of life. The panel thus provides a valuable opportunity for people from a variety of backgrounds to exchange views on opportunities and challenges for AI multimedia in the current and post-pandemic era.

Abstract:
Many cognitive studies have shown that humans may 'see voices' or 'hear faces', and such an ability could potentially be realized by machine vision and intelligence. However, this research is still at an early stage. In this paper, we present a novel adversarial deep semantic matching network for efficient voice-face interactions and associations, which can well learn the correspondence between voices and faces for various cross-modal matching and retrieval tasks. Within the proposed framework, we exploit a simple and efficient adversarial learning architecture to learn the cross-modal embeddings between faces and voices, which consists of two subnetworks, respectively, for generator and discriminator. The former subnetwork is designed to adaptively discriminate the high-level semantic features between voices and faces, in which the triplet loss and multi-modal center loss are used in tandem to explicitly regularize the correspondences among them. The latter subnetwork is further leveraged to maximally bridge the semantic gap between the representations of voice and face data, with a focus on maintaining the semantic consistency. Through the joint exploitation of the above, the proposed framework can pull representations of voice-face data from the same person closer while pushing those of different persons apart. Extensive experiments empirically show that the proposed approach involves fewer parameters and calculations, adapts to various cross-modal matching tasks for voice-face data and brings substantial improvements over the state-of-the-art methods.
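
The triplet loss mentioned above can be sketched in its standard margin-based form: the anchor-positive distance should be smaller than the anchor-negative distance by at least a margin. The Euclidean distance, the margin value and the function names below are assumptions for illustration; the abstract does not give the exact formulation used in the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard margin-based triplet loss for one (a, p, n) triple.

    anchor and positive would come from the same person (for example
    a voice embedding and a face embedding), negative from another.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    voice = [0.0, 0.0]
    same_face = [0.0, 1.0]
    other_face = [0.0, 1.25]
    print(triplet_loss(voice, same_face, other_face))  # 0.25
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triples that violate this ordering contribute gradients.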

Abstract:
The long and unconstrained nature of egocentric videos makes it imperative to use temporal segmentation as an important pre-processing step for many higher-level inference tasks. Activities of the wearer in an egocentric video typically span over hours and are often separated by slow, gradual changes. Furthermore, the change of camera viewpoint due to the wearer's head motion causes frequent and extreme, but spurious, scene changes. The continuous nature of boundaries makes it difficult to apply traditional Markov Random Field (MRF) pipelines relying on temporal discontinuity, whereas deep Long Short Term Memory (LSTM) networks gather context only up to a few hundred frames, rendering them ineffective for egocentric videos. In this paper, we present a novel unsupervised temporal segmentation technique especially suited for day-long egocentric videos. We formulate the problem as detecting concept drift in a time-varying, non-i.i.d. sequence of frames. Statistically bounded thresholds are calculated to detect concept drift between two temporally adjacent multivariate data segments with different underlying distributions while establishing guarantees on false positives. Since the derived threshold indicates confidence in the prediction, it can also be used to control the granularity of the output segmentation. Using our technique, we report significantly improved state of the art f-measure for daylong egocentric video datasets, as well as photostream datasets derived from them: HUJI (73.01%, 59.44%), UTEgo (58.41%, 60.61%) and Disney (67.63%, 68.83%).

Abstract:
Deep learning based LiDAR odometry (LO) estimation has attracted increasing research interest in the field of autonomous driving and robotics. Existing works feed consecutive LiDAR frames into neural networks as point clouds and match pairs in the learned feature space. In contrast, motivated by the success of image based feature extractors, we propose to transfer the LiDAR frames to image space and reformulate the problem as image feature extraction. With the help of scale-invariant feature transform (SIFT) for feature extraction, we are able to generate matched keypoint pairs (MKPs) that can be precisely returned to the 3D space. A convolutional neural network pipeline is designed for LiDAR odometry estimation from the extracted MKPs. The proposed scheme, namely LodoNet, is then evaluated on the KITTI odometry estimation benchmark, achieving results on par with or even better than the state-of-the-art.

Abstract:
Machine vision of human facial expressions has been studied for decades, from prototypical expressions to Action Units (AUs), from hand-crafted to deep features, from multi-class to multi-label classifications. Since the widely adopted deep networks lack interpretability of their learnt representations, human prior knowledge cannot be effectively imposed and examined. On the other hand, AU is a human defined concept. In order to align with this idea, a finer level of network design is desired. In this paper, we first extend the heatmaps to ROI maps, encoding the location of both positive and negative AU occurrences, then employ a well-designed backbone network to regress them. In this way, AU detection is performed in two stages: key region localization and occurrence classification. To promote the spatial dependency among ROIs, we utilize graph convolution for feature refinement. The decomposition of the similarity matrix is supervised by AU labels. This novel framework is evaluated on two benchmark databases (BP4D and DISFA) for AU detection. The experimental results are superior to the state-of-the-art algorithms and baseline models, demonstrating the effectiveness of our proposed method.

Abstract:
The swift development of multimedia technology has dramatically raised users' expectations of the quality of experience. To obtain the ground-truth perceptual quality for model training, subjective assessment is necessary. Crowdsourcing platforms provide a convenient and feasible way to run large-scale experiments. However, the obtained perceptual quality labels are generally noisy. In this paper, we propose a probabilistic graphical annotation model to infer the underlying ground truth and discover the annotator's behavior. In the proposed model, the ground truth quality label is considered to follow a categorical distribution rather than to be a unique number, i.e., different reliable opinions on the perceptual quality are allowed. In addition, different annotator behaviors in crowdsourcing are modeled, which allows us to identify the possibility that the annotator makes noisy labels during the test. The proposed model has been tested on both simulated data and real-world data, where it always shows performance superior to the other state-of-the-art models in terms of accuracy and robustness.

Abstract:
Video story question answering (video story QA) is a challenging problem, as it requires a joint understanding of diverse data sources (i.e., video, subtitle, question, and answer choices). Existing approaches for video story QA have several common defects: (1) single temporal scale; (2) static and rough multimodal interaction; and (3) insufficient (or shallow) exploitation of both question and answer choices. In this paper, we propose a novel framework named Dual Hierarchical Temporal Convolutional Network (DHTCN) to address the aforementioned defects together. The proposed DHTCN explores multiple temporal scales by building a hierarchical temporal convolutional network. In each temporal convolutional layer, two key components, namely AttLSTM and QA-Aware Dynamic Normalization, are introduced to capture the temporal dependency and the multimodal interaction in a dynamic and fine-grained manner. To enable sufficient exploitation of both question and answer choices, we increase the depth of QA pairs with a stack of non-linear layers, and exploit QA pairs in each layer of the network. Extensive experiments are conducted on two widely used datasets, TVQA and MovieQA, demonstrating the effectiveness of DHTCN. Our model obtains state-of-the-art results on both datasets.

Abstract:
We revisit our contributions on visual sentiment analysis for online review images published at ACM Multimedia 2017, where we develop item-oriented and user-oriented convolutional neural networks that better capture the interaction of image features with specific expressions of users or items. In this work, we outline the experimental claims as well as describe the procedures to reproduce the results therein. In addition, we provide artifacts including data sets and code to replicate the experiments.

Abstract:
Video streaming platforms are required to innovate their delivery pipeline to allow new and more immersive video content to be supported. In particular, omnidirectional videos enable users to explore a 360° scene by moving their heads while wearing Head Mounted Display devices. Viewport adaptive streaming allows dynamically changing the quality of the video falling in the user's field of view. In this paper, we present TAPAS-360°, an open-source tool that enables designing and experimenting with all the components required to build omnidirectional video streaming systems. The tool can be used by researchers focusing on the design of viewport-adaptive algorithms and also to produce video streams to be employed for subjective and objective Quality of Experience evaluations.

Abstract:
Public AI-as-a-Service (AIaaS) is a promising next-generation computing paradigm that attracts resource-limited mobile users to outsource their machine learning tasks. However, the time delay between cloud/edge servers and end users makes real-time mobile artificial intelligence applications difficult to support. In this demonstration, we present EmotionTracker, a real-time mobile facial expression tracking system combining AIaaS and mobile local auxiliary computing, including facial expression tracking and the corresponding task offloading. Mobile facial expression tracking iteratively estimates the facial expression with the help of sparse optical flow and a neural network. Task offloading dynamically estimates the moment of task offloading with a machine learning method. According to the results in a real-world environment, EmotionTracker successfully fulfills the mobile real-time facial expression tracking requirements.

Abstract:
Blockchain technology provides a data authentication and permanent storage solution to the data volatility issue in peer-to-peer games. In this work, we present Infinity Battle, a serverless turn-based strategy game supported by a novel Proof-of-Play consensus model. Comprising three major phases (matchmaking, gaming session, and global synchronization), the proposed demo game generates a blockchain through distributed storage and processing.

Abstract:
Visual contexts often help to recognize named entities more precisely in short texts such as tweets or Snapchat captions. For example, one can identify "Charlie" as the name of a dog from the images in the user's posts. Previous works on multimodal named entity recognition ignore the correspondence between visual objects and entities. Visual objects can be regarded as fine-grained image representations. For a sentence with multiple entity types, objects of the relevant image can be utilized to capture different entity information. In this paper, we propose a neural network which combines object-level image information and character-level text information to predict entities. Vision and language are bridged by leveraging object labels as embeddings, and a dense co-attention mechanism is introduced for fine-grained interactions. Experimental results on a Twitter dataset demonstrate that our method outperforms the state-of-the-art methods.

Abstract:
To distinguish the subtle differences among fine-grained categories, a large number of well-labeled images is typically required. However, manual annotation of fine-grained categories is an extremely difficult task, as it usually demands professional knowledge. To this end, we propose to directly leverage web images for fine-grained visual recognition. Our work mainly focuses on two critical issues in web images: "label noise" and "domain mismatch". Specifically, we propose an end-to-end deep denoising network (DDN) model to jointly solve these problems in the process of web image selection. To verify the effectiveness of our proposed approach, we first collect web images by using the labels in fine-grained datasets. Then we apply the proposed deep denoising network model for noise removal and domain mismatch alleviation. We leverage the selected web images as the training set for learning fine-grained categorization models. Extensive experiments and ablation studies demonstrate the state-of-the-art performance of our proposed approach, which, at the same time, delivers a new pipeline for fine-grained visual categorization that promises to be highly effective for real-world applications.

Abstract:
Essentially, the current concept of multimedia is limited to presenting what people see with their eyes. What people think inside their brains, however, remains a rich source of multimedia, such as imaginations of paradise and memories of the good old days. In this paper, we propose a dual conditioned and lateralization supported GAN (DCLS-GAN) framework to learn and visualize the brain thoughts evoked by stimulating images and hence enable multimedia to reflect not only what people see but also what people think. To reveal such a new world of multimedia inside human brains, we coin the term "brain-media" for this attempt. By examining the relevance between the visualized image and the stimulation image, we are able to measure the efficiency of our proposed deep framework regarding the quality of such visualization and also the feasibility of exploring the concept of "brain-media". To ensure that such extracted multimedia elements remain meaningful, we introduce a dually conditioned learning technique in the proposed deep framework, where one condition is analyzing EEGs through deep learning to extract a class-dependent and more compact brain feature space utilizing the distinctive characteristics of hemispheric lateralization and brain stimulation, and the other is to extract expressive visual features assisting our automated analysis of brain activities as well as their visualizations aided by artificial intelligence. To support the proposed GAN framework, we create a combined-conditional space by merging the brain feature space with the visual feature space provoked by the stimuli. Extensive experiments are carried out and the results show that our proposed deep framework significantly outperforms representative existing state-of-the-art methods under several settings, especially in terms of both visualization and classification of brain responses to the evoked images. For the convenience of research dissemination, we make the source code openly accessible for downloading at GitHub.

Abstract:
We present a novel framework for human video motion transfer. Deviating from recent studies that use only a single source image, we propose to allow users to supply multiple source images by simply imitating some poses in the desired target video. To aggregate the appearance from multiple input images, we propose a JAFPro framework that incorporates two modules: an appearance fusion module that adaptively fuses the information in the supplied images and an appearance propagation module that propagates textures through flow-based warping to further improve the result. An attractive feature of JAFPro is that the quality of its results progressively improves as more imitating images are supplied. Furthermore, we build a new dataset containing a large variety of dancing videos in the wild. Extensive experiments conducted on this dataset demonstrate that JAFPro outperforms state-of-the-art methods both qualitatively and quantitatively. We will release our code and dataset upon publication of this work.

Abstract:
To evaluate the appearance and design language of car interiors, the surface quality and shapes are inspected by highly trained professionals. At the same time, virtual reality (VR) is making major progress, pushing the boundaries of this technology. In this paper, we evaluate the applicability of VR using head-mounted displays (HMDs) in an experiment where we had experts examine the design quality of an interior in VR and compared the results with the examination on a powerwall as well as in reality. Our goal is to find out to what extent current VR hardware can be used in the automotive industry and which advantages and disadvantages occur. Our results show that the experts are able to detect a number of flaws comparable to reality, with the powerwall being the superior medium in terms of flaws identified. Additionally, symptoms of cybersickness and a reduced lack of confidence were measured in the subjects using an HMD.

Abstract:
Cross-modal retrieval has attracted much attention in recent years due to its wide applications. Conventional approaches usually take one modality as the query to retrieve relevant data of another modality. In this paper, we focus on an emerging task in cross-modal retrieval, Composing Text and Image to Image Retrieval (CTI-IR), which aims at retrieving images relevant to a query image with text describing desired modifications to the query image. Compared with conventional cross-modal retrieval, the new task is particularly useful when the query image does not perfectly match the user's expectations. Generally, CTI-IR involves two underlying problems: how to manipulate visual features of the query image as specified by the text, and how to model the modality gap between the query and target. Most previous methods focus on solving the second problem. In this paper, we aim to deal with both problems simultaneously in a unified model. Specifically, the proposed method is based on a graph attention network and an adversarial learning network, which offers several merits. First, the query image and the modification text are constructed in a relation graph for learning text-adaptive representations. Second, semantic contents from the text are injected into the visual features through graph attention. Third, an adversarial loss is incorporated into the conventional cross-modal retrieval loss to learn more discriminative modality-invariant representations for CTI-IR. Extensive experiments on three benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.

Abstract:
Existing data acquisition literature for human behavior research provides wired solutions, mainly for controlled laboratory setups. In uncontrolled free-standing conversation settings, where participants are free to walk around, these solutions are unsuitable. While wireless solutions are employed in the broadcasting industry, they can be prohibitively expensive. In this work, we propose a modular and cost-effective wireless approach for synchronized multisensor data acquisition of social human behavior. Our core idea involves a cost-accuracy trade-off by using Network Time Protocol (NTP) as a source reference for all sensors. While commonly used as a reference in ubiquitous computing, NTP is widely considered to be insufficiently accurate as a reference for video applications, where Precision Time Protocol (PTP) or Global Positioning System (GPS) based references are preferred. We argue and show, however, that the latency introduced by using NTP as a source reference is adequate for human behavior research, and the subsequent cost and modularity benefits are a desirable trade-off for applications in this domain. We also describe one instantiation of the approach deployed in a real-world experiment to demonstrate the practicality of our setup in-the-wild.
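
As a rough illustration of how an NTP-style reference can align sensor clocks, the sketch below applies the standard NTP offset and round-trip delay formulas to four timestamps from one request/response exchange. It is not taken from the paper: the function names `ntp_offset_and_delay` and `to_reference_time` and the example timestamps are hypothetical, and a real deployment would rely on an NTP daemon rather than hand-rolled arithmetic.

```python
from typing import Tuple


def ntp_offset_and_delay(t0: float, t1: float, t2: float, t3: float) -> Tuple[float, float]:
    """Estimate clock offset and round-trip delay from one NTP-style exchange.

    t0: request sent (sensor clock)
    t1: request received (reference clock)
    t2: response sent (reference clock)
    t3: response received (sensor clock)
    """
    # Standard NTP estimate; assumes the network path is symmetric.
    offset = ((t1 - t0) + (t2 - t3)) / 2.0
    # Round-trip time minus the time the reference spent processing.
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


def to_reference_time(local_timestamp: float, offset: float) -> float:
    """Map a sensor-local timestamp onto the shared reference timeline."""
    return local_timestamp + offset


if __name__ == "__main__":
    # Hypothetical exchange: sensor clock 0.5 s behind, 0.1 s one-way latency.
    offset, delay = ntp_offset_and_delay(10.0, 10.6, 10.65, 10.25)
    print(f"offset={offset:.3f} s, round-trip delay={delay:.3f} s")
    print(f"aligned sample time={to_reference_time(3.0, offset):.3f} s")
```

The symmetric-path assumption is what limits accuracy: any asymmetry in network latency shows up directly as offset error, which is the cost side of the cost-accuracy trade-off the abstract describes.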

Abstract:
Technological developments in comprehensive video understanding - detecting and identifying visual elements of a scene, combined with audio understanding (music, speech), as well as aligned with textual information such as captions, subtitles, etc. and background knowledge - have been undergoing a significant revolution during recent years. The workshop brings together experts from academia and industry in order to discuss the latest progress in artificial intelligence research in topics related to multimodal information analysis, and in particular, semantic analysis of video, audio, and textual information for smart digital TV content production, access and delivery.

Abstract:
The first Multimodal Sentiment Analysis in Real-life Media (MuSe) 2020 was a Challenge-based Workshop held in conjunction with ACM Multimedia'20. It addresses three distinct 'in-the-wild' Sub-challenges: sentiment/emotion recognition (MuSe-Wild), emotion-target engagement (MuSe-Target) and trustworthiness detection (MuSe-Trust). A large multimedia dataset, MuSe-CaR, was used, which was specifically designed with the intention of improving machine understanding approaches of how sentiment (e.g. emotion) is linked to a topic in emotional, user-generated reviews. In this summary, we describe the motivation, the first-of-its-kind 'in-the-wild' database, challenge conditions, and participation, and give an overview of utilised state-of-the-art techniques.

Abstract:
In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or videos of specific people seen during the training phase. However, they fail to accurately morph the lip movements of arbitrary identities in dynamic, unconstrained talking face videos, resulting in significant parts of the video being out-of-sync with the new audio. We identify key reasons pertaining to this and hence resolve them by learning from a powerful lip-sync discriminator. Next, we propose new, rigorous evaluation benchmarks and metrics to accurately measure lip synchronization in unconstrained videos. Extensive quantitative evaluations on our challenging benchmarks show that the lip-sync accuracy of the videos generated by our Wav2Lip model is almost as good as real synced videos. We provide a demo video clearly showing the substantial impact of our Wav2Lip model, and also publicly release the code, models, and evaluation benchmarks on our website.

Abstract:
We propose a new Movie Map, which will enable users to explore a given city area using omnidirectional videos. Only one Movie Map prototype was developed in the 1980s; it was developed with analog video technology. Later, Google Street View (GSV) provided interactive panoramas from positions along streets around the world in Google Maps. Despite the wide use of GSV, it provides sparse images of streets, which often confuses users and lowers user satisfaction. Movie Map's use of videos instead of sparse images dramatically improves the user experience. Thus, we improve the Movie Map using state-of-the-art technology. We propose a new Movie Map system, with an interface for exploring cities. The system consists of four stages: acquisition, analysis, management, and interaction. In the acquisition stage, omnidirectional videos are taken along streets in target areas. Frames of the video are localized on the map, intersections are detected, and videos are segmented. Turning views at intersections are subsequently generated. By connecting the video segments following the specified movement in an area, we can view the streets better. The interface allows for easy exploration of a target area, and it can show virtual billboards of stores in the view. We conducted user studies to compare our system to GSV in a scenario where users could freely move and explore to find a landmark. The experiment showed that our system provided a better user experience than GSV.

Abstract:
Most methods for RGB-D salient object detection (SOD) utilize the same fusion strategy to explore the cross-modal complementary information at each level. However, this may ignore the different contributions that features from the two modalities make to the prediction at different levels. In this paper, we propose a novel top-down multi-level fusion structure where different fusion strategies are utilized to effectively explore the low-level and high-level features. This is achieved by designing the interweave fusion module (IFM) to effectively integrate the global information and designing the gated select fusion module (GSFM) to discriminatively select useful local information by filtering out unnecessary information from the RGB and depth data. Moreover, we propose an adaptive fusion module (AFM) to reintegrate the fused cross-modal features of each level to predict a more accurate result. Comprehensive experiments on 7 challenging benchmark datasets demonstrate that our method achieves competitive performance against 14 state-of-the-art RGB-D methods.

Abstract:
Despite remarkable advances in medical data analysis, progress is severely constrained by the limitations of relying on a single modality, usually medical imaging data. However, other modalities (such as patient-related information) should also be taken into account in the process of clinical decision-making. How to fully exploit multi-modal datasets is still under-explored. In this paper, we make a quantitative comparison of different machine learning approaches for the human spermatozoa quality prediction task, leveraging a multi-modal dataset. To empirically investigate the advantages and disadvantages of different machine learning approaches, we perform extensive experiments. Leveraging different features, we achieve state-of-the-art performance on most of the tasks. The obtained results show that simple models can provide better performance, which emphasizes the importance of avoiding overfitting. For the sake of reproducibility, we have released our code to facilitate the research community.

Abstract:
This paper deals with the challenging task of learning from different modalities by tackling the difficult problem of joint face recognition across abstract-like sketches, cartoons, caricatures and real-life photographs. Due to the significant variations in the abstract faces, building vision models for recognizing data from these modalities is extremely challenging. We propose a novel framework termed Meta-Continual Learning with Knowledge Embedding to address the task of joint sketch, cartoon, and caricature face recognition. In particular, we firstly present a deep relational network to capture and memorize the relation among different samples. Secondly, we present the construction of our knowledge graph that relates images with labels as the guidance of our meta-learner. We then design a knowledge embedding mechanism to incorporate the knowledge representation into our network. Thirdly, to mitigate catastrophic forgetting, we use a meta-continual model that updates our ensemble model and improves its prediction accuracy. With this meta-continual model, our network can learn from its past. The final classification is derived from our network by learning to compare the features of samples. Experimental results demonstrate that our approach achieves significantly higher performance compared with other state-of-the-art approaches.

Abstract:
The Joint Workshop on Aesthetic and Technical Quality Assessment of Multimedia and Media Analytics for Societal Trends (ATQAM/MAST) aims to bring together researchers and professionals working in fields ranging from computer vision, multimedia computing, and multimodal signal processing to psychology and social sciences. It is divided into two tracks: ATQAM and MAST. ATQAM track: Visual quality assessment techniques can be divided into image and video technical quality assessment (IQA and VQA, or broadly TQA) and aesthetics quality assessment (AQA). While TQA is a long-standing field, having its roots in media compression, AQA is relatively young. Both have received increased attention with developments in deep learning. The topics have mostly been studied separately, even though they deal with similar aspects of the underlying subjective experience of media. The aim is to bring together individuals in the two fields of TQA and AQA for the sharing of ideas and discussions on current trends, developments, issues, and future directions. MAST track: The research area of media content analytics has been traditionally used to refer to applications involving inference of higher-level semantics from multimedia content. However, multimedia is typically created for human consumption, and we believe it is necessary to adopt a human-centered approach to this analysis, which would not only enable a better understanding of how viewers engage with content but also how they impact each other in the process.

Abstract:
Large-scale events pose severe challenges to live video streaming service providers, who need to cope with high, peaking viewer numbers and the resulting fluctuating resource demands while keeping high levels of Quality of Experience (QoE) to avoid end-user frustration and churn. In this paper, we analyze a unique dataset consisting of more than a million 2018 FIFA World Cup mobile live streaming sessions, collected at a large national public broadcaster. Different from previous work, we analyze QoE and user engagement as well as their interaction in relation to specific soccer match events, which have the potential to trigger flash crowds during a match. Flash crowds are a particular challenge to video service providers, since they cause sudden load peaks and consequently increase the likelihood of quality problems. We further exploit the data to model viewer engagement over the course of a soccer match, and show that client counts follow very similar patterns of change across all matches. We believe that the analysis as well as the resulting models are valuable sources of insight for service providers, equipping them with tools for customer-centric resource and capacity management.

Abstract:
3D shape captioning is a challenging application in 3D shape understanding. Captions from recent multi-view based methods reveal that they cannot capture part-level characteristics of 3D shapes. This leads to a lack of detailed part-level description in captions, which humans tend to focus on. To resolve this issue, we propose ShapeCaptioner, a generative caption network, to perform 3D shape captioning from semantic parts detected in multiple views. Our novelty lies in learning the knowledge of part detection in multiple views from 3D shape segmentations and transferring this knowledge to facilitate learning the mapping from 3D shapes to sentences. Specifically, ShapeCaptioner aggregates the parts detected in multiple colored views using our novel part class specific aggregation to represent a 3D shape, and then employs a sequence-to-sequence model to generate the caption. Our superior results show that ShapeCaptioner can learn 3D shape features with more detailed part characteristics to facilitate better 3D shape captioning than previous work.

Abstract:
Reducing the misdiagnosis rate is a central concern in modern medicine. In clinical practice, group-based collective diagnosis is frequently exercised to curb the misdiagnosis rate. However, little effort has been dedicated to emulating the collective intelligence behind the group-based decision-making practice in computer-aided diagnosis research to this day. To fill this overlooked gap, this study introduces a novel deep neural network, titled PanelNet, that is able to computationally model and reproduce the aforesaid collective diagnosis capability demonstrated by a group of medical experts. To experimentally explore the validity of the new solution, we apply the proposed PanelNet to one of the key tasks in radiology: assessing malignant ratings of pulmonary nodules. For each nodule and a given panel, PanelNet is able to predict the statistical distribution of malignant ratings collectively judged by the panel of radiologists. Extensive experimental results consistently demonstrate that PanelNet outperforms multiple state-of-the-art computer-aided diagnosis methods applicable to the collective diagnostic task. To the best of our knowledge, no other collective computer-aided diagnosis method grounded on modern machine learning technologies has been previously proposed. By its design, PanelNet can also be easily applied to model collective diagnosis processes employed for other diseases.

Abstract:
The current coronavirus is not the only thing holding the world in suspense. Beyond this current health crisis, the world is facing several global challenges: climate change and environmental damage, access to clean water and food, and socio-economic inequalities, to name a few. The United Nations have very well framed these global challenges in their 17 Sustainability Goals for a future in prosperity and equal opportunities for all, to be achieved by 2030. There is no one simple solution, no one easy cure in sight to address these pressing challenges of our days. Rather, a collective approach by all of us is needed, which in sum will contribute to these goals. Obviously, the field of multimedia has contributed many tools and applications that are so much in demand these days to stay connected while keeping our distance. But there is much more we can offer to our common digital future. Our future health system, global access to education, decent work, and reducing inequalities are just some of the goals where our field can contribute. In this panel we will discuss which path we could follow.