ACMMM2021

Abstract:
The aim of person re-identification (Re-ID) is to retrieve a person of interest across multiple non-overlapping cameras. Re-ID has advanced significantly in recent years. However, real data annotation is costly, and model generalization ability is hindered by the lack of large-scale and diverse data. To address this problem, we propose a Weather Person pipeline that can automatically generate a synthesized Re-ID dataset with different weather, scenes, and natural lighting conditions. The pipeline is built on top of a game engine which contains a digital city, a weather and lighting simulation system, and various character models with manifold dressing. To train a generalizable Re-ID model from the large-scale virtual WePerson dataset, we design an adaptive sample selection strategy to close the domain gap and avoid redundancy. We also design an informative sampling method for a mini-batch sampler to accelerate the learning process. In addition, an efficient training method is introduced by adopting instance normalization to capture identity-invariant components from various appearances. We evaluate our pipeline using direct transfer on 3 widely-used real-world benchmarks, achieving competitive performance without any real-world image training. This dataset is a first attempt to evaluate diverse environmental factors in a controllable virtual engine, which provides important guidance for future generalizable Re-ID model design. Notably, we improve the current state-of-the-art accuracy from 38.5% to 46.4% on the challenging MSMT17 dataset. Dataset and code are available at https://github.com/lihe404/WePerson.
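
As an illustration of the instance normalization mentioned above, the following is a minimal sketch and not the released WePerson training code. The flat list-of-channels layout, the `eps` value, and the toy "dark" and "bright" inputs are assumptions made for the example. It shows that per-image, per-channel normalization removes a global brightness and contrast change.

```python
import math

def instance_norm(feature_map, eps=1e-5):
    """Normalize each channel of ONE image to zero mean and unit variance.

    feature_map: list of channels, each a flat list of activations
    (hypothetical layout chosen to keep the sketch dependency-free).
    """
    normalized = []
    for channel in feature_map:
        mean = sum(channel) / len(channel)
        var = sum((v - mean) ** 2 for v in channel) / len(channel)
        scale = 1.0 / math.sqrt(var + eps)
        normalized.append([(v - mean) * scale for v in channel])
    return normalized

# The same content under two simulated lighting conditions:
# "bright" is an affine (contrast + brightness) change of "dark".
dark = [[0.1, 0.2, 0.3, 0.4]]
bright = [[2.0 * v + 0.5 for v in dark[0]]]

out_dark = instance_norm(dark)
out_bright = instance_norm(bright)
```

After normalization the two outputs agree up to the small effect of `eps`, which is the sense in which appearance statistics are discarded while the spatial pattern is kept.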

Abstract:
Recently, there has been an increasing concern about the privacy issue raised by using personally identifiable information in machine learning. However, previous portrait matting methods were all based on identifiable portrait images. To fill the gap, we present P3M-10k in this paper, which is the first large-scale anonymized benchmark for Privacy-Preserving Portrait Matting. P3M-10k consists of 10,000 high-resolution face-blurred portrait images along with high-quality alpha mattes. We systematically evaluate both trimap-free and trimap-based matting methods on P3M-10k and find that existing matting methods show different generalization capabilities when following the Privacy-Preserving Training (PPT) setting, i.e., training on face-blurred images and testing on arbitrary images. To devise a better trimap-free portrait matting model, we propose P3M-Net, which leverages the power of a unified framework for both semantic perception and detail matting, and specifically emphasizes the interaction between them and the encoder to facilitate the matting process. Extensive experiments on P3M-10k demonstrate that P3M-Net outperforms the state-of-the-art methods in terms of both objective metrics and subjective visual quality. Besides, it shows good generalization capacity under the PPT setting, confirming the value of P3M-10k for facilitating future research and enabling potential real-world applications. The source code and dataset are available at https://github.com/JizhiziLi/P3M.

Abstract:
Deep Neural Networks (DNNs) are vulnerable to adversarial examples (Figure 1): adding inconspicuous perturbations to images can make DNN-based systems collapse. Most existing adversarial attacks are gradient-based and suffer from high latency and a heavy load on GPU memory. Generative-based adversarial attacks can avoid this limitation, and some related works propose approaches based on GANs. However, because GAN training is difficult to converge, the resulting adversarial examples have either weak attack ability or poor visual quality. In this work, we find that the discriminator may not be necessary for generative-based adversarial attacks, and propose the Symmetric Saliency-based Auto-Encoder (SSAE) to generate the perturbations, which is composed of a saliency map module and an angle-norm disentanglement of the features module. The advantage of our proposed method is that it does not depend on a discriminator and uses the generative saliency map to pay more attention to label-relevant regions. Extensive experiments across various tasks, datasets, and models demonstrate that the adversarial examples generated by SSAE not only make widely-used models collapse, but also achieve good visual quality. The code is available at: https://github.com/BravoLu/SSAE.
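
The abstract does not spell out the angle-norm disentanglement, so the sketch below only illustrates the generic decomposition that the name suggests: a feature vector is split into its L2 norm and its unit direction (the "angle" part). This is an assumption for illustration, not the SSAE module itself, and the function names are hypothetical.

```python
import math

def split_angle_norm(feature):
    """Split a feature vector into (norm, unit direction).

    Assumes a non-zero vector; a zero vector has no defined direction.
    """
    norm = math.sqrt(sum(v * v for v in feature))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    direction = [v / norm for v in feature]
    return norm, direction

def recombine(norm, direction):
    """Inverse of split_angle_norm: scale the unit direction by the norm."""
    return [norm * d for d in direction]

feature = [3.0, 4.0]
norm, direction = split_angle_norm(feature)
restored = recombine(norm, direction)
```

Because the two parts can be recombined exactly, a method can in principle alter one part (for example the direction) while constraining the other.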

Abstract:
Video procedural captioning (VPC), which generates procedural text from instructional videos, is an essential task for scene understanding and real-world applications. The main challenge of VPC is to describe how to manipulate materials accurately. This paper focuses on this challenge by designing a new VPC task, generating a procedural text from the clip sequence of an instructional video and material list. In this task, the state of materials is sequentially changed by manipulations, yielding their state-aware visual representations (e.g., eggs are transformed into cracked, stirred, then fried forms). The essential difficulty is to convert such visual representations into textual representations; that is, a model should track the material states after manipulations to better associate the cross-modal relations. To achieve this, we propose a novel VPC method, which modifies an existing textual simulator for tracking material states as a visual simulator and incorporates it into a video captioning model. Our experimental results show the effectiveness of the proposed method, which outperforms state-of-the-art video captioning models. We further analyze the learned embedding of materials to demonstrate that the simulators capture their state transition. The code and dataset are available from https://github.com/misogil0116/svpc

Abstract:
For an image with multiple scene texts, different people may be interested in different text information. Current text-aware image captioning models are not able to generate distinctive captions according to various information needs. To explore how to generate personalized text-aware captions, we define a new challenging task, namely Question-controlled Text-aware Image Captioning (Qc-TextCap). With questions as control signals, this task requires models to understand questions, find related scene texts and describe them together with objects fluently in human language. Based on two existing text-aware captioning datasets, we automatically construct two datasets, ControlTextCaps and ControlVizWiz, to support the task. We propose a novel Geometry and Question Aware Model (GQAM). GQAM first applies a Geometry-informed Visual Encoder to fuse region-level object features and region-level scene text features while considering spatial relationships. Then, we design a Question-guided Encoder to select the most relevant visual features for each question. Finally, GQAM generates a personalized text-aware caption with a Multimodal Decoder. Our model achieves better captioning performance and question answering ability than carefully designed baselines on both datasets. With questions as control signals, our model generates more informative and diverse captions than the state-of-the-art text-aware captioning model. Our code and datasets are publicly available at https://github.com/HAWLYQ/Qc-TextCap.

Abstract:
Visual object navigation is a fundamental task in Embodied AI. Previous works focus on category-wise navigation, in which navigating to any instance of the target object category is considered a success. Such methods may be effective for finding general objects. However, it may be more practical to navigate to a specific instance in real life, since our particular requirements are usually satisfied by specific instances rather than by all instances of one category. How to navigate to a specific instance has rarely been studied and is challenging for current works. In this paper, we introduce a new task of Instance Object Navigation (ION), where instance-level descriptions of targets are provided and instance-level navigation is required. In particular, multiple types of attributes such as colors, materials and object references are involved in the instance-level descriptions of the targets. To give the agent the ability of instance navigation, we propose a cascade framework with an Instance-Relation Graph (IRG) based navigator and an instance grounding module. To distinguish different instances of the same object category, we construct an instance-level graph instead of a category-level one, where instances are regarded as nodes, encoded with the representation of colors, materials and locations (bounding boxes). During navigation, the detected instances activate the corresponding nodes in the IRG, which are updated with a graph convolutional neural network (GCNN). The final instance prediction is obtained with the grounding module by selecting the candidate (instance) with maximum probability (a joint probability of category, color and material, obtained by corresponding regressors with softmax). For the task evaluation, we build a benchmark for instance-level object navigation on the AI2-Thor simulator, where over 27,735 object instance descriptions and navigation ground truth are automatically obtained through interaction with the simulator. The proposed model outperforms the baseline in instance-level metrics, showing that our proposed graph model can guide instance object navigation, while leaving promising room for further improvement. The project is available at https://github.com/LWJ312/ION.
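
The grounding step described above (pick the candidate with the maximum joint probability of category, color and material) can be sketched as follows. This is a minimal illustration, not the ION implementation. The dictionary layout of candidates and the choice to form the joint probability as a product of independent per-attribute softmax outputs are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_instance(candidates, target):
    """Return (index, score) of the candidate with the highest joint probability.

    candidates: list of dicts with per-attribute logits (hypothetical layout).
    target: dict mapping attribute name -> class index from the description.
    """
    best_index, best_score = -1, -1.0
    for i, cand in enumerate(candidates):
        score = 1.0
        for attr in ("category", "color", "material"):
            # Assumed independence: multiply per-attribute probabilities.
            score *= softmax(cand[attr])[target[attr]]
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score

# Two detected instances of the same category; only the second matches the color.
candidates = [
    {"category": [2.0, 0.0], "color": [2.0, 0.0], "material": [1.0, 0.0]},
    {"category": [2.0, 0.0], "color": [0.0, 2.0], "material": [1.0, 0.0]},
]
target = {"category": 0, "color": 1, "material": 0}
chosen, score = select_instance(candidates, target)
```

Both candidates score equally on category and material, so the color term decides the outcome, which is exactly the instance-level distinction a category-only navigator cannot make.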

Abstract:
This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal is built upon learning invariances to input skeleton representations and various skeleton augmentations via a noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations in a cross-contrastive manner. In addition, we contribute several skeleton-specific spatial and temporal augmentations which further encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning similarities between different skeleton representations as well as augmented views of the same sequence, the network is encouraged to learn higher-level semantics of the skeleton data than when only using the augmented views. Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets with multiple downstream tasks, including action recognition, action retrieval and semi-supervised learning. Code is available at https://github.com/fmthoker/skeleton-contrast.
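
As a rough illustration of the noise contrastive objective mentioned above, the sketch below computes an InfoNCE-style loss for one query. It is not the paper's code. The cosine similarity, the temperature of 0.1, and the toy 2-D embeddings are assumptions. In the inter-skeleton setting, the query and the positive would be embeddings of the same sequence under two different skeleton representations.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive among all keys."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

query = [1.0, 0.0]  # e.g. embedding from one skeleton representation
# Positive well aligned with the query: low loss.
loss_good = info_nce(query, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
# Positive orthogonal to the query while a negative is aligned: high loss.
loss_bad = info_nce(query, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
```

Minimizing this loss pulls the two views of the same sequence together and pushes other sequences away.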

Abstract:
Human vision is often adversely affected by complex environmental factors, especially in night vision scenarios. Thus, infrared cameras are often leveraged to help enhance the visual effects by detecting infrared radiation in the surrounding environment, but infrared videos are undesirable due to the lack of detailed semantic information. In such a case, an effective video-to-video translation method from the infrared domain to its visible light counterpart is strongly needed, one that overcomes the intrinsically huge gap between the infrared and visible fields. To address this challenging problem, we propose an infrared-to-visible (I2V) video translation method, I2V-GAN, to generate fine-grained and spatial-temporal consistent visible light videos given unpaired infrared videos. Technically, our model capitalizes on three types of constraints: 1) an adversarial constraint to generate synthetic frames that are similar to the real ones, 2) cyclic consistency with the introduced perceptual loss for effective content conversion as well as style preservation, and 3) similarity constraints across and within domains to enhance the content and motion consistency in both spatial and temporal spaces at a fine-grained level. Furthermore, the currently publicly available infrared and visible light datasets are mainly used for object detection or tracking, and some are composed of discontinuous images which are not suitable for video tasks. Thus, we provide a new dataset for infrared-to-visible video translation, which is named IRVI. Specifically, it has 12 consecutive video clips of vehicle and monitoring scenes, and both infrared and visible light videos can be split into 24,352 frames. Comprehensive experiments on IRVI validate that I2V-GAN is superior to the compared state-of-the-art methods in the translation of infrared-to-visible videos, with higher fluency and finer semantic details. Moreover, additional experimental results on the flower-to-flower dataset indicate I2V-GAN is also applicable to other video translation tasks. The code and IRVI dataset are available at https://github.com/BIT-DA/I2V-GAN.

Abstract:
As an essential means of communication for deaf-mute people, sign languages are expressed by human actions. To distinguish human actions for sign language understanding, the skeleton, which contains position information of the human pose, can provide an important cue, since different actions usually correspond to different poses/skeletons. However, the skeleton has not been fully studied for Sign Language Translation (SLT), especially for end-to-end SLT. Therefore, in this paper, we propose a novel end-to-end Skeleton-Aware neural Network (SANet) for video-based SLT. Specifically, to achieve end-to-end SLT, we design a self-contained branch for skeleton extraction. To efficiently guide the feature extraction from video with skeletons, we concatenate the skeleton channel and RGB channels of each frame for feature extraction. To distinguish the importance of clips, we construct a skeleton-based Graph Convolutional Network (GCN) for feature scaling, i.e., assigning an importance weight to each clip. The scaled features of each clip are then sent to a decoder module to generate spoken language. In our SANet, a joint training strategy is designed to optimize skeleton extraction and sign language translation jointly. Experimental results on two large-scale SLT datasets demonstrate the effectiveness of our approach, which outperforms the state-of-the-art methods. Our code is available at https://github.com/SignLanguageCode/SANet.

Abstract:
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding. However, the current training pipeline for dense detectors is compromised by many conjunctions that may not hold. In this paper, we investigate three such important conjunctions: 1) only samples assigned as positive in the classification head are used to train the regression head; 2) classification and regression share the same input feature and computational fields defined by the parallel head architecture; and 3) samples distributed in different feature pyramid layers are treated equally when computing the loss. We first carry out a series of pilot experiments to show that disentangling such conjunctions can lead to persistent performance improvement. Then, based on these findings, we propose the Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art dense object detectors. Extensive experiments on the MS COCO benchmark show that our approach can lead to 2.0 mAP, 2.4 mAP and 2.2 mAP absolute improvements on RetinaNet, FCOS, and ATSS baselines with negligible extra overhead. Notably, our best model reaches 55.0 mAP on the COCO test-dev set and 93.5 AP on the hard subset of WIDER FACE, achieving new state-of-the-art performance on these two competitive benchmarks. Code is available at https://github.com/zehuichen123/DDOD.

Abstract:
We study the problem of knowledge tracing (KT), where the goal is to trace students' knowledge mastery over time so as to make predictions on their future performance. Owing to the good representation capacity of deep neural networks (DNNs), recent advances in KT have increasingly concentrated on exploring DNNs to improve the performance of KT. However, we empirically reveal that DNN-based KT models may run the risk of overfitting, especially on small datasets, leading to limited generalization. In this paper, by leveraging the current advances in adversarial training (AT), we propose an efficient AT-based KT method (ATKT) to enhance the KT model's generalization and thus push the limit of KT. Specifically, we first construct adversarial perturbations and add them to the original interaction embeddings as adversarial examples. The original and adversarial examples are further used to jointly train the KT model, forcing it not only to be robust to the adversarial examples, but also to generalize better over the original ones. To better implement AT, we then present an efficient attentive-LSTM model as the KT backbone, where the key is a proposed knowledge hidden state attention module that adaptively aggregates information from previous knowledge hidden states while simultaneously highlighting the importance of the current knowledge hidden state to make a more accurate prediction. Extensive experiments on four public benchmark datasets demonstrate that our ATKT achieves new state-of-the-art performance. Code is available at: https://github.com/xiaopengguo/ATKT.
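
The idea of perturbing embeddings adversarially can be sketched on a toy model. This is not the ATKT implementation: the logistic predictor, its weights, and the rule "perturbation = epsilon times the L2-normalized loss gradient" (a common choice in adversarial training on embeddings) are all assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_loss(embedding, weights, label):
    """Binary cross-entropy of a toy logistic predictor on one embedding."""
    p = sigmoid(sum(w * e for w, e in zip(weights, embedding)))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

def adversarial_embedding(embedding, weights, label, epsilon=0.5):
    """Add an epsilon-sized perturbation along the loss gradient direction."""
    p = sigmoid(sum(w * e for w, e in zip(weights, embedding)))
    # Analytic gradient of the loss w.r.t. the embedding for this toy model.
    grad = [(p - label) * w for w in weights]
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]

weights = [1.0, -2.0]      # hypothetical model parameters
embedding = [0.5, 0.1]     # hypothetical interaction embedding
label = 1                  # the student answered correctly

adv = adversarial_embedding(embedding, weights, label)
clean_loss = bce_loss(embedding, weights, label)
adv_loss = bce_loss(adv, weights, label)
# Joint objective: train on both the original and the adversarial example.
joint_loss = clean_loss + adv_loss
```

The perturbed embedding yields a higher loss than the clean one, and training on the sum of both losses is one simple way to realize the joint training described in the abstract.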

Abstract:
Video data are distinct from images because of the extra temporal dimension, which results in more content dependencies from various perspectives. This increases the difficulty of learning representations for various video actions. Existing methods mainly focus on the dependency under a specific perspective, which cannot facilitate the categorization of complex video actions. This paper proposes a novel selective dependency aggregation (SDA) module, which adaptively exploits multiple types of video dependencies to refine the features. Specifically, we empirically investigate various long-range and short-range dependencies achieved by the multi-direction multi-scale feature squeeze and the dependency excitation. Query structured attention is then adopted to fuse them selectively, fully considering the diversity of videos' dependency preferences. Moreover, a channel reduction mechanism is used in SDA to keep the additional computation cost low. Finally, we show that the SDA module can be easily plugged into different backbones to form SDA-Nets and demonstrate its effectiveness, efficiency and robustness by conducting extensive experiments on several video benchmarks for action classification. The code and models will be available at https://github.com/ty-97/SDA.

Abstract:
The Transformer has achieved remarkable success in understanding 1- and 2-dimensional signals (e.g., NLP and Image Content Understanding). As a potential alternative to convolutional neural networks, it shares merits of strong interpretability, high discriminative power on hyper-scale data, and flexibility in processing varying-length inputs. However, its encoders naturally contain computationally intensive operations such as pair-wise self-attention, incurring a heavy computational burden when applied to complex 3-dimensional video signals. This paper presents the Token Shift Module (i.e., TokShift), a novel, zero-parameter, zero-FLOPs operator, for modeling temporal relations within each transformer encoder. Specifically, TokShift merely shifts part of the [Class] token features back and forth across adjacent frames in time. Then, we densely plug the module into each encoder of a plain 2D vision transformer for learning 3D video representation. It is worth noting that our TokShift transformer is a purely convolution-free video transformer pilot with computational efficiency for video understanding. Experiments on standard benchmarks verify its robustness, effectiveness, and efficiency. In particular, with input clips of 8/12 frames, the TokShift transformer achieves SOTA precision: 79.83%/80.40% on Kinetics-400, 66.56% on EGTEA-Gaze+, and 96.80% on UCF-101, comparable to or better than existing SOTA convolutional counterparts. Our code is open-sourced at: https://github.com/VideoNetworks/TokShift-Transformer.
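
A minimal sketch of the token-shift idea, assuming a TSM-style split: the first `C // fold_div` channels of each frame's [Class] token are taken from the previous frame, the next `C // fold_div` from the following frame, and the rest are left untouched, with zero padding at the clip boundaries. The split ratio, the padding, and the list layout are assumptions and may differ from the released TokShift code.

```python
def shift_class_tokens(cls_tokens, fold_div=4):
    """Shift part of each frame's [Class] token along the time axis.

    cls_tokens: list of T per-frame token vectors, each of length C.
    Only memory movement is involved: no parameters, no arithmetic.
    """
    num_frames = len(cls_tokens)
    channels = len(cls_tokens[0])
    fold = channels // fold_div
    shifted = [list(tok) for tok in cls_tokens]
    for t in range(num_frames):
        # Channels [0, fold): value comes from the previous frame.
        for c in range(fold):
            shifted[t][c] = cls_tokens[t - 1][c] if t > 0 else 0.0
        # Channels [fold, 2*fold): value comes from the next frame.
        for c in range(fold, 2 * fold):
            shifted[t][c] = cls_tokens[t + 1][c] if t < num_frames - 1 else 0.0
    return shifted

# Three frames, four channels per [Class] token.
tokens = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
shifted = shift_class_tokens(tokens)
# shifted == [[0.0, 6, 3, 4], [1, 10, 7, 8], [5, 0.0, 11, 12]]
```

After the shift, each frame's [Class] token mixes information from its temporal neighbours, so the subsequent per-frame self-attention can see adjacent frames at no extra compute cost.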

Abstract:
Tactile sensing plays an important role in robotic perception and manipulation tasks. To overcome the real-world limitations of data collection, simulating tactile response in a virtual environment comes as a desirable direction of robotic research. In this paper, we propose Elastic Interaction of Particles (EIP) for tactile simulation, which is capable of reflecting the elastic property of the tactile sensor as well as characterizing the fine-grained physical interaction during contact. Specifically, EIP models the tactile sensor as a group of coordinated particles, and the elastic property is applied to regulate the deformation of particles during contact. With the tactile simulation by EIP, we further propose a tactile-visual perception network that enables information fusion between tactile data and visual images. The perception network is based on a global-to-local fusion mechanism where multi-scale tactile features are aggregated to the corresponding local region of the visual modality with the guidance of tactile positions and directions. The fusion method exhibits superiority regarding the 3D geometric reconstruction task. Our code for EIP is available at https://github.com/yikaiw/EIP.

Abstract:
We propose a novel implicit feature refinement module for high-quality instance segmentation. Existing image/video instance segmentation methods rely on explicitly stacked convolutions to refine instance features before the final prediction. In this paper, we first give an empirical comparison of different refinement strategies, which reveals that the widely-used four consecutive convolutions are not necessary. As an alternative, weight-sharing convolution blocks provide competitive performance. When such a block is iterated infinitely many times, the block output eventually converges to an equilibrium state. Based on this observation, the implicit feature refinement (IFR) is developed by constructing an implicit function. The equilibrium state of instance features can be obtained by fixed-point iteration via a simulated infinite-depth network. Our IFR enjoys several advantages: 1) it simulates an infinite-depth refinement network while only requiring the parameters of a single residual block; 2) it produces high-level equilibrium instance features with a global receptive field; 3) it serves as a plug-and-play general module easily extended to most object recognition frameworks. Experiments on the COCO and YouTube-VIS benchmarks show that our IFR achieves improved performance on state-of-the-art image/video instance segmentation frameworks, while reducing the parameter burden (e.g., 1% AP improvement on Mask R-CNN with only 30.0% of the parameters in the mask head). Code will be made available at https://github.com/lufanma/IFR.git.
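
The equilibrium idea can be illustrated with a scalar toy problem. The sketch below is not the IFR module: the `tanh` block, its weight of 0.5 (chosen so the map is a contraction and the iteration provably converges), and the tolerance are assumptions.

```python
import math

def block(z, x, weight=0.5):
    """A toy weight-shared refinement block (scalar stand-in for a residual block).

    z: current refined feature, x: input feature injected at every step.
    """
    return math.tanh(weight * z + x)

def solve_equilibrium(x, tol=1e-10, max_iters=1000):
    """Apply the same block repeatedly until the output stops changing."""
    z = 0.0
    for step in range(1, max_iters + 1):
        z_next = block(z, x)
        if abs(z_next - z) < tol:
            return z_next, step
        z = z_next
    return z, max_iters

z_star, steps = solve_equilibrium(0.3)
# At equilibrium, one more application of the block leaves z unchanged.
residual = abs(block(z_star, 0.3) - z_star)
```

The fixed point plays the role of the output of an infinitely deep stack of the same block, yet only one set of block parameters is stored.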

Abstract:
We study and address the cross-modal retrieval problem, which lies at the heart of visual-textual processing. Its major challenge lies in how to effectively learn a shared multi-modal feature space where the discrepancies of semantically related pairs, such as images and texts, are minimized regardless of their modalities. Most current methods focus on reasoning about cross-modality semantic relations within individual image-text pairs to learn the common representation. However, they overlook more global, structural inter-pair knowledge within the dataset, i.e., the graph-structured semantics within each training batch. In this paper, we introduce a graph-based, semantic-constrained learning framework to comprehensively explore the intra- and inter-modality information for cross-modal retrieval. Our idea is to maximally explore the structures of labeled data in graph latent space, and use them as semantic constraints to enforce feature embeddings from the semantically-matched (image-text) pairs to be more similar and vice versa. This gives rise to a novel graph-constrained common embedding learning paradigm for cross-modal retrieval, which has been largely under-explored up to now. Moreover, a GAN-based dual learning approach is used to further improve the discriminability and model the joint distribution across different modalities. Our fully-equipped approach, called Graph-constrained Cross-modal Retrieval (GCR), is able to mine intrinsic structures of training data for model learning and enable reliable cross-modal retrieval. We empirically demonstrate that our GCR can achieve higher accuracy than existing state-of-the-art approaches on the Wikipedia, NUS-WIDE-10K, PKU XMedia and Pascal Sentence datasets. Code is available at https://github.com/neoscheung/GCR.

Abstract:
Polygon meshes are a popular representation in computer graphics. They efficiently provide delineations of complex 3D shapes. However, their irregular structure hinders mesh analysis efforts in deep learning frameworks; few neural networks exist to describe meshes. MeshNet is a pioneer in this direction. In this paper, we propose a novel neural network that is substantially deeper than its MeshNet predecessor. This increase in depth is achieved through our specialized convolution and pooling blocks that operate on mesh faces. Our network named MeshNet++ learns local structures at multiple scales and is also robust to shortcomings of mesh decimation. We evaluated it for the shape classification task on various data sets, and results significantly higher than state-of-the-art were observed. In particular, results demonstrated that even a small number of examples suffice for training MeshNet++. Our code is available at https://github.com/VimsLab/MeshNet2.

Abstract:
This paper proposes a novel deep learning-based video object matting method that can achieve temporally coherent matting results. Its key component is an attention-based temporal aggregation module that maximizes image matting networks' strength for video matting networks. This module computes temporal correlations for pixels adjacent to each other along the time axis in feature space, which is robust against motion noises. We also design a novel loss term to train the attention weights, which drastically boosts the video matting performance. Besides, we show how to effectively solve the trimap generation problem by fine-tuning a state-of-the-art video object segmentation network with a sparse set of user-annotated keyframes. To facilitate video matting and trimap generation networks' training, we construct a large-scale video matting dataset with 80 training and 28 validation foreground video clips with ground-truth alpha mattes. Experimental results show that our method can generate high-quality alpha mattes for various videos featuring appearance change, occlusion, and fast motion. Our code and dataset can be found at: https://github.com/yunkezhang/TCVOM

Abstract:
In graphic design, it is common for humans to visually arrange various elements according to their design intent and semantics. For example, a title text almost always appears on top of other elements in a document. In this work, we generate graphic layouts that can flexibly incorporate such design semantics, specified either implicitly or explicitly by a user. We optimize using the latent space of an off-the-shelf layout generation model, allowing our approach to be complementary to and used with existing layout generation models. Our approach builds on a generative layout model based on a Transformer architecture, and formulates layout generation as a constrained optimization problem where design constraints are used for element alignment, overlap avoidance, or any other user-specified relationship. We show in the experiments that our approach is capable of generating realistic layouts in both constrained and unconstrained generation tasks with a single model. The code is available at https://github.com/ktrk115/const_layout.

Abstract:
Although deep learning based image compression methods have achieved promising progress these days, the performance of these methods still cannot match the latest compression standard Versatile Video Coding (VVC). Most of the recent developments focus on designing a more accurate and flexible entropy model that can better parameterize the distributions of the latent features. However, few efforts are devoted to structuring a better transformation between the image space and the latent feature space. In this paper, instead of employing previous autoencoder style networks to build this transformation, we propose an enhanced Invertible Encoding Network with invertible neural networks (INNs) to largely mitigate the information loss problem for better compression. Experimental results on the Kodak, CLIC, and Tecnick datasets show that our method outperforms the existing learned image compression methods and compression standards, including VVC (VTM 12.1), especially for high-resolution images. Our source code is available at https://github.com/xyq7/InvCompress.
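
To illustrate why invertible networks avoid information loss in the transformation itself, here is a minimal sketch of an additive coupling layer, a common INN building block. It is not the paper's Invertible Encoding Network. The `shift_net` function is a hypothetical stand-in for a small learned sub-network, and the input length is assumed to be even.

```python
def shift_net(half):
    """Arbitrary, non-invertible function of one half of the input
    (hypothetical stand-in for a learned sub-network)."""
    return [0.5 * v * v + 1.0 for v in half]

def coupling_forward(x):
    """y1 = x1, y2 = x2 + shift_net(x1). Assumes len(x) is even."""
    mid = len(x) // 2
    x1, x2 = x[:mid], x[mid:]
    y2 = [b + s for b, s in zip(x2, shift_net(x1))]
    return x1 + y2

def coupling_inverse(y):
    """x1 = y1, x2 = y2 - shift_net(y1): exact inverse of coupling_forward."""
    mid = len(y) // 2
    y1, y2 = y[:mid], y[mid:]
    x2 = [b - s for b, s in zip(y2, shift_net(y1))]
    return y1 + x2

x = [0.2, -1.0, 3.0, 0.5]
y = coupling_forward(x)
x_rec = coupling_inverse(y)
```

The inverse never needs to invert `shift_net`, so the sub-network can be arbitrarily expressive while the overall mapping stays exactly reversible, unlike a typical autoencoder-style transform.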

Abstract:
Portrait images often suffer from undesirable shadows cast by casual objects or even the face itself. While existing methods for portrait shadow removal require training on a large-scale synthetic dataset, we propose the first unsupervised method for portrait shadow removal without any training data. Our key idea is to leverage the generative facial priors embedded in the off-the-shelf pretrained StyleGAN2. To achieve this, we formulate the shadow removal task as a layer decomposition problem: a shadowed portrait image is constructed by the blending of a shadow image and a shadow-free image. We propose an effective progressive optimization algorithm to learn the decomposition process. Our approach can also be extended to portrait tattoo removal and watermark removal. Qualitative and quantitative experiments on a real-world portrait shadow dataset demonstrate that our approach achieves comparable performance with supervised shadow removal methods. Our source code is available at https://github.com/YingqingHe/Shadow-Removal-via-Generative-Priors.

Abstract:
Recently, with the advance of deep Convolutional Neural Networks (CNNs), person Re-Identification (Re-ID) has witnessed great success in various applications. However, with the limited receptive fields of CNNs, it is still challenging to extract discriminative representations in a global view for persons under non-overlapped cameras. Meanwhile, Transformers demonstrate strong abilities in modeling long-range dependencies for spatial and sequential data. In this work, we take advantage of both CNNs and Transformers, and propose a novel learning framework named Hierarchical Aggregation Transformer (HAT) for image-based person Re-ID with high performance. To achieve this goal, we first propose a Deeply Supervised Aggregation (DSA) to recurrently aggregate hierarchical features from CNN backbones. With multi-granularity supervision, the DSA can enhance multi-scale features for person retrieval, which is very different from previous methods. Then, we introduce a Transformer-based Feature Calibration (TFC) to integrate low-level detail information as the global prior for high-level semantic information. The proposed TFC is inserted into each level of hierarchical features, resulting in great performance improvements. To the best of our knowledge, this work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID. Comprehensive experiments on four large-scale Re-ID benchmarks demonstrate that our method shows better results than several state-of-the-art methods. The code is released at https://github.com/AI-Zhpp/HAT.

Abstract:
Despite the success of single-domain person re-identification (ReID), current supervised models degrade dramatically when deployed to unseen domains, mainly due to the discrepancy across cameras. To tackle this issue, we propose an Adversarial Disentangling Learning (ADL) framework to decouple camera-related and ID-related features, which can be readily used for camera-agnostic person ReID. ADL adopts a discriminative approach instead of the mainstream generative styles in disentangling methods, e.g., GAN- or VAE-based ones, because for the person ReID task only the information needed to discriminate IDs is required, and the additional information needed to generate images is redundant and may be noisy. Specifically, our model involves a feature separation module that encodes images into two separate feature spaces and a disentangled feature learning module that performs adversarial training to minimize mutual information. We design an effective solution to approximate and minimize mutual information by transforming it into a discrimination problem. The two modules are co-designed to obtain strong generalization ability using only the source dataset. Extensive experiments on three public benchmarks show that our method outperforms the state-of-the-art generalizable person ReID model by a large margin. Our code is publicly available at https://github.com/luckyaci/ADL_ReID.

Abstract:
Translating e-commerce product descriptions, a.k.a. product-oriented machine translation (PMT), is essential to serve e-shoppers all over the world. However, due to the domain specialty, the PMT task is more challenging than traditional machine translation problems. Firstly, product descriptions contain much specialized jargon, which is ambiguous to translate without the product image. Secondly, product descriptions are related to the image in more complicated ways than standard image descriptions, involving various visual aspects such as objects, shapes, colors or even subjective styles. Moreover, existing PMT datasets are too small in scale to support the research. In this paper, we first construct a large-scale bilingual product description dataset called Fashion-MMT, which contains over 114k noisy and 40k manually cleaned description translations with multiple product images. To effectively learn semantic alignments among product images and bilingual texts in translation, we design a unified product-oriented cross-modal cross-lingual model for pre-training and fine-tuning. Experiments on the Fashion-MMT and Multi30k datasets show that our model significantly outperforms the state-of-the-art models even pre-trained on the same dataset. It is also shown to benefit more from large-scale noisy data to improve the translation quality. We will release the dataset and code at https://github.com/syuqings/Fashion-MMT.

Abstract:
Transformer has achieved great success in computer vision, while how to split patches in an image remains a problem. Existing methods usually use a fixed-size patch embedding which might destroy the semantics of objects. To address this problem, we propose a new Deformable Patch (DePatch) module which learns to adaptively split the images into patches with different positions and scales in a data-driven way rather than using predefined fixed patches. In this way, our method can well preserve the semantics in patches. The DePatch module can work as a plug-and-play module, which can easily be incorporated into different transformers to achieve an end-to-end training. We term this DePatch-embedded transformer as Deformable Patch-based Transformer (DPT) and conduct extensive evaluations of DPT on image classification and object detection. Results show DPT can achieve 81.8% top-1 accuracy on ImageNet classification, and 43.7% box AP with RetinaNet, 44.3% with Mask R-CNN on MSCOCO object detection. Code has been made available at: https://github.com/CASIA-IVA-Lab/DPT.

Abstract:
Domain adaptive semantic segmentation is recognized as a promising technique to alleviate the domain shift between the labeled source domain and the unlabeled target domain in many real-world applications, such as autonomous driving. However, large amounts of source domain data often introduce significant costs in storage and training, and sometimes the source data is inaccessible due to privacy policies. To address these problems, we investigate domain adaptive semantic segmentation without source data, which assumes that the model is pre-trained on the source domain and then adapted to the target domain without accessing the source data any longer. Since there is no supervision from the source domain data, many self-training methods tend to fall into the winner-takes-all dilemma, where the majority classes totally dominate the segmentation networks and the networks fail to classify the minority classes. Consequently, we propose an effective framework for this challenging problem with two components: positive learning and negative learning. In positive learning, we select the class-balanced pseudo-labeled pixels with an intra-class threshold, while in negative learning, for each pixel, we investigate which category the pixel does not belong to with the proposed heuristic complementary label selection. Notably, our framework can be easily implemented and incorporated with other methods to further enhance the performance. Extensive experiments on two widely-used synthetic-to-real benchmarks demonstrate our claims and the effectiveness of our framework, which outperforms the baseline by a large margin. Code is available at https://github.com/fumyou13/LDBE.
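
The two selection steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the intra-class threshold is modelled as "keep the top fraction of pixels within each predicted class", and complementary labels as "classes whose probability is below a small cutoff". The function names and the `keep_ratio` and `low` parameters are hypothetical and are not taken from the released code.

```python
def select_positive(probs, keep_ratio=0.5):
    """Pick confident pixels per predicted class (class-balanced selection).

    probs: list of per-pixel class-probability lists.
    Returns {pixel_index: pseudo_label} for the kept pixels.
    """
    by_class = {}
    for idx, p in enumerate(probs):
        label = max(range(len(p)), key=p.__getitem__)
        by_class.setdefault(label, []).append((p[label], idx))
    selected = {}
    for label, items in by_class.items():
        items.sort(reverse=True)
        # Intra-class threshold: keep the top fraction within each class,
        # so rare classes are not crowded out by frequent ones.
        keep = max(1, int(len(items) * keep_ratio))
        for _, idx in items[:keep]:
            selected[idx] = label
    return selected


def select_negative(probs, low=0.1):
    """For each pixel, list the classes it very likely does NOT belong to."""
    return [[c for c, v in enumerate(p) if v < low] for p in probs]


probs = [
    [0.90, 0.08, 0.02],  # confident majority-class pixel
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.55, 0.40, 0.05],
    [0.30, 0.45, 0.25],  # minority-class pixel with low confidence
]
positive = select_positive(probs)
negative = select_negative(probs)
```

In this toy input, a single global confidence threshold of 0.5 would discard the only pixel predicted as class 1, whereas the per-class rule keeps it.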

Abstract:
Video-text retrieval is an important yet challenging task in vision-language understanding, which aims to learn a joint embedding space where related video and text instances are close to each other. Most current works simply measure the video-text similarity based on video-level and text-level embeddings. However, neglecting more fine-grained or local information leads to insufficient representation. Some works exploit the local details by disentangling sentences, but overlook the corresponding videos, causing an asymmetry in the video-text representation. To address the above limitations, we propose a Hierarchical Alignment Network (HANet) to align representations at different levels for video-text matching. Specifically, we first decompose video and text into three semantic levels, namely event (video and text), action (motion and verb), and entity (appearance and noun). Based on these, we naturally construct hierarchical representations in the individual-local-global manner, where the individual level focuses on the alignment between frame and word, the local level focuses on the alignment between video clip and textual context, and the global level focuses on the alignment between the whole video and text. Alignments at different levels capture fine-to-coarse correlations between video and text, and take advantage of the complementary information among the three semantic levels. Besides, our HANet is also richly interpretable by explicitly learning key semantic concepts. Extensive experiments on two public datasets, namely MSR-VTT and VATEX, show that the proposed HANet outperforms other state-of-the-art methods, which demonstrates the effectiveness of hierarchical representation and alignment. Our code is publicly available at https://github.com/Roc-Ng/HANet.

Abstract:
Multi-view subspace clustering has received widespread attention for effectively fusing multi-view information in multimedia applications. Since the cubic time complexity of most existing approaches makes them challenging to apply to realistic large-scale scenarios, some researchers have addressed this challenge by sampling anchor points to capture distributions in different views. However, the separation of the heuristic sampling and clustering processes leads to weakly discriminative anchor points. Moreover, the complementary multi-view information has not been well utilized since the graphs are constructed independently by the anchors from the corresponding views. To address these issues, we propose Scalable Multi-view Subspace Clustering with Unified Anchors (SMVSC). To be specific, we combine anchor learning and graph construction into a unified optimization framework. Therefore, the learned anchors can represent the actual latent data distribution more accurately, leading to a more discriminative clustering structure. Most importantly, the linear time complexity of our proposed algorithm allows the multi-view subspace clustering approach to be applied to large-scale data. Then, we design a four-step alternating optimization algorithm with proven convergence. Compared with state-of-the-art multi-view subspace clustering methods and large-scale oriented methods, the experimental results on several datasets demonstrate that our SMVSC method achieves comparable or better clustering performance much more efficiently. The code of SMVSC is available at https://github.com/Jeaninezpp/SMVSC.

Abstract:
Video captioning aims to automatically generate natural language sentences that can describe the visual contents of a given video. Existing generative models like encoder-decoder frameworks cannot explicitly explore the object-level interactions and frame-level information from complex spatio-temporal data to generate semantic-rich captions. Our main contribution is to identify three key problems in a joint framework for future video summarization tasks. 1) Enhanced Object Proposal: we propose a novel Conditional Graph that can fuse spatio-temporal information into latent object proposals. 2) Visual Knowledge: Latent Proposal Aggregation is proposed to dynamically extract visual words with higher semantic levels. 3) Sentence Validation: A novel Discriminative Language Validator is proposed to verify generated captions so that key semantic concepts can be effectively preserved. Our experiments on two public datasets (MSVD and MSR-VTT) show significant improvements over state-of-the-art approaches on all metrics, especially for BLEU-4 and CIDEr. Our code is available at https://github.com/baiyang4/D-LSG-Video-Caption.

Abstract:
Although existing deep learning-based image de-raining methods have achieved promising results, most of them only extract single-scale features, and neglect the fact that similar rain streaks appear repeatedly across different scales. Therefore, this paper aims to explore the cross-scale cues in a multi-scale fashion. Specifically, we first introduce an adaptive-kernel pyramid to provide effective multi-scale information. Then, we design two cross-scale similarity attention blocks (CSSABs) to search spatial and channel relationships between two scales, respectively. The spatial CSSAB explores the spatial similarity between pixels of cross-scale features, while the channel CSSAB emphasizes the interdependencies among cross-scale features. To further improve the diversity of features, we adopt the wavelet transformation and multi-head mechanism in CSSABs to generate multifocal features which focus on different areas. Finally, based on our CSSABs, we construct an effective multifocal attention-based cross-scale network, which exhaustively utilizes the cross-scale correlations of both rain streaks and background, to achieve image de-raining. Experiments show the superiority of our network over state-of-the-art image de-raining approaches both qualitatively and quantitatively. The source code and pre-trained models are available at https://github.com/zhangzheyu0/Multifocal_derain.

Abstract:
Due to the remarkable progress in recent years, deep face recognition is in great need of public support for practical model production and further exploration. The demands are threefold: 1) a modular training scheme, 2) standard and automatic evaluation, and 3) groundwork for deployment. To meet these demands, we present a novel open-source project, named FaceX-Zoo, which is constructed with a modular and scalable design, and oriented to the academic and industrial community of face-related analysis. FaceX-Zoo provides 1) the training module with various choices of backbone and supervisory head; 2) the evaluation module that enables standard and automatic tests on the most popular benchmarks; 3) the module of a simple yet fully functional face SDK for the validation and primary application of end-to-end face recognition; 4) the additional module that integrates a group of useful tools. Based on these easy-to-use modules, FaceX-Zoo can help the community to easily build state-of-the-art solutions for deep face recognition and for related problems such as the newly-emerged challenge of masked face recognition caused by the worldwide COVID-19 pandemic. Besides, FaceX-Zoo can be easily upgraded and scaled up along with further exploration in face-related fields. The source code and models have been released and have received over 900 stars at https://github.com/JDAI-CV/FaceX-Zoo.

Abstract:
By mapping a truncated optimization method into a deep neural network, the deep unfolding network (DUN) has attracted growing attention in compressive sensing (CS) due to its good interpretability and high performance. Each stage in DUNs corresponds to one iteration in optimization. By understanding DUNs from the perspective of the human brain's memory processing, we find that there exist two issues in existing DUNs. One is that the information between every two adjacent stages, which can be regarded as short-term memory, is usually severely lost. The other is that there is no explicit mechanism to ensure that the previous stages affect the current stage, which means memory is easily forgotten. To solve these issues, in this paper, a novel DUN with persistent memory for CS is proposed, dubbed Memory-Augmented Deep Unfolding Network (MADUN). We design a memory-augmented proximal mapping module (MAPMM) by combining two types of memory augmentation mechanisms, namely High-throughput Short-term Memory (HSM) and Cross-stage Long-term Memory (CLM). HSM is exploited to allow DUNs to transmit multi-channel short-term memory, which greatly reduces information loss between adjacent stages. CLM is utilized to develop the dependency of deep information across cascading stages, which greatly enhances network representation capability. Extensive CS experiments on natural and MR images show that, with its strong ability to maintain and balance information, our MADUN outperforms existing state-of-the-art methods by a large margin. The source code is available at https://github.com/jianzhangcs/MADUN/.

Abstract:
With conventional activation functions such as ReLU, LeakyReLU, and PReLU, the negative parts of feature maps are simply truncated or linearized, which may result in an inflexible structure and undesired information distortion. In this paper, we propose a simple but effective Bilateral Activation Mechanism (BAM) which can be applied to the activation function to offer an efficient feature extraction model. Based on BAM, the Bilateral ReLU Residual Block (BRRB), which still sufficiently keeps the nonlinear characteristic of ReLU, is constructed to separate the feature maps into two parts, i.e., the positive and negative components, and then adaptively represent and extract the features by two independent convolution layers. Besides, our mechanism does not add any extra parameters or computational burden to the network. We finally embed the BRRB into a basic ResNet architecture, called BRResNet, which easily obtains state-of-the-art performance in two image fusion tasks, i.e., pansharpening and hyperspectral image super-resolution (HISR). Additionally, deeper analysis and an ablation study demonstrate the effectiveness of BAM, the lightweight property of the network, etc. Please find the code on the project page: https://liangjiandeng.github.io/Projects_Res/bam_mm2021.html

Abstract:
This paper presents a policy-driven sequential image augmentation approach for image-related tasks. Our approach applies a sequence of image transformations (e.g., translation, rotation) over a training image, one transformation at a time, with the augmented image from the previous time step treated as the input for the next transformation. This sequential data augmentation substantially improves sample diversity, leading to improved test performance, especially for data-hungry models (e.g., deep neural networks). However, the search for the optimal transformation of each image at each time step of the sequence has high complexity due to its combinatorial nature. To address this challenge, we formulate the search task as a sequential decision process and introduce a deep policy network that learns to produce transformations based on image content. We also develop an iterative algorithm to jointly train a classifier and the policy network in the reinforcement learning setting. The immediate reward of a potential transformation is defined to encourage transformations producing hard samples for the current classifier. At each iteration, we employ the policy network to augment the training dataset, train a classifier with the augmented data, and train the policy net with the aid of the classifier. We apply the above approach to both public image classification benchmarks and a newly collected image dataset for material recognition. Comparisons to alternative augmentation approaches show that our policy-driven approach achieves comparable or improved classification performance while using significantly fewer augmented images. The code is available at https://github.com/Paul-LiPu/rl_autoaug.
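
The sequential scheme described above, where each step's output is the next step's input and a policy chooses the transformation from image content, can be sketched as follows. This is only an illustration: `toy_policy` is a hypothetical deterministic stand-in for the learned deep policy network, and the three transformations and the tiny list-of-lists "image" are assumptions made for the example.

```python
def shift_right(img):
    """Translate by one pixel to the right, wrapping around."""
    return [row[-1:] + row[:-1] for row in img]


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip(img):
    """Mirror the image horizontally."""
    return [row[::-1] for row in img]


TRANSFORMS = [shift_right, rotate90, flip]


def toy_policy(img, t):
    """Hypothetical stand-in for the policy network: maps content to an action."""
    return (sum(map(sum, img)) + t) % len(TRANSFORMS)


def augment(img, policy, steps=3):
    """Apply `steps` transformations, one at a time, feeding each output forward."""
    for t in range(steps):
        action = policy(img, t)  # index into TRANSFORMS
        img = TRANSFORMS[action](img)
    return img


augmented = augment([[1, 2], [3, 4]], toy_policy)
```

With three candidate transformations and three steps there are already 27 possible sequences per image, which hints at why the abstract frames the search as a sequential decision process rather than an exhaustive one.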

Abstract:
Relation extraction (RE) is a fundamental process in constructing knowledge graphs. However, previous methods on relation extraction suffer sharp performance decline in short and noisy social media texts due to a lack of contexts. Fortunately, the related visual contents (objects and their relations) in social media posts can supplement the missing semantics and help to extract relations precisely. We introduce the multimodal relation extraction (MRE), a task that identifies textual relations with visual clues. To tackle this problem, we present a large-scale dataset which contains 15000+ sentences with 23 pre-defined relation categories. Considering that the visual relations among objects are corresponding to textual relations, we develop a dual graph alignment method to capture this correlation for better performance. Experimental results demonstrate that visual contents help to identify relations more precisely against the text-only baselines. Besides, our alignment method can find the correlations between vision and language, resulting in better performance. Our dataset and code are available at https://github.com/thecharm/Mega.

Abstract:
Recently, convolutional neural networks (CNNs) have been widely employed to promote the face hallucination due to the ability to predict high-frequency details from a large number of samples. However, most of them fail to take into account the overall facial profile and fine texture details simultaneously, resulting in reduced naturalness and fidelity of the reconstructed face, and further impairing the performance of downstream tasks (e.g., face detection, facial recognition). To tackle this issue, we propose a novel external-internal split attention group (ESAG), which encompasses two paths responsible for facial structure information and facial texture details, respectively. By fusing the features from these two paths, the consistency of facial structure and the fidelity of facial details are strengthened at the same time. Then, we propose a split-attention in split-attention network (SISN) to reconstruct photorealistic high-resolution facial images by cascading several ESAGs. Experimental results on face hallucination and face recognition unveil that the proposed method not only significantly improves the clarity of hallucinated faces, but also encourages the subsequent face recognition performance substantially. Codes have been released at https://github.com/mdswyz/SISN-Face-Hallucination.

Abstract:
The egocentric video provides a unique view of event participants to show their attention, vision, and interaction with objects. In this paper, we introduce Ego-Deliver, a new large-scale egocentric video benchmark recorded by takeaway riders about their daily work. To the best of our knowledge, Ego-Deliver presents the first attempt in understanding activities from the takeaway delivery process while being one of the largest egocentric video action datasets to date. Our dataset provides a total of 5,360 videos with more than 139,000 multi-track annotations and 45 different attributes, which we believe is pivotal to future research in this area. We introduce the FS-Net architecture, a new anchor-free action detection approach handling extreme variations of action durations. We partition videos into fragments and build dynamic graphs over fragments, where multi-fragment context information is aggregated to boost fragment classification. A splicing and scoring module is applied to obtain final action proposals. Our experimental evaluation confirms that the proposed framework outperforms existing approaches on the proposed Ego-Deliver benchmark and is competitive on other popular benchmarks. In our current version, Ego-Deliver is used to make a comprehensive comparison between algorithms for activity detection. We also show its application to action recognition with promising results. The dataset, toolkits and baseline results will be made available at: https://egodeliver.github.io/EgoDeliver_Dataset/

Abstract:
We propose the Uncertainty Augmented Context Attention network (UACANet) for polyp segmentation, which considers the uncertain area of the saliency map. We construct a modified version of a U-Net-shaped network with an additional encoder and decoder, compute a saliency map in each bottom-up stream prediction module, and propagate it to the next prediction module. In each prediction module, the previously predicted saliency map is utilized to compute foreground, background and uncertain area maps, and we aggregate the feature map with the three area maps to obtain one representation for each area. Then we compute the relation between each representation and each pixel in the feature map. We conduct experiments on five popular polyp segmentation benchmarks, Kvasir, CVC-ClinicDB, ETIS, CVC-ColonDB and CVC-300, and our method achieves state-of-the-art performance. In particular, we achieve 76.6% mean Dice on the ETIS dataset, which is a 13.8% improvement compared to the previous state-of-the-art method. Source code is publicly available at https://github.com/plemeri/UACANet
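
One way to picture the three area maps and the per-area aggregation is the sketch below. The abstract does not state the formulas, so the definitions here (distance of the saliency value from 0.5 for foreground and background, closeness to 0.5 for the uncertain area) and the names `area_maps` and `weighted_mean` are assumptions made for illustration only.

```python
def area_maps(saliency):
    """Split a saliency map (values in [0, 1]) into three soft area maps."""
    foreground = [max(p - 0.5, 0.0) for p in saliency]
    background = [max(0.5 - p, 0.0) for p in saliency]
    uncertain = [0.5 - abs(p - 0.5) for p in saliency]  # peaks where p is near 0.5
    return foreground, background, uncertain


def weighted_mean(features, weights):
    """Aggregate per-pixel feature vectors into one representation vector."""
    total = sum(weights) or 1.0  # avoid division by zero for an empty area
    dim = len(features[0])
    return [sum(w * f[d] for f, w in zip(features, weights)) / total
            for d in range(dim)]


saliency = [0.9, 0.5, 0.1]                        # three pixels
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one 2-D feature per pixel
fg, bg, unc = area_maps(saliency)
fg_repr = weighted_mean(features, fg)
bg_repr = weighted_mean(features, bg)
unc_repr = weighted_mean(features, unc)
```

Each of the three representation vectors could then be compared with every pixel feature, for example by a dot product, to obtain the relation described in the abstract.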

Abstract:
Video question answering (Video-QA) is the task of answering a natural language question related to the content of a video. Existing methods generally explore the single interactions between objects or between frames, which are insufficient to deal with the sophisticated scenes in videos. To tackle this problem, we propose a novel model, termed Progressive Graph Attention Network (PGAT), which can jointly explore the multiple visual relations at the object level, frame level and clip level. Specifically, in the object-level relation encoding, we design two kinds of complementary graphs, one for learning the spatial and semantic relations between objects from the same frame, the other for modeling the temporal relations between the same object from different frames. The frame-level graph explores the interactions between diverse frames to record the fine-grained appearance change, while the clip-level graph models the temporal and semantic relations between various actions from clips. These different-level graphs are concatenated in a progressive manner to learn the visual relations from low-level to high-level. Furthermore, we identify for the first time that there are serious answer biases in TGIF-QA, a very large Video-QA dataset, and reconstruct a new dataset based on it to overcome the biases, called TGIF-QA-R. We evaluate the proposed model on three benchmark datasets and the new TGIF-QA-R, and the experimental results demonstrate that our model significantly outperforms other state-of-the-art models. Our code and dataset are available at https://github.com/PengLiang-cn/PGAT.

Abstract:
Over the past few years, several studies have been conducted on text-to-image synthesis techniques, which transfer input textual descriptions into realistic images. However, facial image synthesis and manipulation from input sentences have not been widely explored due to the lack of datasets. My research interests center around the development of multi-modality technology and facial image generation with Generative Adversarial Networks. Towards that end, we propose an approach for facial image generation and manipulation from text descriptions. We also introduce the first Text-to-Face synthesis dataset with large-scale facial attributes. In this extended abstract, we first present the existing condition and further direction of my Ph.D. research that I have followed during the first year. Then, we introduce the proposed method (accepted by IEEE FG2021), annotated novel dataset and experimental results. Finally, the future outlook on other challenges, proposed dataset and expected impact are discussed. Codes and paper lists studied in text-to-image synthesis are summarized on https://github.com/Yutong-Zhou-cv/Awesome-Text-to-Image.

Abstract:
We present MMFashion, a comprehensive, flexible and user-friendly open-source visual fashion analysis toolbox based on PyTorch. This toolbox supports a wide spectrum of fashion analysis tasks, including Fashion Attribute Prediction, Fashion Recognition and Retrieval, Fashion Landmark Detection, Fashion Parsing and Segmentation and Fashion Compatibility and Recommendation. It covers almost all the mainstream tasks in the fashion analysis community. MMFashion has several appealing properties. Firstly, MMFashion follows the principle of modular design. The framework is decomposed into different components so that it is easily extensible for diverse customized modules. In addition, detailed documentation, demo scripts and off-the-shelf models are available, which ease the burden on lay users of leveraging the recent advances in deep learning-based fashion analysis. Our proposed MMFashion is currently the most complete platform for visual fashion analysis in the deep learning era, with more functionalities to be added. This toolbox and the benchmark could serve the flourishing research community by providing a flexible toolkit to deploy existing models and develop new ideas and approaches. We welcome all contributions to this still-growing effort towards open science: https://github.com/open-mmlab/mmfashion.

Abstract:
Blind assessment of video quality is still challenging even in this deep learning era. The limited number of samples in existing databases is insufficient to learn a good feature extractor for video quality assessment (VQA), while manually labeling a larger database with subjective perception is very labor-intensive and time-consuming. To relieve such difficulty, we first collect 3589 high-quality video clips as the reference and build a large VQA dataset. The dataset contains more than 300K samples degraded by various distortion types due to compression and transmission errors, and provides weak labels for each distorted sample with several full-reference VQA algorithms. To learn effective representations from the weakly labeled data, we alleviate the bias of a single weak label (i.e., single knowledge) by learning from multiple heterogeneous sources of knowledge. To this end, we propose a novel no-reference VQA (NR-VQA) method with HEterogeneous Knowledge Ensemble (HEKE). Compared with learning from single knowledge, HEKE can theoretically reach a lower infimum, and learn richer representations due to the heterogeneity. Extensive experimental results show that the proposed HEKE outperforms existing NR-VQA methods, and achieves state-of-the-art performance. The source code will be available at https://github.com/Sissuire/BVQA-HEKE.

Abstract:
The key to video captioning is to leverage the cross-modal information from both vision and language perspectives. We propose to leverage semantic tags to bridge the gap between these modalities rather than directly concatenating or attending to the visual and linguistic features as in previous works. The semantic tags are the object tags and the action tags detected in videos, which can be viewed as partial captions for the input video. To effectively exploit the semantic tags, we design a Semantic Tag augmented XlanV (ST-XlanV) model which encodes 4 kinds of visual and semantic features with X-Linear Attention based cross-attention modules. Moreover, tag-related tasks are also designed in the pre-training stage to help the model exploit the cross-modal information more fruitfully. With the help of the semantic tags, the proposed model reaches 5th place in the pre-training for video captioning challenge. Our code will be available at: https://github.com/RubickH/ST-XlanV.

Abstract:
Video Visual Relation Detection (VidVRD) has received significant attention from our community over recent years. In this paper, we apply the state-of-the-art video object tracklet detection pipeline MEGA [7] and deepSORT [27] to generate tracklet proposals. Then we perform VidVRD in a tracklet-based manner without any pre-cutting operations. Specifically, we design a tracklet-based visual Transformer. It contains a temporal-aware decoder which performs feature interactions between the tracklets and learnable predicate query embeddings, and finally predicts the relations. Experimental results strongly demonstrate the superiority of our method, which outperforms other methods by a large margin on the Video Relation Understanding (VRU) Grand Challenge in ACM Multimedia 2021. Code is released at https://github.com/Dawn-LX/VidVRD-tracklets.

Abstract:
In recent years, assessing action quality from videos has attracted growing attention in the computer vision and human-computer interaction communities. Most existing approaches tackle this problem by directly migrating the model from action recognition tasks, which ignores the intrinsic differences within the feature map such as foreground and background information. To address this issue, we propose a Tube Self-Attention Network (TSA-Net) for action quality assessment (AQA). Specifically, we introduce a single object tracker into AQA and propose the Tube Self-Attention Module (TSA), which can efficiently generate rich spatio-temporal contextual information by adopting sparse feature interactions. The TSA module is embedded in existing video networks to form TSA-Net. Overall, our TSA-Net has the following merits: 1) high computational efficiency, 2) high flexibility, and 3) state-of-the-art performance. Extensive experiments are conducted on popular action quality assessment datasets including AQA-7 and MTL-AQA. Besides, a dataset named Fall Recognition in Figure Skating (FR-FS) is proposed to explore the basic action assessment in the figure skating scene. Our TSA-Net achieves Spearman's rank correlations of 0.8476 and 0.9393 on AQA-7 and MTL-AQA, respectively, which are the new state-of-the-art results. The results on FR-FS also verify the effectiveness of the TSA-Net. The code and FR-FS dataset are publicly available at https://github.com/Shunli-Wang/TSA-Net.
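
Since the results above are reported as Spearman's rank correlation, a small self-contained sketch of that metric may help. It computes the Pearson correlation of the rank-transformed scores, using average ranks for ties; the function names and the toy scores are illustrative and are not taken from the released code or datasets.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(predicted, target):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(predicted), ranks(target))


predicted_scores = [9.5, 7.0, 8.2, 6.1]   # hypothetical model outputs
judge_scores = [90, 60, 85, 70]           # hypothetical ground-truth scores
rho = spearman(predicted_scores, judge_scores)
```

Because only the ordering matters, the metric is insensitive to the scale of the predicted scores, which suits quality assessment where the judges' scale differs between action classes.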

Abstract:
Change captioning aims to describe the differences in image pairs with natural language. It is an interesting task under-explored with two main challenges: describing the relative position relationship between objects correctly and overcoming the disturbances from viewpoint changes. To address these issues, we propose a three-dimensional (3D) information aware Scene Graph based Change Captioning (SGCC) model. We extract the semantic attributes of objects and the 3D information of images (i.e., depths of objects, relative two-dimensional image plane distances, and relative angles between objects) to construct the scene graphs for image pairs, then aggregate the nodes representations with a graph convolutional network. Owing to the relative position relationships between objects and the scene graphs, our model thereby is capable of assisting observers to locate the changed objects quickly and being immune to the viewpoint change to some extent. Extensive experiments show that our SGCC model achieves competitive performance with the state-of-the-art models on the CLEVR-Change and Spot-the-Diff datasets, thus verifying the effectiveness of our proposed model. Codes are available at https://github.com/VISLANG-Lab/SGCC.

Abstract:
Vision and Language Navigation (VLN) requires an agent to navigate to a target location by following natural language instructions. Most existing works represent a navigation candidate by the feature of the single view in which the candidate lies. However, an instruction may mention landmarks outside the single view as references, which might lead to failures of textual-visual matching in existing methods. In this work, we propose a multi-module Neighbor-View Enhanced Model (NvEM) to adaptively incorporate visual contexts from neighbor views for better textual-visual matching. Specifically, our NvEM utilizes a subject module and a reference module to collect contexts from neighbor views. The subject module fuses neighbor views at a global level, and the reference module fuses neighbor objects at a local level. Subjects and references are adaptively determined via attention mechanisms. Our model also includes an action module to utilize the strong orientation guidance (e.g., "turn left") in instructions. Each module predicts the navigation action separately, and their weighted sum is used for predicting the final action. Extensive experimental results demonstrate the effectiveness of the proposed method on the R2R and R4R benchmarks against several state-of-the-art navigators, and NvEM even beats some pre-training ones. Our code is available at https://github.com/MarSaKi/NvEM.
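
The final step described above, where each module scores the navigation candidates and a weighted sum gives the final action, can be sketched as follows. This is only an illustration: the abstract does not say how the weights are produced, so the softmax over `weight_logits` and the names `softmax` and `fuse_actions` are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_actions(module_logits, weight_logits):
    """Combine per-module candidate scores with a weighted sum.

    module_logits: one list of candidate scores per module
                   (e.g. subject, reference, action).
    weight_logits: one unnormalized weight per module (assumed input).
    Returns the fused scores and the index of the chosen candidate.
    """
    weights = softmax(weight_logits)
    n = len(module_logits[0])
    fused = [
        sum(w * logits[k] for w, logits in zip(weights, module_logits))
        for k in range(n)
    ]
    return fused, max(range(n), key=lambda k: fused[k])
```

With equal weights, a candidate strongly preferred by one module can still win even if the other modules are indifferent to it.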

Abstract:
Conventional frame-based cameras for multimedia computing have encountered important challenges in high-speed and extreme light scenarios. However, how to design a novel paradigm for visual perception that overcomes the disadvantages of conventional cameras still remains an open issue. In this paper, we propose a novel solution, namely retinomorphic sensing, which integrates fovea-like and peripheral-like sampling mechanisms to generate asynchronous visual streams using a unified representation as the retina does. Technically, our encoder incorporates an interaction controller to switch flexibly between dynamic and static sensing. Then, the decoder effectively extracts dynamic events for machine vision and reconstructs visual textures for human vision. The results show that our strategy can sense dynamic events and visual textures while reducing data redundancy. We further build a prototype hybrid camera system to verify this strategy on vision tasks such as image reconstruction and object detection. We believe that this novel paradigm will provide insight into future multimedia computing. The code is available at https://github.com/acmmm2021-bni-retinomorphic/retinomorphic-sensing.

Abstract:
Zero-shot voice conversion (VC) trained by non-parallel data has gained a lot of attention in recent years. Previous methods usually extract speaker embeddings from audios and use them for converting the voices into different voice styles. Since there is a strong relationship between human faces and voices, a promising approach would be to synthesize various voice characteristics from face representation. Therefore, we introduce a novel idea of generating different voice styles from different human face photos, which can facilitate new applications, e.g., personalized voice assistants. However, the audio-visual relationship is implicit. Moreover, the existing VCs are trained on laboratory-collected datasets without speaker photos, while the datasets with both photos and audios are in-the-wild datasets. Directly replacing the target audio with the target photo and training on the in-the-wild dataset leads to noisy results. To address these issues, we propose a novel many-to-many voice conversion network, namely Face-based Voice Conversion (FaceVC), with a 3-stage training strategy. Quantitative and qualitative experiments on the LRS3-Ted dataset show that the proposed FaceVC successfully performs voice conversion according to the target face photos. Audio samples can be found on the demo website at https://facevc.github.io/.

Abstract:
Multi-scale features fusion plays a critical role in salient object detection. Most existing methods have achieved remarkable performance by exploiting various multi-scale features fusion strategies. However, an elegant fusion framework requires expert knowledge and experience, heavily relying on laborious trial and error. In this paper, we propose a multi-scale features fusion framework based on Neural Architecture Search (NAS), named Auto-MSFNet. First, we design a novel search cell, named FusionCell, to automatically decide multi-scale features aggregation. Rather than searching for one repeatable cell to be stacked, we allow different FusionCells to flexibly integrate multi-level features. Simultaneously, considering that features generated from CNNs are naturally spatial and channel-wise, we propose a new search space for efficiently focusing on the most relevant information. The search space mitigates incomplete object structures or over-predicted foreground regions caused by progressive fusion. Second, we propose a progressive polishing loss to further obtain exquisite boundaries by penalizing misalignment of salient object boundaries. Extensive experiments on five benchmark datasets demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance on four evaluation metrics. The code and results of our method are available at https://github.com/OIPLab-DUT/Auto-MSFNet.

Abstract:
Motion prediction is a classic problem in computer vision, which aims at forecasting future motion given the observed pose sequence. Various deep learning models have been proposed, achieving state-of-the-art performance on motion prediction. However, existing methods typically focus on modeling temporal dynamics in the pose space. Unfortunately, the complicated and high-dimensional nature of human motion brings inherent challenges for dynamic context capturing. Therefore, we move away from the conventional pose-based representation and present a novel approach employing a phase space trajectory representation of individual joints. Moreover, current methods tend to only consider the dependencies between physically connected joints. In this paper, we introduce a novel convolutional neural model to effectively leverage explicit prior knowledge of motion anatomy, and simultaneously capture both spatial and temporal information of joint trajectory dynamics. We then propose a global optimization module that learns the implicit relationships between individual joint features. Empirically, our method is evaluated on large-scale 3D human motion benchmark datasets (i.e., Human3.6M, CMU MoCap). The results demonstrate that our method sets the new state-of-the-art on the benchmark datasets. Our code is released at https://github.com/Pose-Group/TEID.

Abstract:
It is desirable to maintain both high accuracy and runtime efficiency in lane detection. State-of-the-art methods mainly address the efficiency problem by direct compression of high-dimensional features. These methods usually suffer from information loss and cannot achieve satisfactory accuracy performance. To ensure the diversity of features and subsequently maintain information as much as possible, we introduce multi-frequency analysis into lane detection. Specifically, we propose a multi-spectral feature compressor (MSFC) based on two-dimensional (2D) discrete cosine transform (DCT) to compress features while preserving diversity information. We group features and associate each group with an individual frequency component, which incurs only 1/7 overhead of one-dimensional convolution operation but preserves more information. Moreover, to further enhance the discriminability of features, we design a multi-spectral lane feature aggregator (MSFA) based on one-dimensional (1D) DCT to aggregate features from each lane according to their corresponding frequency components. The proposed method outperforms the state-of-the-art methods (including LaneATT and UFLD) on TuSimple, CULane, and LLAMAS benchmarks. For example, our method achieves 76.32% F1 at 237 FPS and 76.98% F1 at 164 FPS on CULane, which is 1.23% and 0.30% higher than LaneATT. Our code and models are available at https://github.com/harrylin-hyl/MSLD.
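
To illustrate the idea of grouping features and associating each group with one 2D DCT frequency component, the sketch below projects every channel onto the DCT-II basis function assigned to its group, yielding one coefficient per channel. It is a simplified stand-in under stated assumptions, not the MSFC implementation: the unnormalized basis, the one-scalar-per-channel output, and the names `dct_basis` and `compress` are all illustrative.

```python
import math


def dct_basis(u, v, h, w):
    """Unnormalized 2D DCT-II basis function of frequency (u, v) on an h x w grid."""
    return [
        [
            math.cos(math.pi * u * (i + 0.5) / h)
            * math.cos(math.pi * v * (j + 0.5) / w)
            for j in range(w)
        ]
        for i in range(h)
    ]


def compress(features, freqs):
    """Project each channel onto the DCT basis of its group's frequency.

    features: list of C channel maps, each an H x W nested list of floats.
    freqs: list of (u, v) frequency pairs, one per channel group;
           C must be divisible by len(freqs).
    Returns one DCT coefficient per channel.
    """
    c = len(features)
    assert c % len(freqs) == 0, "channels must split evenly into groups"
    h, w = len(features[0]), len(features[0][0])
    group_size = c // len(freqs)
    out = []
    for ch, fmap in enumerate(features):
        u, v = freqs[ch // group_size]
        basis = dct_basis(u, v, h, w)
        out.append(
            sum(fmap[i][j] * basis[i][j] for i in range(h) for j in range(w))
        )
    return out
```

The frequency (0, 0) reduces to a plain sum over the map (global pooling up to a constant), while higher frequencies respond to spatial variation that pooling alone would discard.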

Abstract:
In this paper, we introduce a new Multimodal Entity Linking (MEL) task on the multimodal data. The MEL task discovers entities in multiple modalities and various forms within large-scale multimodal data and maps multimodal mentions in a document to entities in a structured knowledge base such as Wikipedia. Different from the conventional Neural Entity Linking (NEL) task that focuses on textual information solely, MEL aims at achieving human-level disambiguation among entities in images, texts, and knowledge bases. Due to the lack of sufficient labeled data for the MEL task, we release a large-scale multimodal entity linking dataset M3EL (abbreviated for MultiModal Movie Entity Linking). Specifically, we collect reviews and images of 1,100 movies, extract textual and visual mentions, and label them with entities registered in Wikipedia. In addition, we construct a new baseline method to solve the MEL problem, which models the alignment of textual and visual mentions as a bipartite graph matching problem and solves it with an optimal-transportation-based linking method. Extensive experiments on the M3EL dataset verify the quality of the dataset and the effectiveness of the proposed method. We envision this work to be helpful for soliciting more research effort and applications regarding multimodal computing and inference in the future. We make the dataset and the baseline algorithm publicly available at https://jingrug.github.io/research/M3EL.

Abstract:
To address the point cloud quality assessment (PCQA) problem, GraphSIM was proposed via jointly considering geometrical and color features, which shows compelling performance in multiple distortion detection. However, GraphSIM does not take into account the multiscale characteristics of human perception. In this paper, we propose a multiscale PCQA model, called Multiscale Graph Similarity (MS-GraphSIM), that can better predict human subjective perception. First, exploring the multiscale processing method used in image processing, we introduce a multiscale representation of point clouds based on graph signal processing. Second, we extend GraphSIM into a multiscale version based on the proposed multiscale representation. Specifically, MS-GraphSIM constructs a multiscale representation for each local patch extracted from the reference point cloud or the distorted point cloud, and then fuses GraphSIM at different scales to obtain an overall quality score. Experimental results demonstrate that the proposed MS-GraphSIM outperforms the state-of-the-art PCQA metrics over two fairly large and independent databases. Ablation studies further show that the proposed MS-GraphSIM is robust to different model hyperparameter settings. The code is available at https://github.com/zyj1318053/MS_GraphSIM.

Abstract:
Multiview detection incorporates multiple camera views to deal with occlusions, and its central problem is multiview aggregation. Given feature map projections from multiple views onto a common ground plane, the state-of-the-art method addresses this problem via convolution, which applies the same calculation regardless of object locations. However, such translation-invariant behaviors might not be the best choice, as object features undergo various projection distortions according to their positions and cameras. In this paper, we propose a novel multiview detector, MVDeTr, that adopts a newly introduced shadow transformer to aggregate multiview information. Unlike convolutions, shadow transformer attends differently at different positions and cameras to deal with various shadow-like distortions. We propose an effective training scheme that includes a new view-coherent data augmentation method, which applies random augmentations while maintaining multiview consistency. On two multiview detection benchmarks, we report new state-of-the-art accuracy with the proposed system. Code is available at https://github.com/hou-yz/MVDeTr.

Abstract:
Detection transformers have recently shown promising object detection results and attracted increasing attention. However, how to develop effective domain adaptation techniques to improve its cross-domain performance remains unexplored and unclear. In this paper, we delve into this topic and empirically find that direct feature distribution alignment on the CNN backbone only brings limited improvements, as it does not guarantee domain-invariant sequence features in the transformer for prediction. To address this issue, we propose a novel Sequence Feature Alignment (SFA) method that is specially designed for the adaptation of detection transformers. Technically, SFA consists of a domain query-based feature alignment (DQFA) module and a token-wise feature alignment (TDA) module. In DQFA, a novel domain query is used to aggregate and align global context from the token sequence of both domains. DQFA reduces the domain discrepancy in global feature representations and object relations when deploying in the transformer encoder and decoder, respectively. Meanwhile, TDA aligns token features in the sequence from both domains, which reduces the domain gaps in local and instance-level feature representations in the transformer encoder and decoder, respectively. Besides, a novel bipartite matching consistency loss is proposed to enhance the feature discriminability for robust object detection. Experiments on three challenging benchmarks show that SFA outperforms state-of-the-art domain adaptive object detection methods. Code has been made available at: https://github.com/encounter1997/SFA.

Abstract:
Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. Thus, their video-level prediction does not consider spatio-temporal features of how the video evolves along the temporal dimension. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture relationships among snippets. To be more specific, we attempt to generate a dynamic kernel for a convolutional operation to aggregate long-range temporal information among adjacent snippets adaptively. The DSA module is an efficient plug-and-play module and can be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal overhead. The final video architecture is coined DSANet. We conduct extensive experiments on several video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400, Something-Something V1 and ActivityNet) to show its superiority. Our proposed DSA module is shown to benefit various video recognition models significantly. For example, equipped with DSA modules, the top-1 accuracy of I3D ResNet-50 is improved from 74.9% to 78.2% on Kinetics-400. Codes are available at https://github.com/whwu95/DSANet.

Abstract:
We tackle the problem of object completion from point clouds and propose a novel point cloud completion network employing an Asymmetrical Siamese Feature Matching strategy, termed as ASFM-Net. Specifically, the Siamese auto-encoder neural network is adopted to map the partial and complete input point cloud into a shared latent space, which can capture detailed shape prior. Then we design an iterative refinement unit to generate complete shapes with fine-grained details by integrating prior information. Experiments are conducted on the PCN dataset and the Completion3D benchmark, demonstrating the state-of-the-art performance of the proposed ASFM-Net. Our method achieves the 1st place in the leaderboard of Completion3D and outperforms existing methods with a large margin, about 12%. The codes and trained models are released publicly at https://github.com/Yan-Xia/ASFM-Net.

Abstract:
The popularity and promotion of depth maps have brought new vigor and vitality into salient object detection (SOD), and a mass of RGB-D SOD algorithms have been proposed, mainly concentrating on how to better integrate cross-modality features from the RGB image and depth map. For the cross-modality interaction in the feature encoder, existing methods either indiscriminately treat RGB and depth modalities, or only habitually utilize depth cues as auxiliary information of the RGB branch. Different from them, we reconsider the status of the two modalities and propose a novel Cross-modality Discrepant Interaction Network (CDINet) for RGB-D SOD, which differentially models the dependence of the two modalities according to the feature representations of different layers. To this end, two components are designed to implement the effective cross-modality interaction: 1) the RGB-induced Detail Enhancement (RDE) module leverages the RGB modality to enhance the details of the depth features in the low-level encoder stage; 2) the Depth-induced Semantic Enhancement (DSE) module transfers the object positioning and internal consistency of depth features to the RGB branch in the high-level encoder stage. Furthermore, we also design a Dense Decoding Reconstruction (DDR) structure, which constructs a semantic block by combining multi-level encoder features to upgrade the skip connection in the feature decoding. Extensive experiments on five benchmark datasets demonstrate that our network outperforms 15 state-of-the-art methods both quantitatively and qualitatively. Our code is publicly available at: https://rmcong.github.io/proj_CDINet.html.

Abstract:
Low-light image enhancement (LLIE) is a pervasive yet challenging problem, since: 1) low-light measurements may vary due to different imaging conditions in practice; 2) images can be enlightened subjectively according to the diverse preferences of each individual. To tackle these two challenges, this paper presents a novel deep reinforcement learning based method, dubbed ReLLIE, for customized low-light enhancement. ReLLIE models LLIE as a Markov decision process, i.e., estimating the pixel-wise image-specific curves sequentially and recurrently. Given the reward computed from a set of carefully crafted non-reference loss functions, a lightweight network is proposed to estimate the curves for enlightening a low-light image input. As ReLLIE learns a policy instead of a one-to-one image translation, it can handle various low-light measurements and provide customized enhanced outputs by flexibly applying the policy different numbers of times. Furthermore, ReLLIE can easily enhance real-world images with hybrid corruptions, e.g., noise, by using a plug-and-play denoiser. Extensive experiments on various benchmarks demonstrate the advantages of ReLLIE compared to the state-of-the-art methods. Code is available at https://github.com/GuoLanqing/ReLLIE.
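
The effect of applying pixel-wise curves a variable number of times can be illustrated with a small sketch. The abstract does not specify the curve, so the quadratic form x + a·x·(1 − x) used here is an assumption; the per-pixel parameters are also held fixed across steps for simplicity, whereas a learned policy would re-estimate them at each step. The names `apply_curve` and `enhance` are illustrative.

```python
def apply_curve(pixel, alpha):
    """One enhancement step on an intensity in [0, 1]: x + alpha * x * (1 - x).

    The quadratic curve is an assumed example, not taken from the abstract.
    """
    return pixel + alpha * pixel * (1.0 - pixel)


def enhance(image, alphas, steps):
    """Apply the per-pixel curve `steps` times.

    image, alphas: H x W nested lists of floats (intensities and curve
    parameters). `alphas` stays fixed here; a policy network would predict
    new parameters at every step.
    """
    out = [row[:] for row in image]
    for _ in range(steps):
        out = [
            [apply_curve(x, a) for x, a in zip(row, arow)]
            for row, arow in zip(out, alphas)
        ]
    return out
```

Because each step brightens the result of the previous one, the number of steps acts as a user-facing control over the final brightness.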

Abstract:
While widely adopted in practical applications, face recognition has been disputed on the malicious use of face images and potential privacy issues. Online photo sharing services accidentally act as the main approach for the malicious crawlers to exploit face recognition to access portrait privacy. In this demo, we propose an adversarial privacy-preserving filter, which can preserve face image from malicious face recognition algorithms. This filter is generated by an end-cloud collaborated adversarial attack framework consisting of three modules: (1) Image-specific gradient generation module, to extract image-specific gradient in the user end; (2) Adversarial gradient transfer module, to fine-tune the image-specific gradient in the server; and (3) Universal adversarial perturbation enhancement module, to append image-independent perturbation to derive the final adversarial perturbation. A short video about our system is available at https://github.com/Anonymity-for-submission/3247.

Abstract:
Unsupervised domain adaptation (UDA) for semantic segmentation aims to adapt a segmentation model trained on the labeled source domain to the unlabeled target domain. Existing methods try to learn domain invariant features while suffering from large domain gaps that make it difficult to correctly align discrepant features, especially in the initial training phase. To address this issue, we propose a novel Dual Soft-Paste (DSP) method in this paper. Specifically, DSP selects some classes from a source domain image using a long-tail class first sampling strategy and softly pastes the corresponding image patch on both the source and target training images with a fusion weight. Technically, we adopt the mean teacher framework for domain adaptation, where the pasted source and target images go through the student network while the original target image goes through the teacher network. Output-level alignment is carried out by aligning the probability maps of the target fused image from both networks using a weighted cross-entropy loss. In addition, feature-level alignment is carried out by aligning the feature maps of the source and target images from student network using a weighted maximum mean discrepancy loss. DSP facilitates the model learning domain-invariant features from the intermediate domains, leading to faster convergence and better performance. Experiments on two challenging benchmarks demonstrate the superiority of DSP over state-of-the-art methods. Code is available at https://github.com/GaoLii/DSP.
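
The soft-paste operation described above can be sketched as a masked blend with a fusion weight. This is a minimal illustration on single-channel images: the exact blending rule is an assumption (the abstract only states that patches are pasted "softly" with a fusion weight), and the name `soft_paste` is hypothetical.

```python
def soft_paste(src, dst, mask, weight):
    """Softly paste masked source pixels onto a destination image.

    src, dst: H x W nested lists of floats (single-channel images).
    mask: H x W nested list of 0/1 values marking the selected source classes.
    weight: fusion weight in [0, 1]; 1.0 gives a hard paste.
    Assumed blend: weight * mask * src + (1 - weight * mask) * dst.
    """
    return [
        [weight * m * s + (1.0 - weight * m) * d for s, d, m in zip(srow, drow, mrow)]
        for srow, drow, mrow in zip(src, dst, mask)
    ]
```

Calling the function once with another source image as `dst` and once with a target image as `dst` corresponds to the "dual" pasting onto both domains.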

Abstract:
Model pruning aims to reduce the deep neural network (DNN) model size or computational overhead. Traditional model pruning methods, such as l-1 pruning, which evaluates the channel significance for a DNN, pay too much attention to the local analysis of each channel and make use of the magnitude of the entire feature while ignoring its relevance to the batch normalization (BN) and ReLU layers after each convolutional operation. To overcome these problems, we propose a new model pruning method from a new perspective of gradient flow in this paper. Specifically, we first theoretically analyze the channel's influence based on Taylor expansion by integrating the effects of the BN layer and the ReLU activation function. Then, the incorporation of the first-order Taylor polynomial of the scaling parameter and the shifting parameter in the BN layer is suggested to effectively indicate the significance of a channel in a DNN. Comprehensive experiments on both image classification and image denoising tasks demonstrate the superiority of the proposed novel theory and scheme. Code is available at https://github.com/CityU-AIM-Group/GFBS.
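
A first-order Taylor criterion over the BN scaling and shifting parameters can be sketched as below. The score |γ·∂L/∂γ + β·∂L/∂β| is one plausible reading of the abstract, stated here as an assumption rather than the paper's exact formula; gradients are passed in as plain lists, and the function names are illustrative.

```python
def bn_channel_scores(gammas, betas, grad_gammas, grad_betas):
    """Per-channel importance from a first-order Taylor expansion.

    Assumed criterion: |gamma * dL/dgamma + beta * dL/dbeta|, i.e. the
    estimated change in loss if both BN parameters of a channel were zeroed.
    """
    return [
        abs(g * dg + b * db)
        for g, b, dg, db in zip(gammas, betas, grad_gammas, grad_betas)
    ]


def channels_to_prune(scores, num_prune):
    """Indices of the `num_prune` least important channels, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(ranked[:num_prune])
```

Unlike a pure magnitude criterion, a channel with a large γ can still receive a low score when the loss is insensitive to it.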

Abstract:
CNN-based image inpainting methods have achieved promising performance because of their outstanding semantic understanding and reasoning capabilities. However, previous works could not obtain satisfactory results in some situations because the available information is not fully explored. In this paper, we propose a new method by combining three innovative ideas. First, to increase the diversity of the semantic information obtained by the network in image synthesis, we propose a multiple hidden space perceptual (MHSP) loss, which extracts high-level features from multiple pre-trained autoencoders. Second, we adopt an adaptive iterative reasoning (AIR) strategy to reduce the calculations under small-hole circumstances while ensuring the performance in large-hole circumstances. Third, we find that color inconsistencies occasionally occur in the final image merging process, so we add a novel interval maximum saturation (IMS) loss to the final loss function. Experiments on the benchmark datasets show that our method performs favorably against state-of-the-art approaches. Code is made publicly available at: https://github.com/IC-LAB/adaptive_iterative_inpainting.

Abstract:
Text-guided image inpainting aims to complete the corrupted patches coherently with both visual and textual context. On one hand, existing works focus on surrounding pixels of the corrupted patches without considering the objects in the image, resulting in the characteristics of objects described in text being painted on non-object regions. On the other hand, the redundant information in text may distract the generation of objects of interest in the restored image. In this paper, we propose an adversarial learning framework with mask reconstruction (ALMR) for image inpainting with textual guidance, which consists of a two-stage generator and dual discriminators. The two-stage generator aims to restore coarse-grained and fine-grained images, respectively. In particular, we devise a dual-attention module (DAM) to incorporate the word-level and sentence-level textual features as guidance on generating the coarse-grained and fine-grained details in the two stages. Furthermore, we design a mask reconstruction module (MRM) to penalize the restoration of the objects of interest with the given textual descriptions about the objects. For adversarial training, we exploit global and local discriminators for the whole image and corrupted patches, respectively. Extensive experiments conducted on CUB-200-2011, Oxford-102 and CelebA-HQ demonstrate the superior performance of the proposed ALMR (e.g., FID value is reduced from 29.69 to 14.69 compared with the state-of-the-art approach on CUB-200-2011). Codes are available at https://github.com/GaranWu/ALMR.

Abstract:
There is a growing demand for online fitness due to the impact of the epidemic. This paper presents a real-time online fitness system framework called AICoacher, which offers different online coaches. The framework constructs an extensible AI-based architecture that supports a variety of fitness movements. Firstly, key frames of motion are extracted automatically, and the feature vectors are calculated with the body pose points. Secondly, the state transition matrix can effectively identify fitness actions and capture their time-continuous characteristics. Finally, AICoacher can accurately provide the number of repetitions and correction tips of fitness movements. Currently, the AICoacher has a number of fitness courses supported by online coaches and has been tested on hundreds of fitness movements. The code can be downloaded from https://github.com/liutiel/AICoacher.

Abstract:
With the rise and development of deep learning over the past decade, there has been a steady momentum of innovation and breakthroughs that convincingly push the state-of-the-art of cross-modal analytics between vision and language in the multimedia field. Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion. In this work, we propose X-modaler, a versatile and high-performance codebase that encapsulates the state-of-the-art cross-modal analytics into several general-purpose stages (e.g., pre-processing, encoder, cross-modal interaction, decoder, and decode strategy). Each stage is empowered with the functionality that covers a series of modules widely adopted in state-of-the-art methods and allows seamless switching in between. This design naturally enables a flexible implementation of state-of-the-art algorithms for image captioning, video captioning, and vision-language pre-training, aiming to facilitate the rapid development of the research community. Meanwhile, since the effective modular designs in several stages (e.g., cross-modal interaction) are shared across different vision-language tasks, X-modaler can be simply extended to power startup prototypes for other tasks in cross-modal analytics, including visual question answering, visual commonsense reasoning, and cross-modal retrieval. X-modaler is an Apache-licensed codebase, and its source codes, sample projects and pre-trained models are available online: https://github.com/YehLi/xmodaler.

Abstract:
We propose a new efficient framework, the Unified Context Network (UniCon), for robust active speaker detection (ASD). Traditional methods for ASD usually operate on each candidate's pre-cropped face track separately and do not sufficiently consider the relationships among the candidates. This potentially limits performance, especially in challenging scenarios with low-resolution faces, multiple candidates, etc. Our solution is a novel, unified framework that focuses on jointly modeling multiple types of contextual information: spatial context to indicate the position and scale of each candidate's face, relational context to capture the visual relationships among the candidates and contrast audio-visual affinities with each other, and temporal context to aggregate long-term information and smooth out local uncertainties. Based on such information, our model optimizes all candidates in a unified process for robust and reliable ASD. A thorough ablation study is performed on several challenging ASD benchmarks under different settings. In particular, our method outperforms the state-of-the-art by a large margin of about 15% mean Average Precision (mAP) absolute on two challenging subsets: one with three candidate speakers, and the other with faces smaller than 64 pixels. Together, our UniCon achieves 92.0% mAP on the AVA-ActiveSpeaker validation set, surpassing 90% for the first time on this challenging dataset at the time of submission. Project website: https://unicon-asd.github.io/.

Abstract:
Multi-view clustering (MVC) has been extensively studied to collect multiple source information in recent years. One typical type of MVC method is based on matrix factorization to effectively perform dimension reduction and clustering. However, the existing approaches can be further improved with the following considerations: i) The current one-layer matrix factorization framework cannot fully exploit the useful data representations. ii) Most algorithms only focus on the shared information while ignoring the view-specific structure, leading to suboptimal solutions. iii) The partition level information has not been utilized in existing work. To solve the above issues, we propose a novel multi-view clustering algorithm via deep matrix decomposition and partition alignment. To be specific, the partition representations of each view are obtained through deep matrix decomposition, and then are jointly utilized with the optimal partition representation for fusing multi-view information. Finally, an alternating optimization algorithm is developed to solve the optimization problem with proven convergence. The comprehensive experimental results conducted on six benchmark multi-view datasets clearly demonstrate the effectiveness of the proposed algorithm against the SOTA methods. The code is available at https://github.com/ZCtalk/MVC-DMF-PA.

Abstract:
For a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video. With efficient spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame, but to directly capture inter-frame dependencies as well. Capturing the position changes of humans and objects over the spatio-temporal dimension is more critical when significant changes in the appearance features may not occur over time. When utilizing appearance features, the spatial location and the semantic information are also key to improving the video-based Human-Object Interaction recognition performance. In this paper, Spatio-Temporal Interaction Graph Parsing Networks (STIGPN) are constructed, which encode the videos with a graph composed of human and object nodes. These nodes are connected by two types of relations: (i) intra-frame relations, modeling the interactions between humans and the interacted objects within each frame; (ii) inter-frame relations, capturing the long-range dependencies between humans and the interacted objects across frames. With the graph, STIGPN learns spatio-temporal features directly from the whole video-based Human-Object Interaction scenes. Multi-modal features and a multi-stream fusion strategy are used to enhance the reasoning capability of STIGPN. Two Human-Object Interaction video datasets, including CAD-120 and Something-Else, are used to evaluate the proposed architectures, and the state-of-the-art performance demonstrates the superiority of STIGPN. Code for STIGPN is available at https://github.com/GuangmingZhu/STIGPN.

Abstract:
A recent study [4] finds that existing few-shot learning methods, trained on the source domain, fail to generalize to the novel target domain when a domain gap is observed. This motivates the task of Cross-Domain Few-Shot Learning (CD-FSL). In this paper, we realize that the labeled target data in CD-FSL has not been leveraged in any way to help the learning process. Thus, we advocate utilizing few labeled target data to guide the model learning. Technically, a novel meta-FDMixup network is proposed. We tackle this problem mainly from two aspects. Firstly, to utilize the source and the newly introduced target data of two different class sets, a mixup module is re-proposed and integrated into the meta-learning mechanism. Secondly, a novel disentangle module together with a domain classifier is proposed to extract the disentangled domain-irrelevant and domain-specific features. These two modules together enable our model to narrow the domain gap thus generalizing well to the target datasets. Additionally, a detailed feasibility and pilot study is conducted to reflect the intuitive understanding of CD-FSL under our new setting. Experimental results show the effectiveness of our new setting and the proposed method. Codes and models are available at https://github.com/lovelyqian/Meta-FDMixup.
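The mixup idea mentioned in the abstract above can be illustrated with a minimal sketch. It shows only the generic mixup interpolation between one source sample and one target sample. The function name `mixup`, the fixed mixing ratio `lam`, and the toy vectors are assumptions for illustration, and the sketch does not reproduce how meta-FDMixup integrates mixup into meta-learning.

```python
def mixup(source, target, lam):
    """Return the convex combination lam * source + (1 - lam) * target."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    return [lam * s + (1.0 - lam) * t for s, t in zip(source, target)]


# Toy flattened "images" standing in for a source-domain and a target-domain sample.
source_image = [0.0, 0.5, 1.0]
target_image = [1.0, 0.5, 0.0]

# lam = 0.75 keeps 75% of the source sample and 25% of the target sample.
mixed = mixup(source_image, target_image, lam=0.75)
```

In practice the mixing ratio is commonly drawn at random for each batch rather than fixed; the fixed value here only keeps the example deterministic.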

Abstract:
Neural network pruning has shown promising performance in reducing computational complexity and facilitating the deployment of deep neural networks on resource-limited edge devices. Most existing pruning methods focus on the indicators of the filter's weight, gradient, or feature map and regard the weak or similar filters as network redundancy. In contrast, the representation of discriminative power is also a fundamental attribute that enables neural networks to have extraordinary performance in various tasks. However, such representation is neglected in existing works. We therefore propose a novel filter pruning strategy via class-wise discriminative power (CDP). Unlike the previous methods, CDP treats the filters that always yield large or small activation values as redundant and reserves the filters that show different magnitudes in activations, as they yield high discriminative power. We further propose to obtain such discriminative power by employing the widely-used Term Frequency-Inverse Document Frequency (TF-IDF) on feature representations across classes. Specifically, the output of a filter is considered as a word, and the whole feature map is considered as a document. Then, TF-IDF is used to generate the relevance score between words and all documents. A filter with low TF-IDF scores is less discriminative and can be pruned, while the filters with high TF-IDF scores are reserved. To the best of our knowledge, this is the first work that prunes neural networks through class-wise discriminative power and measures such power by introducing TF-IDF in feature representation among different classes. Without any iterative process, CDP achieves better compression trade-offs compared to the state-of-the-art compression algorithms. For instance, in VGG-16, we achieve a 68.05% FLOPs reduction with a 94.86% Top-1 accuracy on CIFAR-10. Moreover, a VGG-16 compressed with a 90.12% FLOPs reduction still retains 93.30% Top-1 accuracy on CIFAR-10.
The code is available at https://github.com/Tianshuo-Xu/CDP-Towards-Optimal-Filter-Pruning-via-Class-wise-Discriminative-Power.git.
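The word/document analogy in the abstract above can be made concrete with a small sketch. It is not the CDP implementation: the smoothed IDF formula, the `active_threshold` parameter, the averaging over documents, and the toy activation table are all assumptions chosen to show why filters that always fire strongly, or never fire, receive low scores.

```python
import math


def tfidf_filter_scores(activations, active_threshold):
    """Score each filter with a TF-IDF-style statistic.

    activations[d][j] is the (non-negative) mean activation of filter j
    on document d. Each filter plays the role of a word.
    """
    num_docs = len(activations)
    num_filters = len(activations[0])

    # Document frequency: in how many documents is filter j "present"?
    doc_freq = [
        sum(1 for doc in activations if doc[j] > active_threshold)
        for j in range(num_filters)
    ]

    scores = []
    for j in range(num_filters):
        # Smoothed IDF (an assumption): zero when the filter fires everywhere.
        idf = math.log((1 + num_docs) / (1 + doc_freq[j]))
        tf_sum = 0.0
        for doc in activations:
            total = sum(doc)
            # Term frequency: share of the document's activation mass.
            tf_sum += doc[j] / total if total > 0 else 0.0
        scores.append((tf_sum / num_docs) * idf)
    return scores


# Hypothetical activations: rows are documents, columns are filters.
# Filter 0 is always high, filter 1 is always low, filter 2 varies.
toy_activations = [
    [0.9, 0.01, 0.90],
    [0.9, 0.01, 0.05],
    [0.9, 0.01, 0.05],
]
scores = tfidf_filter_scores(toy_activations, active_threshold=0.5)

# Under this scoring, the lowest-scoring filters are pruning candidates.
prune_order = sorted(range(len(scores)), key=lambda j: scores[j])
```

With these toy values the varying filter receives the highest score, while the always-high filter scores zero because its IDF term vanishes and the always-low filter scores close to zero because its term frequency is tiny.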

Abstract:
Generative adversarial networks (GANs) have been extensively used for training networks that perform image generation. After training, the discriminator in a GAN is typically no longer used. We propose to recycle the trained discriminator for another use: no-reference image quality assessment (NR-IQA). We are motivated by two facts. First, in Wasserstein GAN (WGAN), the discriminator is designed to calculate the distance between the distribution of generated images and that of real images; thus, the trained discriminator may encode the distribution of real-world images. Second, NR-IQA often needs to leverage the distribution of real-world images for assessing image quality. We then conjecture that using the trained discriminator for NR-IQA may remove the need for any human-labeled quality opinion scores and lead to a new opinion-unaware (OU) method. To validate our conjecture, we start from a restricted NR-IQA problem, that is, IQA for artificially super-resolved images. We train a super-resolution (SR) WGAN with two kinds of discriminators: one directly evaluates the entire image, and the other works on small patches. For the latter kind, we obtain patch-wise quality scores, and then have the flexibility to fuse the scores, e.g., by weighted average. Moreover, we directly extend the trained discriminators to authentically distorted images that have different kinds of distortions. Our experimental results demonstrate that the proposed method is comparable to the state-of-the-art OU NR-IQA methods on SR images and is even better than them on authentically distorted images. Our method provides a more interpretable approach to NR-IQA. Our code and models are available at https://github.com/YunanZhu/RecycleD.
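The weighted-average fusion of patch-wise scores mentioned in the abstract above can be sketched as follows. The function name `fuse_patch_scores`, the example scores, and the choice of weights are hypothetical; the abstract does not specify how the weights are obtained.

```python
def fuse_patch_scores(scores, weights):
    """Fuse patch-wise quality scores into one image-level score
    by a weighted average."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Hypothetical discriminator outputs for three patches of one image,
# with the middle patch weighted twice as heavily as the others.
patch_scores = [0.2, 0.8, 0.5]
patch_weights = [1.0, 2.0, 1.0]
image_score = fuse_patch_scores(patch_scores, patch_weights)
```

Setting all weights equal reduces this to a plain mean over patches.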

Abstract:
Image inpainting aims to restore the missing regions of corrupted images and make the recovery result identical to the originally complete image, which is different from the common generative task emphasizing the naturalness or realism of generated images. Nevertheless, existing works usually regard it as a pure generation problem and employ cutting-edge deep generative techniques to address it. The generative networks can fill the main missing parts with realistic contents but usually distort the local structures or introduce obvious artifacts. In this paper, for the first time, we formulate image inpainting as a mix of two problems, i.e., predictive filtering and deep generation. Predictive filtering is good at preserving local structures and removing artifacts but falls short in completing large missing regions. The deep generative network can fill the numerous missing pixels based on the understanding of the whole scene but hardly restores the details identical to the original ones. To make use of their respective advantages, we propose the joint predictive filtering and generative network (JPGNet) that contains three branches: predictive filtering & uncertainty network (PFUNet), deep generative network, and uncertainty-aware fusion network (UAFNet). The PFUNet can adaptively predict pixel-wise kernels for filtering-based inpainting according to the input image and output an uncertainty map. This map indicates whether each pixel should be processed by the filtering or the generative network, and is further fed to the UAFNet for a smart combination of the filtering and generative results. Note that our method, as a novel framework for the image inpainting problem, can benefit any existing generation-based method. We validate our method on three public datasets, i.e., Dunhuang, Places2, and CelebA, and demonstrate that our method can enhance three state-of-the-art generative methods (i.e., StructFlow, EdgeConnect, and RFRNet) significantly with only slight extra time costs.
We have released the code at https://github.com/tsingqguo/jpgnet.

Abstract:
RGB-D salient object detection (SOD) has recently attracted increasing research interest by benefiting conventional RGB SOD with extra depth information. However, existing RGB-D SOD models often fail to perform well in terms of both efficiency and accuracy, which hinders their potential applications on mobile devices and real-world problems. An underlying challenge is that the model accuracy usually degrades when the model is simplified to have few parameters. To tackle this dilemma, and also inspired by the fact that depth quality is a key factor influencing the accuracy, we propose a novel depth quality-inspired feature manipulation (DQFM) process, which is efficient itself and can serve as a gating mechanism for filtering depth features to greatly boost the accuracy. DQFM resorts to the alignment of low-level RGB and depth features, as well as holistic attention of the depth stream, to explicitly control and enhance cross-modal fusion. We embed DQFM to obtain an efficient light-weight model called DFM-Net, where we also design a tailored depth backbone and a two-stage decoder for further efficiency. Extensive experimental results demonstrate that our DFM-Net achieves state-of-the-art accuracy compared to existing non-efficient models, and meanwhile runs at 140 ms on CPU (2.2x faster than the prior fastest efficient model) with only ~8.5Mb model size (14.9% of the prior lightest). Our code will be available at https://github.com/zwbx/DFM-Net.

Abstract:
We present MMOCR---an open-source toolbox which provides a comprehensive pipeline for text detection and recognition, as well as their downstream tasks such as named entity recognition and key information extraction. MMOCR implements 14 state-of-the-art algorithms, which is significantly more than all the existing open-source OCR projects we are aware of to date. To facilitate future research and industrial applications of text recognition-related problems, we also provide a large number of trained models and detailed benchmarks to give insights into the performance of text detection, recognition and understanding. MMOCR is publicly released at https://github.com/open-mmlab/mmocr.

Abstract:
Machine learning approaches for more efficient video compression have been developed thanks to breakthroughs in deep learning. However, they typically bring coding improvements at the cost of significant increases in computational complexity, making them largely unsuitable for practical applications. In this paper, we present open-source software for convolutional neural network-based solutions which improve the interpolation of reference samples needed for fractional precision motion compensation. Contrary to previous efforts, the networks are fully linear, allowing them to be interpreted, with a full interpolation filter set derived from trained models, making it simple to integrate in conventional video coding schemes. When implemented in the context of the state-of-the-art Versatile Video Coding (VVC) test model, the complexity of the learned interpolation schemes is significantly reduced compared to the interpolation with full neural networks, while achieving notable coding efficiency improvements on lower resolution video sequences. The open-source software package is available at https://github.com/bbc/cnn-fractional-motion-compensation under the 3-clause BSD license.

Abstract:
Efficient and light-weight super resolution (SR) is highly demanded in practical applications. However, most of the existing studies focusing on reducing the number of model parameters and FLOPs may not necessarily lead to faster running speed on mobile devices. In this work, we propose a re-parameterizable building block, namely Edge-oriented Convolution Block (ECB), for efficient SR design. In the training stage, the ECB extracts features in multiple paths, including a normal 3 x 3 convolution, a channel expanding-and-squeezing convolution, and 1st-order and 2nd-order spatial derivatives from intermediate features. In the inference stage, the multiple operations can be merged into one single 3 x 3 convolution. ECB can be regarded as a drop-in replacement to improve the performance of normal 3 x 3 convolution without introducing any additional cost in the inference stage. We then propose an extremely efficient SR network for mobile devices based on ECB, namely ECBSR. Extensive experiments across five benchmark datasets demonstrate the effectiveness and efficiency of ECB and ECBSR. Our ECBSR achieves comparable PSNR/SSIM performance to state-of-the-art light-weight SR models, while it can super resolve images from 270p/540p to 1080p in real-time on commodity mobile devices, e.g., Snapdragon 865 SOC and Dimensity 1000+ SOC. The source code can be found at https://github.com/xindongzhang/ECBSR.
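The merging step described in the abstract above relies on the linearity of convolution: parallel linear branches applied to the same input can be folded into a single kernel by summing their kernels. The sketch below demonstrates only that principle for a single channel. The helper names, the toy image, the "learned" kernel values, and the use of fixed Sobel and Laplacian kernels as derivative branches are assumptions; the sequential expanding-and-squeezing branch and any learned scaling in ECB are not covered.

```python
def conv2d_valid(image, kernel):
    """Single-channel 'valid' 2D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def add_kernels(*kernels):
    """Element-wise sum of equally sized kernels."""
    rows, cols = len(kernels[0]), len(kernels[0][0])
    return [[sum(k[a][b] for k in kernels) for b in range(cols)]
            for a in range(rows)]


def add_maps(*maps):
    """Element-wise sum of equally sized feature maps."""
    return add_kernels(*maps)


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]

learned = [[1, 0, -1], [2, 1, 0], [0, -2, 1]]     # hypothetical learned 3 x 3 kernel
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]    # 1st-order derivative branch
laplacian = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]    # 2nd-order derivative branch

# Training-time view: three parallel branches whose outputs are summed.
branch_sum = add_maps(conv2d_valid(image, learned),
                      conv2d_valid(image, sobel_x),
                      conv2d_valid(image, laplacian))

# Inference-time view: one merged 3 x 3 kernel applied once.
merged_kernel = add_kernels(learned, sobel_x, laplacian)
merged_output = conv2d_valid(image, merged_kernel)
```

Because every branch is linear, `branch_sum` and `merged_output` are identical, which is why the extra branches add no cost at inference time.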

Abstract:
Existing video copy detection methods generally measure video similarity based on spatial similarities between key frames, neglecting the latent similarity in temporal dimension, so that the video similarity is biased towards spatial information. There are methods modeling unified video similarity in an end-to-end way, but losing detailed partial alignment information, which causes the incapability of copy segments localization. To address the above issues, we propose the Video Similarity and Alignment Learning (VSAL) approach, which jointly models spatial similarity, temporal similarity and partial alignment. To mitigate the spatial similarity bias, we model the temporal similarity as the mask map predicted from frame-level spatial similarity, where each element indicates the probability of frame pair lying right on the partial alignments. To further localize partial copies, the step map is learned from the spatial similarity where the elements indicate extending directions of the current partial alignments on the spatial-temporal similarity map. Obtained from the mask map, the start points extend out into partial optimal alignments following instructions of the step map. With the similarity and alignment learning strategy, VSAL achieves the state-of-the-art F1-score on VCDB core dataset. Furthermore, we construct a new benchmark of partial video copy detection and localization by adding new segment-level annotations for FIVR-200k dataset, where VSAL also achieves the best performance, verifying its effectiveness in more challenging situations. Our project is publicly available at https://pvcd-vsal.github.io/vsal/.

Abstract:
Spotting facial micro-expressions in videos has various potential applications in fields including clinical diagnosis and interrogation; however, this task remains difficult due to the limited scale of training data. To solve this problem, this paper formulates a new task called micro-expression generation and then presents a strong baseline which combines the first order motion model with facial prior knowledge. Given a target face, we intend to drive the face to generate micro-expression videos according to the motion patterns of source videos. Specifically, our new model involves three modules. First, we extract facial prior features from a region focusing module. Second, we estimate facial motion using key points and local affine transformations with a motion prediction module. Third, an expression generation module is used to drive the target face to generate videos. We train our model on the public CASME II, SAMM and SMIC datasets and then use the model to generate new micro-expression videos for evaluation. Our model achieves first place in the Facial Micro-Expression Challenge 2021 (MEGC2021), where our superior performance is verified by three experts with Facial Action Coding System certification. Source code is provided at https://github.com/Necolizer/Facial-Prior-Based-FOMM.

Abstract:
Many GAN inversion methods have emerged to embed a given real image into the latent space of a GAN for real image editing. These methods usually use a latent space composed of a series of one-dimensional vectors, such as the W+ latent space, as the optimization space for reconstructing real images. However, the images reconstructed by these methods usually fail to preserve the rich detailed information in the real image. How to better preserve details in the real image is still a challenge. To solve this problem, we propose a spatially-adaptive latent space, called SA latent space, and adopt it as the optimization latent space in the GAN inversion task. In particular, we use the affine transformation parameters of each convolutional layer in the generator to form the SA latent space and change the affine transformation parameters from a one-dimensional vector to a spatially-adaptive three-dimensional tensor. With the more expressive latent space, we can better reconstruct the details of the real image. Extensive experiments suggest that the image reconstruction quality can be significantly improved while maintaining the semantic disentanglement ability of the latent code. The code is available at https://github.com/zhang-lingyun/SalS-GAN.

Abstract:
As cameras are increasingly deployed in new application domains such as autonomous driving, performing 3D object detection on monocular images becomes an important task for visual scene understanding. Recent advances in monocular 3D object detection mainly rely on "pseudo-LiDAR" generation, which performs monocular depth estimation and lifts the 2D pixels to pseudo 3D points. However, depth estimation from monocular images, due to its poor accuracy, leads to an inevitable position shift of pseudo-LiDAR points within the object. Therefore, the predicted bounding boxes may suffer from inaccurate location and deformed shape. In this paper, we present a novel neighbor-voting method that incorporates neighbor predictions to ameliorate object detection from severely deformed pseudo-LiDAR point clouds. Specifically, each feature point around the object forms its own prediction, and then the "consensus" is achieved through voting. In this way, we can effectively combine the neighbors' predictions with the local prediction and achieve more accurate 3D detection. To further enlarge the difference between the foreground region of interest (ROI) pseudo-LiDAR points and the background points, we also encode the ROI prediction scores of 2D foreground pixels into the corresponding pseudo-LiDAR points. We conduct extensive experiments on the KITTI benchmark to validate the merits of our proposed method. Our results on bird's eye view detection outperform the state-of-the-art performance, especially for the "hard" level detection. The code is available at https://github.com/cxmomo/Neighbor-Vote.

Abstract:
Most of the existing single-stage and two-stage 3D object detectors are anchor-based methods, while the efficient but challenging anchor-free single-stage 3D object detection is not well investigated. Recent studies on 2D object detection show that the anchor-free methods also are of great potential. However, the unordered and sparse properties of point clouds prevent us from directly leveraging the advanced 2D methods on 3D point clouds. We overcome this by converting the voxel-based sparse 3D feature volumes into the sparse 2D feature maps. We propose an attentive module to fit the sparse feature maps to dense mostly on the object regions through the deformable convolution tower and the supervised mask-guided attention. By directly regressing the 3D bounding box from the enhanced and dense feature maps, we construct a novel single-stage 3D detector for point clouds in an anchor-free manner. We propose an IoU-based detection confidence re-calibration scheme to improve the correlation between the detection confidence score and the accuracy of the bounding box regression. Our code is publicly available at https://github.com/jialeli1/MGAF-3DSSD.

Abstract:
For face presentation attack detection (PAD), most of the spoofing cues are subtle, local image patterns (e.g., local image distortion, 3D mask edges and cut photo edges). The representations of existing PAD works that use a simple global pooling method, however, lose the local feature discriminability. In this paper, the VLAD aggregation method is adopted to quantize local features with a visual vocabulary that locally partitions the feature space, and hence preserves the local discriminability. We further propose the vocabulary separation and adaptation method to modify VLAD for the cross-domain PAD task. The proposed vocabulary separation method divides the vocabulary into domain-shared and domain-specific visual words to cope with the diversity of live and attack faces under the cross-domain scenario. The proposed vocabulary adaptation method imitates the maximization step of the k-means algorithm in the end-to-end training, which guarantees that the visual words are close to the centers of their assigned local features and thus brings robust similarity measurement. We give illustrations and extensive experiments to demonstrate the effectiveness of VLAD with the proposed vocabulary separation and adaptation method on standard cross-domain PAD benchmarks. The codes are available at https://github.com/Liubinggunzu/VLAD-VSA.
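For readers unfamiliar with VLAD, the aggregation referred to in the abstract above can be sketched in its classic hard-assignment form: each local descriptor is assigned to its nearest visual word, residuals are accumulated per word, and the concatenated result is L2-normalized. This is a generic illustration, not the VLAD-VSA model; the function name `vlad`, the toy descriptors, and the two-word vocabulary are assumptions, and the paper's vocabulary separation and adaptation steps are not shown.

```python
import math


def vlad(descriptors, centers):
    """Classic hard-assignment VLAD encoding.

    descriptors: list of local feature vectors.
    centers: list of visual words with the same dimensionality.
    Returns an L2-normalized vector of length len(centers) * dim.
    """
    num_words = len(centers)
    dim = len(centers[0])
    residuals = [[0.0] * dim for _ in range(num_words)]

    for x in descriptors:
        # Assign the descriptor to its nearest visual word.
        nearest = min(
            range(num_words),
            key=lambda c: sum((xi - ci) ** 2 for xi, ci in zip(x, centers[c])),
        )
        # Accumulate the residual between the descriptor and that word.
        for i in range(dim):
            residuals[nearest][i] += x[i] - centers[nearest][i]

    flat = [value for row in residuals for value in row]
    norm = math.sqrt(sum(value * value for value in flat))
    return [value / norm for value in flat] if norm > 0 else flat


# Toy vocabulary with two visual words and three local descriptors.
visual_words = [[0.0, 0.0], [10.0, 10.0]]
local_features = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
encoding = vlad(local_features, visual_words)
```

Because residuals are kept separately for each visual word, the encoding retains which local region of feature space each descriptor fell into, in contrast to a single global average.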

Abstract:
Person re-identification via 3D skeletons is an emerging topic with great potential in security-critical applications. Existing methods typically learn body and motion features from the body-joint trajectory, whereas they lack a systematic way to model body structure and underlying relations of body components beyond the scale of body joints. In this paper, we for the first time propose a Self-supervised Multi-scale Skeleton Graph Encoding (SM-SGE) framework that comprehensively models human body, component relations, and skeleton dynamics from unlabeled skeleton graphs of various scales to learn an effective skeleton representation for person Re-ID. Specifically, we first devise multi-scale skeleton graphs with coarse-to-fine human body partitions, which enables us to model body structure and skeleton dynamics at multiple levels. Second, to mine inherent correlations between body components in skeletal motion, we propose a multi-scale graph relation network to learn structural relations between adjacent body-component nodes and collaborative relations among nodes of different scales, so as to capture more discriminative skeleton graph features. Last, we propose a novel multi-scale skeleton reconstruction mechanism to enable our framework to encode skeleton dynamics and high-level semantics from unlabeled skeleton graphs, which encourages learning a discriminative skeleton representation for person Re-ID. Extensive experiments show that SM-SGE outperforms most state-of-the-art skeleton-based methods. We further demonstrate its effectiveness on 3D skeleton data estimated from large-scale RGB videos. Our codes are open at https://github.com/Kali-Hac/SM-SGE.

Abstract:
Location and appearance are the key cues for video object segmentation. Many sources such as RGB, depth, optical flow and static saliency can provide useful information about the objects. However, existing approaches only utilize RGB, or RGB and optical flow. In this paper, we propose a novel multi-source fusion network for zero-shot video object segmentation. With the help of an interoceptive spatial attention module (ISAM), the spatial importance of each source is highlighted. Furthermore, we design a feature purification module (FPM) to filter the inter-source incompatible features. With the ISAM and FPM, the multi-source features are effectively fused. In addition, we put forward an automatic predictor selection network (APS) to select the better prediction of either the static saliency predictor or the moving object predictor in order to prevent over-reliance on the failed results caused by low-quality optical flow maps. Extensive experiments on three challenging public benchmarks (i.e., DAVIS_16, Youtube-Objects and FBMS) show that the proposed model achieves compelling performance against state-of-the-art methods. The source code will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/Multi-Source-APS-ZVOS.

Abstract:
Multimodal target/aspect sentiment classification combines multimodal sentiment analysis and aspect/target sentiment classification. The goal of the task is to combine vision and language to understand the sentiment towards a target entity in a sentence. Twitter is an ideal setting for the task because it is inherently multimodal, highly emotional, and affects real world events. However, multimodal tweets are short and accompanied by complex, possibly irrelevant images. We introduce a two-stream model that translates images in input space using an object-aware transformer followed by a single-pass non-autoregressive text generation approach. We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model. Our approach increases the amount of text available to the language model and distills the object-level information in complex images. We achieve state-of-the-art performance on two multimodal Twitter datasets without modifying the internals of the language model to accept multimodal data, demonstrating the effectiveness of our translation. In addition, we explain a failure mode of a popular approach for aspect sentiment analysis when applied to tweets. Our code is available at https://github.com/codezakh/exploiting-BERT-thru-translation.

Abstract:
In this companion paper, we provide details of the artifacts to support the replication of "Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework", which was presented at MM'20. The Inter-intra Contrastive (IIC) framework aims to extract more discriminative temporal information by extending intra-negative samples in contrastive self-supervised learning. In this paper, we first summarize our contribution. Then we explain the file structure of the source code and detailed settings. Since our proposal is a framework which contains many different settings, we provide some custom settings to help other researchers use our methods easily. The source code is available at https://github.com/BestJuly/IIC.

Abstract:
This companion paper is provided to describe the major experiments reported in our paper "On Learning Disentangled Representation for Acoustic Event Detection" published in ACM Multimedia 2019. To make the replication of our work easier, we first give an introduction of the computing environment where all of our experiments are conducted. Furthermore, we provide an environmental configuration file to setup the compiling environment and other artifacts including the source code, datasets and the files generated during our experiments. Finally, we summarize the structure and usage of the source code. For more details, please consult the README file in the archive of artifacts on GitHub: https://github.com/mastergofujs/SED_PyTorch.

Abstract:
Person re-identification (Re-ID) aims to match person images across non-overlapping camera views. The majority of Re-ID methods focus on small-scale surveillance systems in which each pedestrian is captured in different camera views of adjacent scenes. However, in large-scale surveillance systems that cover larger areas, it is required to track a pedestrian of interest across distant scenes (e.g., a criminal suspect escapes from one city to another). Since most pedestrians appear in limited local areas, it is difficult to collect training data with cross-camera pairs of the same person. In this work, we study intra-camera supervised person re-identification across distant scenes (ICS-DS Re-ID), which uses cross-camera unpaired data with intra-camera identity labels for training. It is challenging as cross-camera paired data plays a crucial role in learning camera-invariant features in most existing Re-ID methods. To learn camera-invariant representation from cross-camera unpaired training data, we propose a cross-camera feature prediction method to mine cross-camera self supervision information from camera-specific feature distribution by transforming fake cross-camera positive feature pairs and minimizing the distances of the fake pairs. Furthermore, we automatically localize and extract local-level features with a transformer. Joint learning of global-level and local-level features forms a global-local cross-camera feature prediction scheme for mining fine-grained cross-camera self supervision information. Finally, cross-camera self supervision and intra-camera supervision are aggregated in a framework. The experiments are conducted in the ICS-DS setting on Market-SCT, Duke-SCT and MSMT17-SCT datasets. The evaluation results demonstrate the superiority of our method, which gains significant improvements of 15.4 Rank-1 and 22.3 mAP on Market-SCT as compared to the second best method. Our code is available at https://github.com/g3956/CCFP.

Abstract:
A number of deep learning based algorithms have been proposed to recover high-quality videos from low-quality compressed ones. Among them, some restore the missing details of each frame via exploring the spatiotemporal information of neighboring frames. However, these methods usually suffer from a narrow temporal scope, thus may miss some useful details from some frames outside the neighboring ones. In this paper, to boost artifact removal, on the one hand, we propose a Recursive Fusion (RF) module to model the temporal dependency within a long temporal range. Specifically, RF utilizes both the current reference frames and the preceding hidden state to conduct better spatiotemporal compensation. On the other hand, we design an efficient and effective Deformable Spatiotemporal Attention (DSTA) module such that the model can pay more effort on restoring the artifact-rich areas like the boundary area of a moving object. Extensive experiments show that our method outperforms the existing ones on the MFQE 2.0 dataset in terms of both fidelity and perceptual effect. Code is available at https://github.com/zhaominyiz/RFDA-PyTorch.

Abstract:
To obtain good performance, convolutional neural networks are usually over-parameterized. This phenomenon has stimulated two interesting topics: pruning the unimportant weights for compression and reactivating the unimportant weights to make full use of network capability. However, current weight reactivation methods usually reactivate the entire filters, which may not be precise enough. Looking back in history, the prosperity of filter pruning is mainly due to its friendliness to hardware implementation, but pruning at a finer structure level, i.e., weight elements, usually leads to better network performance. We study the problem of weight element reactivation in this paper. Motivated by evolution, we select the unimportant filters and update their unimportant elements by combining them with the important elements of important filters, just like gene crossover to produce better offspring, and the proposed method is called weight evolution (WE). WE is mainly composed of four strategies. We propose a global selection strategy and a local selection strategy and combine them to locate the unimportant filters. A forward matching strategy is proposed to find the matched important filters and a crossover strategy is proposed to utilize the important elements of the important filters for updating unimportant filters. WE is plug-in to existing network architectures. Comprehensive experiments show that WE outperforms the other reactivation methods and plug-in training methods with typical convolutional neural networks, especially lightweight networks. Our code is available at https://github.com/BZQLin/Weight-evolution.

Abstract:
In this paper, we propose AdvHash, the first targeted mismatch attack on deep hashing through an adversarial patch. After being superimposed with the same adversarial patch, any query image with a chosen label will retrieve a set of irrelevant images with the target label. Concretely, we first formulate a set-to-set problem, where a set of samples is pushed into a predefined clustered area in the Hamming space. Then we obtain a target anchor hash code and transform the attack into a set-to-point optimization. In order to generate an image-agnostic stable adversarial patch for a chosen label more efficiently, we propose a product-based weighted gradient aggregation strategy to dynamically adjust the gradient directions of the patch, by exploiting the Hamming distances between training samples and the target anchor hash code and assigning different weights to discriminatively aggregate gradients. Extensive experiments on benchmark datasets verify that AdvHash is highly effective at attacking two state-of-the-art deep hashing schemes. Our codes are available at: https://github.com/CGCL-codes/AdvHash.

Abstract:
Existing image captioning methods focus only on understanding the relationships between objects or instances in a single image, without exploring the contextual correlation that exists among similar images. In this paper, we propose Dual Graph Convolutional Networks (Dual-GCN) with transformer and curriculum learning for image captioning. In particular, we not only use an object-level GCN to capture the object-to-object spatial relations within a single image, but also adopt an image-level GCN to capture the feature information provided by similar images. With the well-designed Dual-GCN, we can make the linguistic transformer better understand the relationships between different objects in a single image and make full use of similar images as auxiliary information to generate a reasonable caption for a single image. Meanwhile, with a cross-review strategy introduced to determine difficulty levels, we adopt curriculum learning as the training strategy to increase the robustness and generalization of our proposed model. We conduct extensive experiments on the large-scale MS COCO dataset, and the experimental results demonstrate that our proposed method outperforms recent state-of-the-art approaches. It achieves a BLEU-1 score of 82.2 and a BLEU-2 score of 67.6. Our source code is available at https://github.com/Unbear430/DGCN-for-image-captioning.

Abstract:
Many convolutional neural network (CNN)-based methods have been proposed to implement face completion with regular holes. However, in practical applications, irregular holes are more common. Moreover, due to the distinct attributes and large variation in appearance of human faces, it is more challenging to fill irregular holes in face images while keeping the content consistent with the remaining region. Since facial attributes (e.g., gender, smiling, pointy nose, etc.) allow for a more understandable description of a face, they can provide hints that benefit the face completion task. In this work, we propose a novel attributes-guided face completion network (AttrFaceNet), which comprises a facial attribute prediction subnet and a face completion subnet. The attribute prediction subnet predicts facial attributes from the remaining parts of the corrupted images and guides the face completion subnet to fill the missing regions. The proposed AttrFaceNet is evaluated in an end-to-end way on the commonly used datasets CelebA and Helen. Extensive experimental results show that our method outperforms state-of-the-art methods qualitatively and quantitatively, especially in large mask size cases. Code is available at https://github.com/FVL2020/AttrFaceNet.

Abstract:
Cross-domain Facial Expression Recognition (FER) is challenging due to the difficulty of concurrently handling the domain shift and semantic gap during domain adaptation. Existing methods mainly focus on reducing the domain discrepancy for transferable features but fail to decrease the semantic one, which may result in negative transfer. To this end, we propose Joint Discriminative and Mutual Adaptation Networks (JDMAN), which collaboratively bridge the domain shift and semantic gap by domain- and category-level co-adaptation based on mutual information and discriminative metric learning techniques. Specifically, we design a mutual information minimization module for domain-level adaptation, which narrows the domain shift by simultaneously distilling the domain-invariant components and eliminating the untransferable ones lying in different domains. Moreover, we propose a semantic metric learning module for category-level adaptation, which can close the semantic discrepancy during discriminative intra-domain representation learning and transferable inter-domain knowledge discovery. These two modules are jointly leveraged in our JDMAN to safely transfer the source knowledge to target data in an end-to-end manner. Extensive experimental results on six databases show that our method achieves state-of-the-art performance. The code of our JDMAN is available at https://github.com/YingjianLi/JDMAN.

Abstract:
Recent years have witnessed a surge of professional user-generated content (PUGC) based video services, coinciding with the accelerated proliferation of video acquisition devices such as mobile phones, wearable cameras, and unmanned aerial vehicles. Different from traditional UGC videos produced by impromptu shooting, PUGC videos produced by professional users tend to be carefully designed and edited, receiving high popularity with a relatively satisfactory playing count. In this paper, we systematically conduct a comprehensive study on the perceptual quality of PUGC videos and introduce a database consisting of 10,000 PUGC videos with subjective ratings. In particular, during the subjective testing, we collect human opinions based not only on the MOS, but also on the attributes that could potentially influence the visual quality, including face, noise, blur, brightness, and color. We attempt to analyze the large-scale PUGC database with a series of video quality assessment (VQA) algorithms, and a dedicated baseline model based on a pretrained deep neural network is further presented. The cross-dataset experiments reveal a large domain gap between PUGC and traditional user-generated videos, which is critical in learning-based VQA. These results shed light on developing next-generation PUGC quality assessment algorithms with desired properties including promising generalization capability, high accuracy, and effectiveness in perceptual optimization. The dataset and the code are released at https://github.com/wlkdb/pugcq_create.

Abstract:
Video analytics with Deep Neural Networks (DNNs) empowers many vision-based applications. However, deploying DNN models for video analytics services must address the challenges of computational capacity, service delay, and cost. Leveraging the edge-cloud collaboration to address these problems has become a growing trend. This paper provides the multimedia research community with an open source framework named SmartEye for real-time video analytics by leveraging the edge-cloud collaboration. The system consists of 1) an edge layer which enables video preprocessing, model selection, on-edge inference, and task offloading; 2) a request forwarding layer which serves as a gateway of the cloud and forwards the offloaded tasks to backend workers; and 3) a backend worker layer that processes the offloaded tasks with specified DNN models. One can easily customize the policies for preprocessing, offloading, model selection, and request forwarding. The framework can facilitate research and development in this field. The project is released as an open source project on GitHub at https://github.com/MSNLAB/SmartEye.

Abstract:
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work, where systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. The experiments demonstrate that TalkNet achieves 3.5% and 2.2% improvement over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and Columbia ASD dataset, respectively. Code has been made available at https://github.com/TaoRuijie/TalkNet_ASD.

Abstract:
Existing blind image quality assessment (BIQA) methods have made great progress in various task-specific applications, including the synthetic, authentic, or over-enhanced distortion evaluations. However, limited by the static model and once-for-all learning strategy, they failed to perform the cross-task evaluations in many practical applications, where diverse evaluation criteria and distortion types are constantly emerging. To address this issue, in this paper, we propose a dynamic Remember and Reuse (R&R) network, which efficiently performs the cross-task BIQA based on a novel relevance-aware incremental learning strategy. Given multiple evaluation tasks across different distortion types or databases, our R&R network sequentially updates the parameters for every task one by one. After each update step, part of task-specific parameters is settled, which ensures R&R Remembers their dedicated evaluation preferences. The remaining parameters are pruned for the dynamic usage of the subsequent tasks. To further exploit the correlation between different tasks, we feed the training data of a new task to previously settled parameters. Better prediction accuracy is considered as higher task relevance and vice versa. Then, we selectively Reuse parts of previously settled parameters, whose proportion is adaptively determined by the task relevance. Extensive experiments show that the proposed method efficiently achieves the cross-task BIQA without catastrophic forgetting, and significantly outperforms many state-of-the-art methods. Code is available at https://github.com/maruiperfect/R-R-Net.

Abstract:
Depth map super-resolution is a task with high practical application requirements in the industry. Existing color-guided depth map super-resolution methods usually necessitate an extra branch to extract high-frequency detail information from RGB image to guide the low-resolution depth map reconstruction. However, because there are still some differences between the two modalities, direct information transmission in the feature dimension or edge map dimension cannot achieve satisfactory result, and may even trigger texture copying in areas where the structures of the RGB-D pair are inconsistent. Inspired by the multi-task learning, we propose a joint learning network of depth map super-resolution (DSR) and monocular depth estimation (MDE) without introducing additional supervision labels. For the interaction of two subnetworks, we adopt a differentiated guidance strategy and design two bridges correspondingly. One is the high-frequency attention bridge (HABdg) designed for the feature encoding process, which learns the high-frequency information of the MDE task to guide the DSR task. The other is the content guidance bridge (CGBdg) designed for the depth map reconstruction process, which provides the content guidance learned from DSR task for MDE task. The entire network architecture is highly portable and can provide a paradigm for associating the DSR and MDE tasks. Extensive experiments on benchmark datasets demonstrate that our method achieves competitive performance. Our code and models are available at https://rmcong.github.io/proj_BridgeNet.html.

Abstract:
Generally, a surround-view system (SVS), which is an indispensable component of advanced driving assistant systems (ADAS), consists of four to six wide-angle fisheye cameras. As long as both intrinsics and extrinsics of all cameras have been calibrated, a top-down surround-view with the real scale can be synthesized at runtime from fisheye images captured by these cameras. However, when the vehicle is driving on the road, relative poses between cameras in the SVS may change from the initial calibrated states due to bumps or collisions. If the extrinsics' representations are not adjusted accordingly, obvious geometric misalignment will appear on the surround-view. Currently, research on correcting the extrinsics of the SVS in an online manner is quite sporadic, and a mature and robust pipeline is still lacking. As an attempt to fill this research gap to some extent, in this work, we present a novel extrinsics correction pipeline designed specifically for the SVS, namely ROECS (Robust Online Extrinsics Correction of the Surround-view system). Specifically, a "refined bi-camera error" model is first designed. Then, by minimizing the overall "bi-camera error" within a sparse and semi-direct framework, the SVS's extrinsics can be iteratively optimized and eventually become accurate. Besides, an innovative three-step pixel selection strategy is also proposed. The superior robustness and the generalization capability of ROECS are validated by both quantitative and qualitative experimental results. To make the results reproducible, the collected data and the source code have been released at https://cslinzhang.github.io/ROECS/.

Abstract:
Most existing 3D human pose estimation approaches focus mainly on predicting 3D positional relationships between the root joint and other human joints (local motion) instead of the overall trajectory of the human body (global motion). Despite the great progress achieved by these approaches, they are not robust to global motion, and lack the ability to accurately predict local motion with a small movement range. To alleviate these two problems, we propose a relative information encoding method that yields positional and temporal enhanced representations. First, we encode positional information by utilizing relative coordinates of 2D poses to enhance the consistency between the input and output distributions. The same posture with different absolute 2D positions can be mapped to a common representation, which helps resist the interference of global motion on the prediction results. Second, we encode temporal information by establishing the connection between the current pose and other poses of the same person within a period of time. More attention will be paid to the movement changes before and after the current pose, resulting in better prediction performance on local motion with a small movement range. The ablation studies validate the effectiveness of the proposed relative information encoding method. Besides, we introduce a multi-stage optimization method to the whole framework to further exploit the positional and temporal enhanced representations. Our method outperforms state-of-the-art methods on two public datasets. Code is available at https://github.com/paTRICK-swk/Pose3D-RIE.
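
The two encodings described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "relative coordinates" means subtracting a reference joint (here the joint at `root_index`, a hypothetical parameter) and that the temporal connection can be approximated by per-joint offsets from the current frame.

```python
def to_relative(pose_2d, root_index=0):
    """Express every 2D joint relative to a reference joint.

    pose_2d is a list of (x, y) tuples. Translating the whole pose in the
    image leaves the result unchanged, so the same posture at different
    absolute positions maps to one representation.
    """
    root_x, root_y = pose_2d[root_index]
    return [(x - root_x, y - root_y) for x, y in pose_2d]


def temporal_offsets(pose_sequence, current_index):
    """Per-joint offsets of every frame in a window from the current frame."""
    current = pose_sequence[current_index]
    return [
        [(x - cx, y - cy) for (x, y), (cx, cy) in zip(pose, current)]
        for pose in pose_sequence
    ]
```

For example, shifting every joint of a pose by the same amount yields an identical `to_relative` output, which is the property the abstract relies on to resist global motion.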

Abstract:
Food logo detection plays an important role in multimedia because of its wide real-world applications, such as food recommendation in self-service shops and infringement detection on e-commerce platforms. A large-scale food logo dataset is urgently needed for developing advanced food logo detection algorithms. However, there are no available food logo datasets with food brand information. To support efforts towards food logo detection, we introduce FoodLogoDet-1500, a new large-scale publicly available food logo dataset, which has 1,500 categories, about 100,000 images, and about 150,000 manually annotated food logo objects. We describe the collection and annotation process of FoodLogoDet-1500, analyze its scale and diversity, and compare it with other logo datasets. To the best of our knowledge, FoodLogoDet-1500 is the first and largest publicly available high-quality dataset for food logo detection. The challenge of food logo detection lies in the large number of categories and the similarities between food logo categories. To address this, we propose a novel food logo detection method, the Multi-scale Feature Decoupling Network (MFDNet), which decouples classification and regression into two branches and focuses on the classification branch to solve the problem of distinguishing multiple food logo categories. Specifically, we introduce the feature offset module, which utilizes deformation learning for optimal classification offset and can effectively obtain the most representative features of classification in detection. In addition, we adopt a balanced feature pyramid in MFDNet, which pays attention to global information, balances the multi-scale feature maps, and enhances feature extraction capability. Comprehensive experiments on FoodLogoDet-1500 and two other popular benchmark logo datasets demonstrate the effectiveness of the proposed method. The code and FoodLogoDet-1500 can be found at https://github.com/hq03/FoodLogoDet-1500-Dataset.

Abstract:
Based on the powerful feature extraction ability of deep learning architectures, deep-learning-based watermarking algorithms have recently been widely studied. The basic framework of such algorithms is an auto-encoder-like end-to-end architecture with an encoder, a noise layer, and a decoder. The key to guaranteeing robustness is adversarial training with a differentiable noise layer. However, we found that none of the existing frameworks can ensure robustness against JPEG compression, which is non-differentiable but is an essential and important image processing operation. To address this limitation, we propose a novel end-to-end training architecture, which utilizes a Mini-Batch of Real and Simulated JPEG compression (MBRS) to enhance JPEG robustness. Precisely, for different mini-batches, we randomly choose one of real JPEG, simulated JPEG, and a noise-free layer as the noise layer. Besides, we suggest utilizing Squeeze-and-Excitation blocks, which can learn better features in the embedding and extracting stages, and propose a "message processor" to expand the message in a more appropriate way. Meanwhile, to improve robustness against crop attacks, we introduce an additive diffusion block into the network. Extensive experimental results demonstrate the superior performance of the proposed scheme compared with state-of-the-art algorithms. Under JPEG compression with quality factor Q=50, our models achieve a bit error rate of less than 0.01% for extracted messages, with PSNR larger than 36 for the encoded images, which shows well-enhanced robustness against JPEG attacks. Besides, under many other distortions such as Gaussian filter, crop, cropout, and dropout, the proposed framework also obtains strong robustness. The code, implemented in PyTorch, is available at https://github.com/jzyustc/MBRS.
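
The per-mini-batch choice of noise layer and the bit error rate used for evaluation can be sketched in a few lines. This is a hedged illustration only: the layer names are placeholder strings, the choice is assumed to be uniform (the abstract says only "randomly"), and no actual JPEG processing is performed.

```python
import random

# Placeholder labels for the three candidate noise layers (hypothetical names).
NOISE_LAYERS = ("real_jpeg", "simulated_jpeg", "identity")


def pick_noise_layer(rng):
    """Pick one noise layer for the current mini-batch (assumed uniform)."""
    return rng.choice(NOISE_LAYERS)


def bit_error_rate(sent_bits, decoded_bits):
    """Fraction of message bits that were decoded incorrectly."""
    if len(sent_bits) != len(decoded_bits):
        raise ValueError("messages must have the same length")
    if not sent_bits:
        raise ValueError("messages must not be empty")
    errors = sum(1 for s, d in zip(sent_bits, decoded_bits) if s != d)
    return errors / len(sent_bits)
```

Mixing the three layers across mini-batches lets gradients flow through the simulated and noise-free batches while the real-JPEG batches expose the decoder to true compression artifacts.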

Abstract:
Facial landmarks (FLM) estimation is a critical component in many face-related applications. In this work, we aim to optimize for both accuracy and speed and explore the trade-off between them. Our key observation is that not all faces are created equal. Frontal faces with neutral expressions converge faster than faces with extreme poses or expressions. To differentiate among samples, we train our model to predict the regression error after each iteration. If the current iteration is accurate enough, we stop iterating, saving redundant iterations while keeping the accuracy in check. We also observe that as neighboring patches overlap, we can infer all facial landmarks (FLMs) with only a small number of patches without a major accuracy sacrifice. Architecturally, we offer a multi-scale, patch-based, lightweight feature extractor with a fine-grained local patch attention module, which computes a patch weighting according to the information in the patch itself and enhances the expressive power of the patch features. We analyze the patch attention data to infer where the model is attending when regressing facial landmarks and compare it to face attention in humans. Our model runs in real-time on a mobile device GPU, with 95 Mega Multiply-Add (MMA) operations, outperforming all state-of-the-art methods under 1000 MMA, with a normalized mean error of 8.16 on the 300W challenging dataset. The code is available at https://github.com/ligaripash/MuSiCa
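
The early-stopping idea in this abstract (stop iterating once the predicted regression error is small enough) can be sketched as a generic loop. This is an illustration under stated assumptions: `refine_step` and `predict_error` are hypothetical callables standing in for the model's refinement stage and its learned error predictor, and `tolerance` and `max_iters` are illustrative parameters.

```python
def refine_with_early_exit(initial, refine_step, predict_error,
                           max_iters=5, tolerance=0.05):
    """Iteratively refine an estimate, stopping once it looks accurate enough.

    Returns the final estimate and the number of iterations actually run.
    """
    estimate = initial
    for iteration in range(1, max_iters + 1):
        estimate = refine_step(estimate)
        # Stop as soon as the predicted error falls within the tolerance,
        # which saves the remaining iterations for easy samples.
        if predict_error(estimate) <= tolerance:
            return estimate, iteration
    return estimate, max_iters
```

With such a loop, easy samples exit after few iterations, while hard samples use the full iteration budget.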

Abstract:
In this paper, we present an Intersection-over-Union (IoU) guided two-stage 3D object detector with a voxel-to-point decoder. To preserve the necessary information from all raw points and maintain the high box recall in the voxel-based Region Proposal Network (RPN), we propose a residual voxel-to-point decoder to extract the point features in addition to the map-view features from the voxel-based RPN. We use a 3D Region of Interest (RoI) alignment to crop and align the features with the proposal boxes for accurately perceiving the object position. The RoI-Aligned features are finally aggregated with the corner geometry embeddings that can provide the potentially missing corner information in the box refinement stage. We propose a simple and efficient method to align the estimated IoUs to the refined proposal boxes as a more relevant localization confidence. Comprehensive experiments on the KITTI and Waymo Open datasets demonstrate that our method achieves significant improvements with novel architectures against the existing methods. The code is available on GitHub at https://github.com/jialeli1/From-Voxel-to-Point.
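
For readers unfamiliar with the IoU quantity that guides this detector, the following minimal sketch computes it for two axis-aligned 3D boxes. This is a simplification stated as an assumption: boxes in 3D detection benchmarks are typically rotated around the vertical axis, which this sketch does not handle, and the box layout `(x_min, y_min, z_min, x_max, y_max, z_max)` is chosen only for illustration.

```python
def box_volume(box):
    """Volume of an axis-aligned box (x_min, y_min, z_min, x_max, y_max, z_max)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d_axis_aligned(box_a, box_b):
    """Intersection-over-Union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(box_a[axis], box_b[axis])
        high = min(box_a[axis + 3], box_b[axis + 3])
        if high <= low:
            # No overlap along this axis means no overlap at all.
            return 0.0
        intersection *= high - low
    union = box_volume(box_a) + box_volume(box_b) - intersection
    return intersection / union
```

A value of 1.0 means the boxes coincide and 0.0 means they do not overlap, which is why an estimated IoU can serve as a localization confidence.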

Abstract:
Macro- and micro-expression spotting, which locates the occurrence intervals of expressions in long face videos, is a very challenging task. In this paper, we propose an efficient two-stream network named location suppression based spotting network (LSSNet), which includes three parts. First, the optical flow is extracted using the traditional TV-L1 algorithm, which captures subtle facial movements while adding temporal information to alleviate the problem of insufficient samples. Then, fixed-length features are extracted from the sampled optical flow and raw images by an I3D model, which is used to set sliding windows. Finally, location suppression modules (LSMs) are added to the pyramidal convolutional neural network (CNN) to reduce proposals whose intervals are too long or too short. In addition, we use two different methods, named top_k and top_threshold, for validation. We adopt leave-one-subject-out (LOSO) to train our model on CAS(ME)² and SAMM-LV. Experimental results show that our LSSNet achieves the state-of-the-art result with top_threshold, especially on the CAS(ME)² dataset. The code is available at https://github.com/williamlee91/mer_spot.

Abstract:
Predicting dense depth accurately is essential for 3D scene understanding applications such as autonomous driving and robotics. However, the depth obtained from commercially available LiDAR and Time-of-Flight sensors is very sparse. With RGB color guidance, modern convolutional neural network (CNN) based approaches can recover the missing depth information. However, there could be scenarios such as low-light environments where it might be difficult to get an associated RGB image with the sparse depth. In this work, we propose a Generative Adversarial Network (GAN) that can accurately predict the dense depth using only sparse samples without any RGB inputs. Generally, the sparsity in the depth samples is uniformly distributed and cannot guarantee capturing all intricate details. In this study, we also explore different variants of sparse sampling strategies, from uniform to feature-based directed sampling. We find that feature-based intelligent sampling enjoys a better compression ratio without sacrificing intricate details, saving data communication bandwidth. Compared to uniform sampling, depending on how aggressively the directed sampling is done, we observe about 3% to 25% reduction in size. We can easily reduce the size by 8% with directed sampling without sacrificing the reconstruction accuracy. Although such directed sampling strategies are not readily available with commercially viable depth sensors, we believe that our study paves the way for future intelligent sensing and sampling strategies. To further investigate data reduction and reconstruction accuracy trade-offs, we deploy our GAN to generate higher-resolution dense depth from 4 times smaller sparse samples. With a slight decrease in accuracy, our GAN is able to recover the depth successfully, which shows great promise in edge Internet of Things (IoT) applications where we have very tight constraints on data transmission bandwidth. Our source code along with examples is available at https://github.com/kocchop/depth-completion-gan.

Abstract:
Lossy image compression always faces a tradeoff between rate-distortion performance and compression/decompression speed. With the advent of neural image compression, hardware (GPU) becomes the new vertex in the tradeoff triangle. By resolving the high GPU dependency and improving the low speed of neural models, this paper proposes two non-GPU models that get the best of the three worlds. First, the CPU-friendly Independent Separable Down-Sampling (ISD) and Up-Sampling (ISU) modules are proposed to lighten the network while ensuring a large receptive field. Second, an asymmetric autoencoder architecture is adopted to boost the decoding speed. At last, the Inverse Quantization Residual (IQR) module is proposed to reduce the error caused by quantization. In terms of rate-distortion performance, our network surpasses the state-of-the-art real-time GPU neural compression work at medium and high bit rates. In terms of speed, our model's compression and decompression speeds surpass all other traditional compression methods except JPEG, using only CPUs. In terms of hardware, the proposed models are CPU friendly and perform stably well in a non-GPU environment. The code is publicly available at https://github.com/kengchikengchi/FasiNet.

Abstract:
Meta-learning offers an effective solution to learn new concepts with scarce supervision through an episodic training scheme: a series of target-like tasks sampled from base classes are sequentially fed into a meta-learner to extract common knowledge across tasks, which can facilitate the quick acquisition of task-specific knowledge of the target task with few samples. Despite its noticeable improvements, the episodic training strategy samples tasks randomly and uniformly, without considering their hardness and quality, which may not progressively improve the meta-learner's generalization ability. In this paper, we present a Curriculum-Based Meta-learning (CubMeta) method to train the meta-learner using tasks from easy to hard. Specifically, CubMeta proceeds progressively, and in each step, we design a module named BrotherNet to establish harder tasks and an effective learning scheme for obtaining an ensemble of stronger meta-learners. In this way, the meta-learner's generalization ability can be progressively improved, and better performance can be obtained even with fewer training tasks. We evaluate our method for few-shot classification on two benchmarks, mini-ImageNet and tiered-ImageNet, where it achieves consistent performance improvements on various meta-learning paradigms.

Abstract:
We study few-shot learning (FSL) under multi-agent scenarios, in which participating agents only have local scarce labeled data and need to collaborate to predict query data labels. Though each of the agents, such as drones and robots, has minimal communication and computation capability, we aim at designing coordination schemes such that they can collectively perceive the environment accurately and efficiently. We propose a novel metric-based multi-agent FSL framework which has three main components: an efficient communication mechanism that propagates compact and fine-grained query feature maps from query agents to support agents; an asymmetric attention mechanism that computes region-level attention weights between query and support feature maps; and a metric-learning module which calculates the image-level relevance between query and support data fast and accurately. Through analysis and extensive numerical studies, we demonstrate that our approach can save communication and computation costs and significantly improve performance in both visual and acoustic perception tasks such as face identification, semantic segmentation, and sound genre recognition.

Abstract:
Stippling is a popular and fascinating sketching art in stylized illustrations. Various digital stippling techniques have been proposed to reduce tedious manual work. In this paper, we present a novel method to create high-quality color stippling from an input image in milliseconds. The key idea is to obtain stipples with predetermined incremental 2D sample sequences, which algorithms generate with sequential incrementality and distributional uniformity features. Two typical sequences are employed in our work: one is constructed from incremental Voronoi sets, and the other is from Poisson disk distributions. A threshold-based algorithm is then applied to determine stipple appearance and guarantee result quality. We extend color stippling with multitone level and radius adjustment to achieve improved visual quality. Detailed comparisons of the two sequences are conducted to explore further the strengths and weaknesses of the proposed method. For more information, please visit https://gitlab.com/maleiwhat/milliseconds-color-stippling.

Abstract:
Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner, i.e., generating descriptions word by word, which suffers from slow decoding and becomes a bottleneck in real-time applications. Non-autoregressive image captioning with continuous iterative refinement, which eliminates the sequential dependence in sentence generation, can achieve comparable performance to the autoregressive counterparts with a considerable acceleration. Nevertheless, based on a well-designed experiment, we empirically show that the number of iterations can be effectively reduced when sufficient prior knowledge is provided to the language decoder. To that end, we propose a novel two-stage framework, referred to as Semi-Autoregressive Image Captioning (SAIC), to make a better trade-off between performance and speed. The proposed SAIC model maintains the autoregressive property globally but relaxes it locally. Specifically, the SAIC model first generates an intermittent sequence in an autoregressive manner by skipping ahead, that is, it predicts the first word in every word group in order. Then, with the help of the partially deterministic prior information and image features, the SAIC model non-autoregressively fills all the skipped words with one iteration. Experimental results on the MS COCO benchmark demonstrate that our SAIC model outperforms the preceding non-autoregressive image captioning models while obtaining a competitive inference speedup.
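
The two-stage decoding order described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions: word groups are assumed to have a fixed size (`group_size`), and `next_anchor_word` and `fill_skipped` are hypothetical callables standing in for the autoregressive and non-autoregressive decoders.

```python
def split_positions(length, group_size):
    """Return (anchor positions, skipped positions) for a sentence of `length` words.

    Anchors are the first position of each fixed-size word group.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    anchors = list(range(0, length, group_size))
    skipped = [i for i in range(length) if i % group_size != 0]
    return anchors, skipped


def semi_autoregressive_decode(length, group_size, next_anchor_word, fill_skipped):
    """Decode a sentence in two stages.

    Stage 1: generate the first word of every group one after another, each
    conditioned on the anchor words generated so far.
    Stage 2: fill all skipped positions in a single parallel call.
    """
    anchors, skipped = split_positions(length, group_size)
    words = [None] * length
    generated = []
    for position in anchors:
        word = next_anchor_word(generated)
        generated.append(word)
        words[position] = word
    # fill_skipped receives the partial sentence and returns {position: word}.
    filled = fill_skipped(list(words), skipped)
    for position in skipped:
        words[position] = filled[position]
    return words
```

With a group size of 2, stage 1 takes roughly half as many sequential steps as fully autoregressive decoding, and stage 2 adds one parallel step.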

Abstract:
Recognizing images with long-tailed distributions remains a challenging problem, and an interpretable mechanism for solving it is still lacking. In this study, we formulate Long-tailed recognition as Domain Adaption (LDA), by modeling the long-tailed distribution as an unbalanced domain and the general distribution as a balanced domain. Within the balanced domain, we propose to relax the generalization error bound, which is defined upon the empirical risks of the unbalanced and balanced domains and the divergence between them. We propose to jointly optimize the empirical risks of the unbalanced and balanced domains and approximate their domain divergence by intra-class and inter-class distances, with the aim of adapting models trained on the long-tailed distribution to general distributions in an interpretable way. Experiments on benchmark datasets for image recognition, object detection, and instance segmentation validate that our LDA approach, beyond its interpretability, achieves state-of-the-art performance.

Abstract:
This work targets the problems of comprehensive video captioning and the generation of multiple descriptions from different perspectives, termed Multi-Perspective Video Captioning. We build and release a dataset named VidOR-MPVC, the first dataset for multi-perspective video captioning, where each video is annotated with multiple descriptions from different perspectives. We also propose a novel model, dubbed perspective-aware captioner (PAC), which is capable of mining the various perspectives in a video and generating a description from each perspective. More specifically, a perspective generator is designed to perceive video content with perspective preferences, followed by a language generator equipped with a perspective-aware attention mechanism. As our new task expects multiple descriptions to be produced for a video, existing evaluation metrics fail to handle this situation. To address this problem, we devise maximum matching scores based on existing metrics for an overall evaluation that aims to cover the aspects of semantic similarity, completeness, and compactness. The experimental results demonstrate that our model is able to describe videos with multiple descriptions from different perspectives.

Abstract:
This tutorial discusses the trustworthiness issue in multimedia analysis. Starting from introducing two types of spurious correlations learned from distilling human knowledge, we partition the (visual) feature space along two dimensions of task-relevance and semantic-orientation. Trustworthy multimedia analysis ideally relies on the task-relevant semantic features and consists of three modules as trainer, interpreter and tester. These three modules essentially form a closed loop, which respectively address goals of extracting task-relevant features, extracting task-relevant semantic features, and detecting spurious correlations to be corrected by the trainer and interpreter.

Abstract:
Video coding systems began with TV broadcasting services over bandwidth-limited satellite and cable networks and were later used for surveillance and internet video. These systems target a higher compression ratio with lower quality loss, under the trade-off of the RDO (rate-distortion optimization) model, as judged by human experts. In other words, current video coding standards are designed for people and for human visual perception, not for machine intelligence. However, more and more industry applications today require video coding for machines (VCM), which compresses images and video for machine use such as object detection and/or tracking, image classification, and event analysis. These applications target a higher compression ratio with higher recognition accuracy, under the trade-off of the RAO (rate-accuracy optimization) model, as judged by the system. In this case, video coding needs to perform feature compression, which preserves and transmits the information most critical for computer vision and pattern recognition rather than for human visual perception. Video coding for humans and video coding for machines are therefore quite different, even though the two systems will coexist for a long time. In this talk, I will introduce the history of VCM, list some early works on pattern analysis in the compressed data domain, describe efforts from the ISO/IEC MPEG group on MPEG-7 CDVS (compact descriptors for visual search) and CDVA (compact descriptors for video analysis), cover ongoing projects in the AVS and MPEG working groups, present the key techniques and challenges of VCM, and give an overview of its future.

Abstract:
The combination of the traditional convolutional network (i.e., an auto-encoder) and the graph convolutional network has attracted much attention in clustering, in which the auto-encoder extracts the node attribute feature and the graph convolutional network captures the topological graph feature. However, the existing works (i) lack a flexible combination mechanism to adaptively fuse those two kinds of features for learning the discriminative representation and (ii) overlook the multi-scale information embedded at different layers for subsequent cluster assignment, leading to inferior clustering results. To this end, we propose a novel deep clustering method named Attention-driven Graph Clustering Network (AGCN). Specifically, AGCN exploits a heterogeneity-wise fusion module to dynamically fuse the node attribute feature and the topological graph feature. Moreover, AGCN develops a scale-wise fusion module to adaptively aggregate the multi-scale features embedded at different layers. Based on a unified optimization framework, AGCN can jointly perform feature learning and cluster assignment in an unsupervised fashion. Compared with the existing deep clustering methods, our method is more flexible and effective since it comprehensively considers the numerous and discriminative information embedded in the network and directly produces the clustering results. Extensive quantitative and qualitative results on commonly used benchmark datasets validate that our AGCN consistently outperforms state-of-the-art methods.

Abstract:
Traditional learning systems are trained in closed-world for a fixed number of classes, and need pre-collected datasets in advance. However, new classes often emerge in real-world applications and should be learned incrementally. For example, in electronic commerce, new types of products appear daily, and in a social media community, new topics emerge frequently. Under such circumstances, incremental models should learn several new classes at a time without forgetting. We find a strong correlation between old and new classes in incremental learning, which can be applied to relate and facilitate different learning stages mutually. As a result, we propose CO-transport for class Incremental Learning (COIL), which learns to relate across incremental tasks with the class-wise semantic relationship. In detail, co-transport has two aspects: prospective transport tries to augment the old classifier with optimal transported knowledge as fast model adaptation. Retrospective transport aims to transport new class classifiers backward as old ones to overcome forgetting. With these transports, COIL efficiently adapts to new tasks, and stably resists forgetting. Experiments on benchmark and real-world multimedia datasets validate the effectiveness of our proposed method.

Abstract:
Recent studies of deep learning based stereo image super-resolution (StereoSR) have promoted the development of StereoSR. However, existing StereoSR models mainly concentrate on improving quantitative evaluation metrics and neglect the visual quality of super-resolved stereo images. To improve the perceptual performance, this paper proposes the first perception-oriented stereo image super-resolution approach by exploiting the feedback, provided by the evaluation on the perceptual quality of StereoSR results. To provide accurate guidance for the StereoSR model, we develop the first special stereo image super-resolution quality assessment (StereoSRQA) model, and further construct a StereoSRQA database. Extensive experiments demonstrate that our StereoSR approach significantly improves the perceptual quality and enhances the reliability of stereo images for disparity estimation.

Abstract:
Rendering plays an important role in many fields such as virtual reality and film, but its heavy dependence on computing resources and human experience hinders its application. With the development of deep learning, neural rendering has attracted much attention because of its impressive performance and its higher efficiency compared with traditional rendering. In this paper, we mainly introduce two neural rendering works: one on rendering simulation and the other on image-based novel view rendering. Moreover, we also discuss potential applications (i.e. data augmentation) based on the results of neural rendering, which have received little attention.

Abstract:
Conventional Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a well-labeled source domain to an unlabeled target domain only when data from both domains is simultaneously accessible, which is challenged by the recent Source-free Domain Adaptation (SFDA). However, we notice that the performance of existing SFDA methods would be dramatically degraded by intra-domain class imbalance and inter-domain label shift. Unfortunately, class-imbalance is a common phenomenon in real-world domain adaptation applications. To address this issue, we present Imbalanced Source-free Domain Adaptation (ISFDA) in this paper. Specifically, we first train a uniformed model from the source domain, and then propose secondary label correction, curriculum sampling, plus intra-class tightening and inter-class separation to overcome the joint presence of covariate shift and label shift. Extensive experiments on three imbalanced benchmarks verify that ISFDA could perform favorably against existing UDA and SFDA methods under various conditions of class-imbalance, and outperform existing SFDA methods by over 15% in terms of per-class average accuracy on a large-scale long-tailed imbalanced dataset.
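
The per-class average accuracy mentioned above weights every class equally, which is why it is preferred over overall accuracy on imbalanced data. A minimal sketch of the metric follows; the function names and the toy labels are illustrative and not taken from the paper.

```python
from collections import defaultdict


def per_class_average_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced set: 8 samples of head class 0, 2 samples of tail class 1.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10  # a classifier that always predicts the head class

print(overall_accuracy(y_true, y_pred))            # dominated by the head class
print(per_class_average_accuracy(y_true, y_pred))  # exposes the ignored tail class
```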

Abstract:
With the rapid development of deep learning-based techniques, the general public can use many "machine learning as a service" (MLaaS) offerings, which provide end-to-end machine learning solutions. Taking the image classification task as an example, users only need to upload their dataset and labels to MLaaS, without requiring specific knowledge of machine learning or a concrete classifier structure. Afterward, MLaaS returns a well-trained classifier to them. In this paper, we explore a potential novel task named "deep neural network retrieval" and its application, which helps MLaaS save computation resources. MLaaS usually owns a huge number of well-trained models for various tasks and datasets. If a user requests a task that is similar to one finished previously, MLaaS can quickly retrieve a model rather than training from scratch. We propose a pragmatic solution and two different approaches to extract the semantic feature of a DNN, which represents the function of the DNN, analogous to the usage of word2vec in natural language processing. The semantic feature of a DNN can be expressed as a vector by feeding some well-designed litmus images into the DNN, or as a matrix by reversely constructing the most desired input of the DNN. Both methods can consider the topological information and parameters of the DNN simultaneously. Extensive experiments, including multiple datasets and networks, demonstrate the efficiency of our method and show the high accuracy of deep neural network retrieval.

Abstract:
Video Visual Relation Detection (VidVRD) aims to semantically describe the dynamic interactions across visual concepts localized in a video in the form of (subject, predicate, object) triplets. It can help to mitigate the semantic gap between vision and language in video understanding, and has thus received increasing attention in multimedia communities. Existing efforts primarily leverage multimodal/spatio-temporal feature fusion to augment the representation of object trajectories as well as their interactions, and formulate the prediction of predicates as a multi-class classification task. Despite their effectiveness, existing models ignore the severe long-tailed bias in VidVRD datasets. As a result, the models' predictions are easily biased towards the popular head predicates (e.g., next-to and in-front-of), leading to poor generalizability.

Abstract:
Motion capture (MoCap) technology aims to provide an accurate record of human motion, with specific potentials in activity analysis, human behavior understanding, as well as multimedia industries of animation production and special effects movies. However, because of joint occlusion and limitation of equipment precision, the raw motion data are often damaged, which severely hinders its downstream applications. The latest method relies on deep neural networks to reconstruct the underlying complete motion from the degraded observation, achieving remarkable results. Unfortunately, due to the non-enumerability of human motion, the trained model from large-scale training data often fails to comprehensively cover incomputable action categories, which may lead to a sharp decline in the performance of deep learning-based methods. To handle these limitations, we propose an untrained deep generative model, in which Graph Convolutional Networks (GCNs) are utilized to efficiently capture complicated topological relationships of human joints. We show that the untrained GCN architecture with randomly-initialized weights is sufficient to extract some low-level statistics for human motion reconstruction without any training process. Notably, the performance of our approach is comparable to that of those trained models, while its application is not restricted by the availability of training data or a pre-trained network. Moreover, the proposed model even surpasses the state-of-the-art methods when encountering unprecedented samples in the human action database, regardless of the tasks of human motion recovery and gap-filling problem.

Abstract:
Inspired by the success of BERT, several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. In particular, LXMERT and UNITER adopt visual region feature regression and label classification as pretext tasks. However, they tend to suffer from the problems of noisy labels and sparse semantic annotations, based on the visual features having been pretrained on a crowdsourced dataset with limited and inconsistent semantic labeling. To overcome these issues, we propose unbiased Dense Contrastive Visual-Linguistic Pretraining (DCVLP), which replaces the region regression and classification with cross-modality region contrastive learning that requires no annotations. Two data augmentation strategies (Mask Perturbation and Intra-Inter-Adversarial Perturbation) are developed to improve the quality of negative samples used in contrastive learning. Overall, DCVLP allows cross-modality dense region contrastive learning in a self-supervised setting independent of any object annotations. We compare our method against prior visual-linguistic pretraining frameworks to validate the superiority of dense contrastive learning on multimodal representation learning.

Abstract:
Deep learning has achieved great success in recognizing video actions, but the collection and annotation of training data are still quite laborious, mainly in two aspects: (1) the amount of required annotated data is large; (2) temporally annotating the location of each action is time-consuming. Works such as few-shot learning or untrimmed video recognition have been proposed to handle either one aspect or the other. However, very few existing works can handle both issues simultaneously. In this paper, we target a new problem, Annotation-Efficient Video Recognition, to reduce the annotation requirement for both the number of samples and the action locations. This problem is challenging due to two aspects: (1) the untrimmed videos only have weak supervision; (2) video segments not relevant to the current actions of interest (background, BG) could contain actions of interest (foreground, FG) in novel classes, which is a widely existing phenomenon but has rarely been studied in few-shot untrimmed video recognition. To achieve this goal, by analyzing the property of BG, we categorize BG into informative BG (IBG) and non-informative BG (NBG), and we propose (1) an open-set detection based method to find the NBG and FG, (2) a contrastive learning method to learn IBG and distinguish NBG in a self-supervised way, and (3) a self-weighting mechanism to better distinguish IBG and FG. Extensive experiments on ActivityNet v1.2 and ActivityNet v1.3 verify the rationale and effectiveness of the proposed methods.

Abstract:
With the rising popularity of intelligent mobile devices, it is of great practical significance to develop accurate, real-time and energy-efficient image Super-Resolution (SR) methods. A prevailing method for improving inference efficiency is model quantization, which allows for replacing the expensive floating-point operations with efficient bitwise arithmetic. To date, it is still challenging for quantized SR frameworks to deliver a feasible accuracy-efficiency trade-off. Here, we propose a Fully Quantized image Super-Resolution framework (FQSR) to jointly optimize efficiency and accuracy. In particular, we target obtaining end-to-end quantized models for all layers, including skip connections, which have rarely been addressed in the SR quantization literature. We further identify obstacles faced by low-bit SR networks and propose a novel method to counteract them accordingly. The difficulties are: 1) in the SR task, because of the skip connections, high-resolution feature maps occupy a huge amount of memory; 2) activation and weight distributions differ vastly across layers; 3) the approximation introduced by quantization is inaccurate. We apply our quantization scheme to multiple mainstream super-resolution architectures, including SRResNet, SRGAN and EDSR. Experimental results show that our FQSR with low-bit quantization achieves performance on par with the full-precision counterparts on five benchmark datasets and surpasses the state-of-the-art quantized SR methods with significantly reduced computational cost and memory consumption. Code is available at https://git.io/JWxPp.
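
To make the idea of model quantization concrete, the sketch below applies a generic min-max uniform quantizer to a short list of values and measures the rounding error at several bit-widths. It is a textbook illustration under assumed names (`quantize_uniform`, `activations`), not the quantization scheme proposed in FQSR.

```python
def quantize_uniform(values, num_bits):
    """Min-max uniform quantization of floats to num_bits integer codes."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1  # highest integer code
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    dequantized = [lo + c * scale for c in codes]
    return codes, dequantized, scale


# Hypothetical activation values; fewer bits give a coarser grid.
activations = [-1.3, -0.2, 0.0, 0.7, 2.4]
for bits in (8, 4, 2):
    codes, approx, scale = quantize_uniform(activations, bits)
    worst = max(abs(a - b) for a, b in zip(activations, approx))
    print(f"{bits}-bit: codes={codes}, worst-case error={worst:.4f}")
```

The worst-case error is bounded by half the step size, which grows as the bit-width shrinks; this is one reason why layers with very different value ranges are hard to quantize with a single setting.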

Abstract:
The limited spatial and angular resolutions in multi-view multimedia applications restrict their visual experience in practical use. In this paper, we first raise the space-angle super-resolution (SASR) problem for irregularly arranged multi-view images. It aims to jointly increase the spatial resolution of source views and synthesize arbitrary virtual high-resolution (HR) views between them. One feasible solution is to perform super-resolution (SR) and view synthesis (VS) separately. However, this cannot fully exploit the relationship between the SR and VS tasks. Intuitively, multi-view images can provide more angular references, and higher resolution can provide more high-frequency details. Therefore, we propose a one-stage space-angle super-resolution network called SASRnet, which simultaneously synthesizes real and virtual HR views. Extensive experiments on several benchmarks demonstrate that our proposed method outperforms two-stage methods and show that SR and VS can promote each other. To our knowledge, this work is the first to address the SASR problem for unstructured multi-view images in an end-to-end learning-based manner.

Abstract:
Text entry plays an important role in effectively delivering the intention of users to computers, and physical and soft keyboards have been widely used for it. However, with the recent development of technologies like augmented reality and the increase in contactless services due to COVID-19, a more advanced type of text entry is required. To tackle this issue, we propose Air-Text, an intuitive system for writing in the air using fingertips as a pen. Unlike previously suggested air-writing systems, Air-Text provides various functionalities through the seamless integration of air-writing and text-recognition modules. Specifically, the air-writing module takes a sequence of RGB images as input and tracks both the location of fingertips (5.33 pixel error in a 640x480 image) and the current hand gesture class (98.29% classification accuracy) frame by frame. Users can easily perform writing operations such as writing or deleting text by changing hand gestures, and tracked fingertip locations can be stored as a binary image. Then the text-recognition module, which is compatible with any pre-trained recognition model, predicts the written text in the binary image. In this paper, examples of single-digit recognition with an MNIST classifier (96.0% accuracy) and word-level recognition with a text recognition model (79.36% character recognition rate) are provided.
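
The step of storing tracked fingertip locations as a binary image can be sketched as follows. This is an illustrative rasterizer under assumed conventions (integer pixel coordinates, consecutive points joined by straight strokes); the function name and the stroke interpolation are not taken from the paper.

```python
def trajectory_to_binary_image(points, width, height):
    """Rasterize a fingertip trajectory, given as (x, y) pixels, into a 0/1 grid."""
    image = [[0] * width for _ in range(height)]

    def mark(x, y):
        # Ignore points that fall outside the canvas.
        if 0 <= x < width and 0 <= y < height:
            image[y][x] = 1

    if len(points) == 1:
        mark(*points[0])
    # Join consecutive tracked points with a straight stroke.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for s in range(steps + 1):
            mark(round(x0 + (x1 - x0) * s / steps),
                 round(y0 + (y1 - y0) * s / steps))
    return image


# A short horizontal stroke followed by a vertical one.
canvas = trajectory_to_binary_image([(1, 1), (4, 1), (4, 3)], width=6, height=5)
for row in canvas:
    print("".join("#" if pixel else "." for pixel in row))
```

A binary image produced this way can then be resized and passed to any pre-trained recognizer, which matches the module separation described above.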

Abstract:
Neural network pruning and quantization are two major lines of network compression. This raises the natural question of whether we can find the optimal compression by considering multiple network compression criteria in a unified framework. This paper incorporates the two criteria and seeks layer-wise compression by leveraging the meta-learning framework. A regularization loss is applied to unify the constraints on input and output channel numbers and on the bit-widths of network activations and weights, so that the compressed network satisfies a given Bit-OPerations (BOPs) constraint. We further propose an iterative compression constraint for optimizing the compression procedure, which effectively achieves a high compression rate while maintaining the original network performance. Extensive experiments on various networks and vision tasks show that the proposed method yields better performance and compression rates than recent methods. For instance, our method achieves better image classification accuracy and compactness than the recent DJPQ. It achieves performance similar to the recent DHP in image super-resolution while saving about 50% of the computation.
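
A BOPs budget ties channel counts and bit-widths together, as the sketch below illustrates for a single convolution layer. It assumes the common convention that BOPs equal the number of multiply-accumulates times the weight bit-width times the activation bit-width; the paper's exact definition may differ, and the layer sizes are hypothetical.

```python
def conv_macs(c_in, c_out, kernel, out_h, out_w):
    """Multiply-accumulate count of a standard 2D convolution layer."""
    return c_in * c_out * kernel * kernel * out_h * out_w


def conv_bops(c_in, c_out, kernel, out_h, out_w, w_bits, a_bits):
    """Assumed convention: BOPs = MACs * weight bits * activation bits."""
    return conv_macs(c_in, c_out, kernel, out_h, out_w) * w_bits * a_bits


# Hypothetical 3x3 layer with a 32x32 output feature map.
full = conv_bops(64, 64, 3, 32, 32, w_bits=32, a_bits=32)
# Same layer after pruning channels to 48 and quantizing to 8-bit weights
# and 4-bit activations.
small = conv_bops(48, 48, 3, 32, 32, w_bits=8, a_bits=4)

print(f"full precision: {full:,} BOPs")
print(f"compressed:     {small:,} BOPs")
print(f"reduction:      {full / small:.1f}x")
```

Because pruning changes the channel factors and quantization changes the bit-width factors of the same product, a single BOPs constraint can govern both kinds of compression at once.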

Abstract:
One of the intelligent transportation system's critical tasks is to understand traffic signs and convey traffic information to humans. However, most related works focus on the detection and recognition of traffic sign texts or symbols, which is not sufficient for understanding. Besides, there has been no public dataset for traffic sign understanding research. Our work takes the first step towards addressing this problem. First, we propose a "CASIA-Tencent Chinese Traffic Sign Understanding Dataset" (CTSU Dataset), which contains 5000 images of traffic signs with rich semantic descriptions. Second, we introduce a novel multi-task learning architecture that extracts text and symbol information from traffic signs, reasons about the relationship between texts and symbols, classifies signs into different categories, and finally composes the descriptions of the signs. Experiments show that the task of traffic sign understanding is achievable, and our architecture demonstrates state-of-the-art performance. The CTSU Dataset is available at http://www.nlpr.ia.ac.cn/databases/CASIA-Tencent%20CTSU/index.html.

Abstract:
Video conferencing applications have seen explosive growth both in the number of available applications and their use. However, there have been few studies on the detailed analysis of video conferencing applications with respect to network dynamics, yet understanding these dynamics is essential for network design and improving these applications. In this paper, we carry out an in-depth measurement and modeling study on the rate control algorithms used in six popular commercial video conferencing applications. Based on macroscopic behaviors commonly observed across these applications in our extensive measurements, we construct a unified architecture to model the rate control mechanisms of individual applications. We then reconstruct each application's rate control by inferring key parameters that closely follow its rate control and quality adaptation behaviors. To our knowledge, this is the first work that reverse-engineers rate control algorithms of popular video conferencing applications, which are often unknown or hidden as they are proprietary software. We confirm our analysis and models using an end-to-end testbed that can capture the dynamics of each application under a variety of network conditions. We also show how we can use these models to gain insights into the particular behaviors of an application in two practical scenarios.

Abstract:
The low spatial resolution of acquired depth maps is a major drawback of most RGBD sensors. However, there are many scenarios in which fast acquisition of high-resolution and high-quality depth maps would be desirable. One approach to achieve higher quality depth maps is through super-resolution. However, edge preservation is challenging, and artifacts such as depth confusion and blurring are easily introduced near boundaries. In view of this, we propose a method for fast, high-quality hierarchical depth-map super-resolution (HDS). In our method, a high-resolution RGB image is degraded layer by layer to guide the bilateral filtering of the depth map. To improve the upsampled depth map quality, we construct a feature-based bilateral filter (FBF) for the interpolation, by using the extracted RGB shallow and multi-layer features. To accelerate the process, we perform filtering only near depth boundaries and through matrix operations. We also propose an extension of our HDS model to a Classification-based Hierarchical Depth-map Super-resolution (C-HDS) model, where a context-aware trilateral filter reduces the contributions of unreliable neighbors to the current missing depth location. Experimental results show that the proposed method is significantly faster than existing methods for generating high-resolution depth maps, while also significantly improving depth quality compared to the current state-of-the-art approaches, especially for large-scale 16x super-resolution.

Abstract:
The visual question generation task aims to generate meaningful questions about an image according to a target answer. Existing studies mainly focus on only one object related to the target answer in an image to generate a question. However, a target answer is often related to multiple key objects in an image, so focusing on only one object may mislead the model into generating questions that relate only to partial fragments of the answer. To address this problem, we propose a multi-objects aware generation model to capture all key objects related to an answer and generate the corresponding question. We first introduce a co-attention network to capture the relationship between each object in an image and the answer, and then extract the key objects that are related to the answer. Then, a graph network is introduced to capture the relationships between the key objects and other objects in the image that are not related to the answer, which helps generate questions that involve more visual content. Finally, the learned information from the graph network is fed into a standard decoder module to produce questions. Extensive experiments on the VQA v2.0 dataset show that the proposed model outperforms the state-of-the-art models.

Abstract:
Video captioning is a challenging task since it requires generating sentences that describe diverse and complex videos. Existing video captioning models lack adequate visual representation because they neglect the gap between videos and texts. To bridge this gap, in this paper, we propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM). This framework takes full advantage of the information from both vision and language and forces the model to learn strongly text-correlated video features for text generation. Besides, unlike most existing models, which use LSTM or GRU as the sentence decoder, we adopt a Transformer-structured decoder network to effectively learn the long-range visual and language dependency. Additionally, we introduce a novel ensemble strategy for captioning tasks. Experimental results demonstrate the effectiveness of our method on two datasets: 1) on the MSR-VTT dataset, our method achieves a new state-of-the-art result with a significant gain of up to 10% in CIDEr; 2) on the private test data, our method ranks 2nd in the ACM MM multimedia grand challenge 2021: Pre-training for Video Understanding Challenge. Note that our model is trained only on the MSR-VTT dataset.

Abstract:
Hateful and offensive content detection has been extensively explored in a single modality such as text. However, such toxic information could also be communicated via multimodal content such as online memes. Therefore, detecting multimodal hateful content has recently garnered much attention in academic and industry research communities. This paper aims to contribute to this emerging research topic by proposing DisMultiHate, a novel framework that performs the classification of multimodal hateful content. Specifically, DisMultiHate is designed to disentangle target entities in multimodal memes to improve the hateful content classification and explainability. We conduct extensive experiments on two publicly available hateful and offensive memes datasets. Our experimental results show that DisMultiHate is able to outperform state-of-the-art unimodal and multimodal baselines in the hateful meme classification task. Empirical case studies were also conducted to demonstrate DisMultiHate's ability to disentangle target entities in memes and ultimately showcase DisMultiHate's explainability of the multimodal hateful content classification task.

Abstract:
Recommending purely cold-start items is a long-standing and fundamental challenge in recommender systems. Without any historical interactions on cold-start items, the collaborative filtering (CF) scheme fails to leverage collaborative signals to infer user preference on these items. To solve this problem, extensive studies have been conducted to incorporate side information of items (e.g., content features) into the CF scheme. Specifically, they employ modern neural network techniques (e.g., dropout, consistency constraint) to discover and exploit the coalition effect of content features and collaborative representations. However, we argue that these works insufficiently explore the mutual dependencies between content features and collaborative representations and lack sufficient theoretical support, resulting in unsatisfactory performance on cold-start recommendation.

Abstract:
Compression standards have been used to reduce the cost of image storage and transmission for decades. In recent years, learned image compression methods have been proposed and have achieved compelling performance compared with the traditional standards. However, in these methods, a set of different networks is used for various compression rates, resulting in a high cost in model storage and training. Although some variable-rate approaches have been proposed to reduce the cost by using a single network, most of them suffer some performance degradation when applying fine rate control. To enable variable-rate control without sacrificing performance, we propose an efficient Interpolation Variable-Rate (IVR) network, by introducing a handy Interpolation Channel Attention (InterpCA) module in the compression network. With the use of two hyperparameters for rate control and linear interpolation, the InterpCA achieves a fine PSNR interval of 0.001 dB and a fine rate interval of 0.0001 Bits-Per-Pixel (BPP) with 9000 rates in the IVR network. Experimental results demonstrate that the IVR network is the first variable-rate learned method that outperforms VTM 9.0 (intra) in PSNR and Multiscale Structural Similarity (MS-SSIM).

Abstract:
This tutorial provides an in-depth understanding of the art and science behind decision-making in a multimedia classifier. A multimedia classifier typically takes image, text, waveform, ordinal number or categorical data, or a combination of them, as input and produces a single output indicating the class of the input pattern. Such AI/ML systems are extensively used as decision-making elements in autonomous systems. The yardstick used by human experts to make a decision on the same input pattern often differs from that of the system, even when both produce the same output. In some cases, the outputs differ for the same input data, which raises a question about the reliability of the model. If such models are used in critical applications, which is often the case in an autonomous system, adequate mitigations have to be taken to minimize the impact of misjudgment. This calls for opening up the decision-making process in black-box classifiers. Unwinding the black box is the need of the hour for regulatory bodies as well. The EU has already made it mandatory to provide the details of the decision-making mechanism if it involves some form of AI/ML component. This tutorial sheds light on the decision-making process in a classifier that may be used for a variety of applications. More than one technique for getting a glimpse of the classifier in action will be discussed. The explanation can come in the form of a heatmap indicating the relevant features influencing the decision-making process, patterns learnt by the neurons, or a textual description of the attributes of the input. The bottom line is that these explanations must be consistent. The mechanism for achieving a coherent explanation will be detailed in this tutorial.

Abstract:
Recent deep learning methods rely on a large amount of labeled data to achieve high performance. These methods may be impractical in some scenarios, where manual data annotation is costly or the samples of certain categories are scarce (e.g., tumor lesions, endangered animals and rare individual activities). When only limited annotated samples are available, these methods usually suffer severely from overfitting, which degrades the performance significantly. In contrast, humans can recognize the objects in images rapidly and correctly with their prior knowledge after being exposed to only a few annotated samples. To simulate the learning schema of humans and relieve the reliance on large-scale annotated benchmarks, researchers have started shifting towards the few-shot learning problem: they try to learn a model that correctly recognizes novel categories with only a few annotated samples.

Abstract:
Food and cooking analysis present exciting research and application challenges for modern AI systems, particularly in the context of multimodal data such as images or video. A meal that appears in a food image is a product of a complex progression of cooking stages, often described in the accompanying textual recipe form. In the cooking process, individual ingredients change their physical properties, become combined with other food components, all to produce a final, yet highly variable, appearance of the meal. Recognizing food items or meals on a plate from images or videos, their physical properties such as the amount, nutritional content such as the caloric value, food attributes such as the flavor, elucidating the cooking process behind it, or creating robotic assistants that help users complete that cooking process, is of essential scientific and technological value yet technically extremely challenging. The 3rd AIxFood workshop was held as a half-day workshop in conjunction with the 29th ACM International Conference on Multimedia (ACM MM 2021), in Chengdu, China and virtually.

Abstract:
Typical image composition harmonizes regions from different images into a single plausible image. We extend the idea of image composition by introducing content-style decomposition and combination to form the concept of image re-composition. In other words, our image re-composition can arbitrarily combine the contents and styles decomposed from different images to generate more diverse images in a unified framework. In the decomposition stage, we incorporate whitening normalization to obtain a more thorough content-style decoupling, which substantially improves the re-composition results. Moreover, to handle the variation of structure and texture of different objects in an image, we design the network to support regional feature representation and achieve region-aware content-style decomposition. Regarding the composition stage, we propose a cycle consistency loss to constrain the network to preserve the content and style information during the composition. Our method can produce diverse re-composition results, including content-content, content-style and style-style. Our experimental results demonstrate a large improvement over the current state-of-the-art methods.

Abstract:
Multi-view Multi-human association and tracking (MvMHAT) aims to track a group of people over time in each view, as well as to identify the same person across different views at the same time. This is a relatively new problem but is very important for multi-person scene video surveillance. Different from previous multiple object tracking (MOT) and multi-target multi-camera tracking (MTMCT) tasks, which only consider the over-time human association, MvMHAT requires jointly achieving both cross-view and over-time data association. In this paper, we model this problem with a self-supervised learning framework and leverage an end-to-end network to tackle it. Specifically, we propose a spatial-temporal association network with two designed self-supervised learning losses, a symmetric-similarity loss and a transitive-similarity loss, applied at each time to associate the multiple humans over time and across views. Besides, to promote the research on MvMHAT, we build a new large-scale benchmark for the training and testing of different algorithms. Extensive experiments on the proposed benchmark verify the effectiveness of our method. We have released the benchmark and code to the public.
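The abstract does not define the two losses, so the following is only a sketch of one plausible reading: symmetry means the A-to-B matching should be the transpose of the B-to-A matching, and transitivity means chaining A-to-B with B-to-C should agree with the direct A-to-C matching. The function names, the mean-squared-gap measure, and the toy matching matrices are all assumptions made for illustration, not the authors' formulation.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def mean_sq_gap(a, b):
    """Mean squared element-wise difference between two same-shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def symmetric_gap(s_ab, s_ba):
    """Hypothetical symmetry penalty: S_ab should equal the transpose of S_ba."""
    return mean_sq_gap(s_ab, transpose(s_ba))

def transitive_gap(s_ab, s_bc, s_ac):
    """Hypothetical transitivity penalty: S_ab @ S_bc should equal S_ac."""
    return mean_sq_gap(matmul(s_ab, s_bc), s_ac)

# Toy example: hard matchings of three people across views A, B and C.
s_ab = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
s_bc = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
s_ac = matmul(s_ab, s_bc)  # consistent by construction

consistent = transitive_gap(s_ab, s_bc, s_ac)
inconsistent = transitive_gap(s_ab, s_bc, s_ab)
```

Both penalties vanish for mutually consistent matchings and grow when they disagree, which is why constraints of this kind can supervise association without identity labels.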

Abstract:
Most of the existing works in supervised spatio-temporal video super-resolution (STVSR) heavily rely on a large-scale external dataset consisting of paired low-resolution low-frame-rate (LR-LFR) and high-resolution high-frame-rate (HR-HFR) videos. Despite their remarkable performance, these methods make a prior assumption that the low-resolution video is obtained by down-scaling the high-resolution video using a known degradation kernel, which does not hold in practical settings. Another problem with these methods is that they cannot exploit instance-specific internal information of a video at testing time. Recently, deep internal learning approaches have gained attention due to their ability to utilize the instance-specific statistics of a video. However, these methods have a large inference time as they require thousands of gradient updates to learn the intrinsic structure of the data. In this work, we present Adaptive Video Super-Resolution (Ada-VSR), which leverages external, as well as internal, information through meta-transfer learning and internal learning, respectively. Specifically, meta-learning is employed to obtain adaptive parameters, using a large-scale external dataset, that can adapt quickly to the novel condition (degradation model) of the given test video during the internal learning task, thereby exploiting external and internal information of a video for super-resolution. The model trained using our approach can quickly adapt to a specific video condition with only a few gradient updates, which reduces the inference time significantly. Extensive experiments on standard datasets demonstrate that our method performs favorably against various state-of-the-art approaches.

Abstract:
Recent advances in unsupervised domain adaptation have achieved remarkable performance on semantic segmentation tasks. Despite such progress, existing works mainly focus on bridging the inter-domain gaps between the source and target domains, while only a few of them have noticed the intra-domain gaps within the target data. In this work, we propose a pixel-level intra-domain adaptation approach to reduce the intra-domain gaps within the target data. Compared with image-level methods, ours treats each pixel as an instance, which adapts the segmentation model at a more fine-grained level. Specifically, we first conduct the inter-domain adaptation between the source and target domains; then, we separate the pixels in target images into the easy and hard subdomains; finally, we propose a pixel-level adversarial training strategy to adapt a segmentation network from the easy to the hard subdomain. Moreover, we show that the segmentation accuracy can be further improved by incorporating a continuous indexing technique in the adversarial training. Experimental results show the effectiveness of our method against existing state-of-the-art approaches.

Abstract:
Multi-person action forecasting is an emerging task and a pivotal step towards video understanding. The major challenge lies in estimating a distribution characterizing the upcoming actions of all individuals in the scene. The state-of-the-art solutions attempt to solve this problem via a step-by-step prediction procedure. However, they are not adequate to address some particular limitations, such as the compounding errors, the innate uncertainty of the future and the spatio-temporal contexts. To handle the multi-person action forecasting challenges, we put forth a novel imitative learning framework upon the basis of inverse reinforcement learning. Specifically, we aim to learn a policy to model the aforementioned distribution up to a coming horizon through an objective that naturally solves the compounding errors. Such a policy is able to explore multiple plausible futures via extrapolating a series of latent variables and taking them into account to generate predictions. The impacts of these latent variables are further investigated by optimizing the directed information. Moreover, we reason the spatial context along with the temporal cue in a single pass with the usage of graph structural data. The experimental outcomes on two large-scale datasets reveal that our approach yields considerable improvements in terms of both diversity and quality with respect to recent leading studies.

Abstract:
Multi-label image recognition aims to recognize multiple objects simultaneously in one image. Recent ideas to solve this problem have focused on learning dependencies of label co-occurrences to enhance the high-level semantic representations. However, these methods usually neglect the important relations of intrinsic visual structures and face difficulties in understanding contextual relationships. To build the global scope of visual context as well as interactions between the visual modality and the linguistic modality, we propose the Multi-Modal Multi-label recognition TRansformers (M3TR) with ternary relationship learning for inter- and intra-modalities. For the intra-modal relationship, we make an insightful conjunction of CNNs and Transformers, which embeds visual structures into the high-level features by learning the semantic cross-attention. For constructing the interactions between the visual and linguistic modalities, we propose a linguistic cross-attention to embed the class-wise linguistic information into the visual structure learning, and finally present a linguistic guided enhancement module to enhance the representation of high-level semantics. Experimental evidence reveals that with the collaborative learning of the ternary relationship, our proposed M3TR achieves new state-of-the-art results on two public multi-label recognition benchmarks.

Abstract:
It is generally known that a high-resolution (HR) image contains more productive information compared with its low-resolution (LR) versions, so image super-resolution (SR) satisfies an information-growth process. Considering the property, we attempt to exploit the growing information via a particular attention mechanism. In this paper, we propose a concise but effective Information-Growth Attention Network (IGAN) that shows the incremental information is beneficial for SR. Specifically, a novel information-growth attention is proposed. It aims to pay attention to features involving large information-growth capacity by assimilating the difference from current features to the former features within a network. We also illustrate its effectiveness contrasted by widely-used self-attention using entropy and generalization analysis. Furthermore, existing channel-wise attention generation modules (CAGMs) have large informational attenuation due to directly calculating global mean for feature maps. Therefore, we present an innovative CAGM that progressively decreases feature maps' sizes, leading to more adequate feature exploitation. Extensive experiments also demonstrate IGAN outperforms state-of-the-art attention-aware SR approaches.

Abstract:
This paper investigates an open research task of text-to-image synthesis for automatically generating or manipulating images from text descriptions. Prevailing methods mainly take the textual descriptions as the conditional input for the GAN generation, and need to train different models for the text-guided image generation and manipulation tasks. In this paper, we propose a novel unified framework of Cycle-consistent Inverse GAN (CI-GAN) for both text-to-image generation and text-guided image manipulation tasks. Specifically, we first train a GAN model without text input, aiming to generate images with high diversity and quality. Then we learn a GAN inversion model to convert the images back to the GAN latent space and obtain the inverted latent codes for each image, where we introduce the cycle-consistency training to learn more robust and consistent inverted latent codes. We further uncover the semantics of the latent space of the trained GAN model, by learning a similarity model between text representations and the latent codes. In the text-guided optimization module, we can generate images with the desired semantic attributes through optimization on the inverted latent codes. Extensive experiments on the Recipe1M and CUB datasets validate the efficacy of our proposed framework.

Abstract:
Face swapping aims to synthesize a face image, in which the facial identity is well transplanted from the source image and the context (e.g., hairstyle, head posture, facial expression, lighting, and background) keeps consistent with the reference image. The prior work mainly accomplishes the task in two stages, i.e., generating the inner face with the source identity, and then stitching the generation with the complementary part of the reference image by image blending techniques. The blending mask, which is usually obtained by the additional face segmentation model, is a common practice towards photo-realistic face swapping. However, artifacts usually appear at the blending boundary, especially in areas occluded by the hair, eyeglasses, accessories, etc. To address this problem, rather than struggling with the blending mask in the two-stage routine, we develop a novel one-stage context and identity hallucination network, which learns a series of hallucination maps to softly divide the context areas and identity areas. For context areas, the features are fully utilized by a multi-level context encoder. For identity areas, we design a novel two-cascading AdaIN to transfer the identity while retaining the context. Besides, with the help of hallucination maps, we introduce an effectively improved reconstruction loss to utilize unlimited unpaired face images for training. Our network performs well on both context areas and identity areas without any dependency on post-processing. Extensive qualitative and quantitative experiments demonstrate the superiority of our network.
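For readers unfamiliar with AdaIN, the sketch below shows the standard adaptive instance normalization operation on which the abstract's "two-cascading AdaIN" builds: each channel of one feature map is re-normalized to the mean and standard deviation of the matching channel of another. It illustrates the generic operation only; the cascading design is not described in the abstract, and the toy features and helper names here are assumptions.

```python
import math

def channel_stats(channel, eps=1e-5):
    """Return the mean and (eps-stabilized) standard deviation of one flattened channel."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return mean, math.sqrt(var + eps)

def adain(content, style, eps=1e-5):
    """Standard AdaIN: shift and scale each content channel to the style channel's statistics."""
    out = []
    for c_ch, s_ch in zip(content, style):
        c_mean, c_std = channel_stats(c_ch, eps)
        s_mean, s_std = channel_stats(s_ch, eps)
        out.append([s_std * (v - c_mean) / c_std + s_mean for v in c_ch])
    return out

# Toy features with two channels of four values each (hypothetical numbers).
context_feat = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
identity_feat = [[10.0, 12.0, 14.0, 16.0], [5.0, 5.0, 5.0, 9.0]]
stylized = adain(context_feat, identity_feat)
```

The output keeps the spatial pattern of the first input while adopting the per-channel statistics of the second, which is the sense in which AdaIN can transfer one property while retaining another.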

Abstract:
With the rapid development of social media, a tremendous number of videos with new classes are generated daily, which raises an urgent demand for video classification methods that can continuously update new classes while maintaining the knowledge of old videos with limited storage and computing resources. In this paper, we summarize this task as Class-Incremental Video Classification (CIVC) and propose a novel framework to address it. As a subarea of incremental learning tasks, the challenge of catastrophic forgetting is unavoidable in CIVC. To better alleviate it, we utilize some characteristics of videos. First, we decompose the spatio-temporal knowledge before distillation rather than treating it as a whole in the knowledge transfer process; trajectory is also used to refine the decomposition. Second, we propose a dual granularity exemplar selection method to select and store representative video instances of old classes and key-frames inside videos under a tight storage budget. We benchmark our method and previous SOTA class-incremental learning methods on Something-Something V2 and Kinetics datasets, and our method outperforms previous methods significantly.

Abstract:
Cross-view video synthesis task seeks to generate video sequences of one view from another dramatically different view. In this paper, we investigate the exocentric (third-person) view to egocentric (first-person) view video generation task. This is challenging because egocentric view sometimes is remarkably different from the exocentric view. Thus, transforming the appearances across the two different views is a non-trivial task. Particularly, we propose a novel Bi-directional Spatial Temporal Attention Fusion Generative Adversarial Network (STA-GAN) to learn both spatial and temporal information to generate egocentric video sequences from the exocentric view. The proposed STA-GAN consists of three parts: temporal branch, spatial branch, and attention fusion. First, the temporal and spatial branches generate a sequence of fake frames and their corresponding features. The fake frames are generated in both downstream and upstream directions for both temporal and spatial branches. Next, the generated four different fake frames and their corresponding features (spatial and temporal branches in two directions) are fed into a novel multi-generation attention fusion module to produce the final video sequence. Meanwhile, we also propose a novel temporal and spatial dual-discriminator for more robust network optimization. Extensive experiments on the Side2Ego and Top2Ego datasets show that the proposed STA-GAN significantly outperforms the existing methods.

Abstract:
With the fast proliferation of online video sites and social media platforms, user, professionally and occupationally generated content (UGC, PGC, OGC) videos are streamed and explosively shared over the Internet. Consequently, it is urgent to monitor the content quality of these Internet videos to guarantee the user experience. However, most existing modern video quality assessment (VQA) databases only include UGC videos and cannot meet the demands for other kinds of Internet videos with real-world distortions. To this end, we collect 1,072 videos from Youku, a leading Chinese video hosting service platform, to establish the Internet video quality assessment database (Youku-V1K). A special sampling method based on several quality indicators is adopted to maximize the content and distortion diversities within a limited database, and a probabilistic graphical model is applied to recover reliable labels from noisy crowdsourcing annotations. Based on the properties of Internet videos originated from Youku, we propose a spatio-temporal distortion-aware model (STDAM). First, the model works blindly which means the pristine video is unnecessary. Second, the model is familiar with diverse contents by pre-training on the large-scale image quality assessment databases. Third, to measure spatial and temporal distortions, we introduce the graph convolution and attention module to extract and enhance the features of the input video. Besides, we leverage the motion information and integrate the frame-level features into video-level features via a bi-directional long short-term memory network. Experimental results on the self-built database and the public VQA databases demonstrate that our model outperforms the state-of-the-art methods and exhibits promising generalization ability.

Abstract:
Directly deploying a trained multi-modal classifier to a new environment usually leads to poor performance due to the well-known domain shift problem. Existing multi-modal domain adaptation methods treat each modality equally and optimize the sub-models of different modalities synchronously. However, as observed in this paper, the degrees of domain shift in different modalities are usually diverse. We propose a novel Differentiated Learning framework to make use of the diversity between multiple modalities for more effective domain adaptation. Specifically, we model the classifiers of different modalities as a group of teacher/student sub-models, and a novel Prototype based Reliability Measurement is presented to estimate the reliability of the recognition results made by each sub-model on the target domain. More reliable results are then picked as teaching materials for all sub-models in the group. Considering the diversity of different modalities, each sub-model performs Asynchronous Curriculum Learning by choosing the teaching materials from easy to hard, as measured by itself. Furthermore, a reliability-aware fusion scheme is proposed to combine all optimized sub-models to support the final decision. Comprehensive experiments on three multi-modal datasets with different learning tasks show the superior performance of our model compared with state-of-the-art multi-modal domain adaptation models.

Abstract:
In the artwork Syntropic Counterpoints: Metaphysics of The Machines, we explore phenomena of AI aesthetics and challenge machine abstraction. Our approach toward the liberation of machine creativity is through the use of words and grammar as a creative tool humans developed to express worlds "beyond" the world, existing and non-existing realities. We are led by Nietzsche's claim that grammar is the "Metaphysics of the People"; as such, the grammar, content, and vision generated during the philosophical discussion between our AI clones is the "Metaphysics of Machines", through which we can experience their realities and start to question our own.

Abstract:
This paper proposes a dynamic facial expression recognition transformer (Former-DFER) for the in-the-wild scenario. Specifically, the proposed Former-DFER mainly consists of a convolutional spatial transformer (CS-Former) and a temporal transformer (T-Former). The CS-Former consists of five convolution blocks and N spatial encoders, and is designed to guide the network to learn occlusion- and pose-robust facial features from the spatial perspective. The temporal transformer consists of M temporal encoders, and is designed to allow the network to learn contextual facial features from the temporal perspective. The heatmaps of the learned facial features demonstrate that the proposed Former-DFER is capable of handling issues such as occlusion, non-frontal pose, and head motion. The visualization of the feature distribution shows that the proposed method can learn more discriminative facial features. Moreover, our Former-DFER also achieves state-of-the-art results on the DFEW and AFEW benchmarks.

Abstract:
Domain generalization aims to learn a model that generalizes to unseen target domains from multiple source domains. Various approaches have been proposed to address this problem by adversarial learning, meta-learning, and data augmentation. However, those methods have no guarantee for target domain generalization. Motivated by an observation that the class-irrelevant information of a sample in the form of semantic variation would lead to negative transfer, we propose to linearly disentangle the variation out of the sample in feature space and impose a novel class decorrelation regularization on the feature variation. By doing so, the model would focus on the high-level categorical concept for model prediction while ignoring the misleading clues from other variations (including domain changes). As a result, we achieve state-of-the-art performance on all the widely used domain generalization benchmarks, namely PACS, VLCS, Office-Home, and Digits-DG, with large margins. Further analysis reveals that our method could learn a better domain-invariant representation, and that the decorrelated feature variation could successfully capture semantic meaning.

Abstract:
With the growing number of videos on video sharing platforms, how to facilitate the searching and browsing of user-generated video has attracted intense attention from the multimedia community. To help people efficiently search and browse relevant videos, summaries of videos become important. Prior works in multimodal video summarization mainly explore visual and ASR tokens as two separate sources and struggle to fuse the multimodal information for generating the summaries. However, the time information inside videos is commonly ignored. In this paper, we find that it is important to leverage the timestamps to accurately incorporate multimodal signals for the task. We propose a Time-Aware Multimodal Transformer (TAMT) with a novel short-term order-sensitive attention mechanism. The attention mechanism can attend to the inputs differently based on time difference to explore the time information inherent in the video more thoroughly. As such, TAMT can fuse the different modalities better for summarizing the videos. Experiments show that our proposed approach is effective and achieves state-of-the-art performance on both the YouCookII and open-domain How2 datasets.

Abstract:
Skeleton-based action recognition has attracted great interest due to the low cost of skeleton data acquisition and high robustness to external conditions. A challenging problem of skeleton-based action recognition is the large intra-class gap caused by the various viewpoints of skeleton data, which makes action modeling difficult for the network. To alleviate this problem, a feasible solution is to utilize label-supervised methods to learn a view-normalization model. However, since the skeleton data in real scenes is acquired from diverse viewpoints, it is difficult to obtain the corresponding view-normalized skeleton as a label. Therefore, how to learn a view-normalization model without the supervised label is the key to solving the view-variance problem. To this end, we propose a view normalization-based action recognition framework, which is composed of a view-normalization generative adversarial network (VN-GAN) and a classification network. The VN-GAN is designed to learn the mapping from the diverse-view distribution to the normalized-view distribution. In detail, it is implemented by graph convolution, where the generator predicts the transformation angles for view normalization and the discriminator distinguishes the real input samples from the generated ones. In the classification network, view-normalized data is processed to predict the action class. Without the interference of view variances, the classification network can extract more discriminative action features. Furthermore, by combining the joint and bone modalities, the proposed method reaches state-of-the-art performance on the NTU RGB+D and NTU-120 RGB+D datasets. In particular, on NTU-120 RGB+D, the accuracy is improved by 3.2% and 2.3% under the cross-subject and cross-set criteria, respectively.
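To make the phrase "transformation angles for view normalization" concrete, the sketch below applies three rotation angles to a set of 3D joints. It shows only the geometric step; how the generator predicts the angles is not covered. The axis convention, the composition order (x, then y, then z), and the toy skeleton are assumptions for illustration rather than details taken from the abstract.

```python
import math

def matmul3(a, b):
    """Product of two 3x3 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]

def rotation_matrix(alpha, beta, gamma):
    """Rotation about x by alpha, then y by beta, then z by gamma (radians)."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    rx = [[1.0, 0.0, 0.0], [0.0, ca, -sa], [0.0, sa, ca]]
    ry = [[cb, 0.0, sb], [0.0, 1.0, 0.0], [-sb, 0.0, cb]]
    rz = [[cg, -sg, 0.0], [sg, cg, 0.0], [0.0, 0.0, 1.0]]
    return matmul3(rz, matmul3(ry, rx))

def rotate_skeleton(joints, alpha, beta, gamma):
    """Apply the same rotation to every (x, y, z) joint of one skeleton frame."""
    r = rotation_matrix(alpha, beta, gamma)
    return [[sum(r[i][k] * joint[k] for k in range(3)) for i in range(3)] for joint in joints]

# Toy skeleton with three joints; rotate a quarter turn about the z axis.
skeleton = [[1.0, 0.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
rotated = rotate_skeleton(skeleton, 0.0, 0.0, math.pi / 2)
```

Because a rotation is rigid, bone lengths are unchanged; only the viewpoint from which the pose is observed differs.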

Abstract:
We study the problem of weakly supervised grounded image captioning. That is, given an image, the goal is to automatically generate a sentence describing the context of the image with each noun grounded to the corresponding region in the image. This task is challenging due to the lack of explicit fine-grained region-word alignments as supervision. Previous weakly supervised methods mainly explore various kinds of regularization schemes to improve attention accuracy. However, their performance is still far from that of fully supervised ones. One main issue that has been ignored is that the attention for generating visually groundable words may only focus on the most discriminative parts and cannot cover the whole object. To this end, we propose a simple yet effective method to alleviate this issue, termed the partial grounding problem in our paper. Specifically, we design a distributed attention mechanism to force the network to aggregate information from multiple spatially different regions with consistent semantics while generating the words. Therefore, the union of the focused region proposals should form a visual region that encloses the object of interest completely. Extensive experiments have demonstrated the superiority of our proposed method compared with the state of the art.

Abstract:
Along with current multi-scale based detectors, Feature Aggregation and Enhancement (FAE) modules have shown superior performance gains for cutting-edge object detection. However, these hand-crafted FAE modules show inconsistent improvements on face detection, which is mainly due to the significant distribution difference between the corpus they were designed on and the one they are applied to, i.e., COCO vs. WIDER Face. To tackle this problem, we analyse the effect of data distribution, and consequently propose to search for an effective FAE architecture, termed AutoFAE, by a differentiable architecture search, which outperforms all existing FAE modules in face detection by a considerable margin. Upon the found AutoFAE and existing backbones, a supernet is further built and trained, which automatically obtains a family of detectors under different complexity constraints. Extensive experiments conducted on popular benchmarks, i.e., WIDER Face and FDDB, demonstrate the state-of-the-art performance-efficiency trade-off for the proposed automatic and scalable face detector (ASFD) family. In particular, our strong ASFD-D6 outperforms the best competitor with AP 96.7/96.2/92.1 on the WIDER Face test set, and the lightweight ASFD-D0 costs about 3.1 ms, i.e., more than 320 FPS, on the V100 GPU with VGA-resolution images.
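The latency-to-throughput claim for ASFD-D0 can be checked with one line of arithmetic. The helper below is a hypothetical name, and the conversion assumes images are processed one at a time with no batching or pipelining.

```python
def fps_from_latency_ms(latency_ms):
    """Frames per second implied by a per-image latency in milliseconds."""
    return 1000.0 / latency_ms

fps = fps_from_latency_ms(3.1)
print(f"{fps:.1f} FPS")  # prints "322.6 FPS"
```

A latency of 3.1 ms therefore corresponds to roughly 322 frames per second, consistent with the stated "more than 320 FPS".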

Abstract:
Text-to-Face synthesis with multiple captions is still an important yet less addressed problem because of the lack of effective algorithms and large-scale datasets. We accordingly propose a Semantic Embedding and Attention (SEA-T2F) network that allows multiple captions as input to generate highly semantically related face images. With a novel Sentence Features Injection Module, SEA-T2F can integrate any number of captions into the network. In addition, an attention mechanism named Attention for Multiple Captions is proposed to fuse multiple word features and synthesize fine-grained details. Considering text-to-face generation is an ill-posed problem, we also introduce an attribute loss to guide the network to generate sentence-related attributes. Existing datasets for text-to-face are either too small or roughly generated according to attribute labels, which is not enough to train deep learning based methods to synthesize natural face images. Therefore, we build a large-scale dataset named CelebAText-HQ, in which each image is manually annotated with 10 captions. Extensive experiments demonstrate the effectiveness of our algorithm.

Abstract:
Multi-modal machine learning has been a prominent multi-disciplinary research area since its success in complex real-world problems. Empirically, multi-branch fusion models tend to generate better results when there is a high diversity among the branches of the model. However, such experience alone neither guarantees the fusion model's best performance nor has sufficient theoretical support. We present a theoretical estimate of the fusion models' performance by measuring each branch model's performance and the distance between branches, based on the analysis of several of the most popular fusion methods. The theorem is validated empirically by numerical experiments. We further present a branch model selection framework that uses the theorem to identify the candidate branches for fusion models to achieve the optimal multi-modal performance. The framework's effectiveness is demonstrated on various datasets by showing how effectively it selects the combination of branch models that attains superior performance.

Abstract:
In this paper, we address multi-modal pretraining of product data in the field of E-commerce. Current multi-modal pretraining methods proposed for image and text modalities lack robustness in the face of modality-missing and modality-noise, which are two pervasive problems of multi-modal product data in real E-commerce scenarios. To this end, we propose a novel method, K3M, which introduces the knowledge modality in multi-modal pretraining to correct the noise and supplement the missing information of the image and text modalities. The modal-encoding layer extracts the features of each modality. The modal-interaction layer is capable of effectively modeling the interaction of multiple modalities, where an initial-interactive feature fusion model is designed to maintain the independence of the image modality and the text modality, and a structure aggregation module is designed to fuse the information of the image, text, and knowledge modalities. We pretrain K3M with three pretraining tasks, including masked object modeling (MOM), masked language modeling (MLM), and link prediction modeling (LPM). Experimental results on a real-world E-commerce dataset and a series of product-based downstream tasks demonstrate that K3M achieves significant performance improvements over the baseline and state-of-the-art methods when modality-noise or modality-missing exists.

Abstract:
We demonstrate ViDA-MAN, a digital-human agent for multi-modal interaction, which offers realtime audio-visual responses to instant speech inquiries. Compared to traditional text or voice-based systems, ViDA-MAN offers human-like interactions (e.g., vivid voice, natural facial expressions and body gestures). Given a speech request, the demonstration is able to respond with high-quality videos at sub-second latency. To deliver an immersive user experience, ViDA-MAN seamlessly integrates multi-modal techniques including Acoustic Speech Recognition (ASR), multi-turn dialog, Text To Speech (TTS), and talking-head video generation. Backed by a large knowledge base, ViDA-MAN is able to chat with users on a number of topics including chit-chat, weather, device control, news recommendations, and booking hotels, as well as answering questions via structured knowledge.

Abstract:
Convolutional neural networks (CNNs) have achieved great success in image restoration tasks such as single image denoising, demosaicing, and super-resolution. However, most existing CNN-based methods neglect the diversity of image contents and degradations in the corrupted images and treat channel-wise features equally, thus hindering the representation ability of CNNs. To address this issue, we propose deep mix-order attention networks (MAN) to extract features that capture rich feature statistics within networks. Our MAN is mainly built on simple residual blocks and our mix-order channel attention (MOCA) module, which further consists of feature gating and feature pooling blocks to capture different types of semantic information. With our MOCA, our MAN can flexibly handle various types of image contents and degradations. Besides, our MAN can be generalized to different image restoration tasks, such as image denoising, super-resolution, and demosaicing. Extensive experiments demonstrate that our method performs favorably against state-of-the-art methods in terms of quantitative and qualitative metrics.

Abstract:
With the advance of multi-media and multi-modal data, multi-view clustering (MVC) has drawn increasing attention recently. In this field, one of the most crucial challenges is that the characteristics and qualities of different views usually vary extensively. Therefore, it is essential for MVC methods to find an effective approach that handles the diversity of multiple views appropriately. To this end, a series of MVC methods focusing on how to integrate the loss from each view have been proposed in the past few years. Among these methods, the mainstream idea is to assign a weight to each view and then combine the views linearly. In this paper, inspired by the effectiveness of non-linear combination in instance learning and the auto-weighted approaches, we propose Non-Linear Fusion for Self-Paced Multi-View Clustering (NSMVC), which is totally different from the conventional linear-weighting algorithms. In NSMVC, we directly assign different exponents to different views according to their qualities. In this way, the negative impact of corrupt views can be significantly reduced. Meanwhile, to address the non-convex issue of the MVC model, we further define a novel regularizer-free modality of Self-Paced Learning (SPL), which fits the proposed non-linear model perfectly. Experimental results on various real-world data sets demonstrate the effectiveness of the proposed method.

Abstract:
Recently, deep Hamming hashing methods have been proposed for Hamming space retrieval, which enables constant-time search by hash table lookups instead of linear scan. When carrying out Hamming space retrieval, for each query datapoint, there is a Hamming ball centered on the query datapoint, and only the datapoints within the Hamming ball are returned as the relevant ones, while those beyond are discarded directly. Thus, to further enhance the retrieval performance, it is key for Hamming hashing methods to reduce the number of dissimilar datapoints within the Hamming ball. However, nearly all existing Hamming hashing methods cannot effectively penalize the dissimilar pairs within the Hamming ball to push them out. To tackle this problem, in this paper, we propose a novel Weighted Gaussian Loss based Hamming Hashing, called WGLHH, which introduces a weighted Gaussian loss to optimize the hashing model. Specifically, the weighted Gaussian loss consists of three parts: a novel Gaussian-distribution based loss, a novel badly-trained-pair attention mechanism, and a quantization loss. The Gaussian-distribution based loss is proposed to effectively penalize the dissimilar pairs within the Hamming ball. The badly-trained-pair attention mechanism is proposed to assign a weight to each data pair, which puts more weight on data pairs whose corresponding hash codes cannot preserve the original similarity well, and less on those that are already handled well. The quantization loss is used to reduce the quantization error. By incorporating the three parts, the proposed weighted Gaussian loss significantly penalizes the dissimilar pairs within the Hamming ball to generate more compact hashing codes. Extensive experiments on two benchmark datasets show that the proposed method outperforms the state-of-the-art baselines in the image retrieval task.
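
The Hamming ball lookup described in this abstract can be illustrated with a minimal Python sketch. It is not the WGLHH implementation: the function names, the integer encoding of hash codes, and the toy data are all assumptions made for illustration. The sketch probes every code within a given number of bit flips of the query, so the number of table lookups depends only on the code length and radius, not on the dataset size.

```python
from itertools import combinations


def build_hash_table(codes):
    """Map each integer hash code to the list of item ids that share it."""
    table = {}
    for item_id, code in enumerate(codes):
        table.setdefault(code, []).append(item_id)
    return table


def hamming_ball_lookup(table, query_code, num_bits, radius):
    """Return sorted ids of items whose code differs from the query in at most `radius` bits."""
    hits = []
    for r in range(radius + 1):
        # Enumerate every way of flipping exactly r of the num_bits positions.
        for positions in combinations(range(num_bits), r):
            probe = query_code
            for p in positions:
                probe ^= 1 << p  # flip bit p
            hits.extend(table.get(probe, []))
    return sorted(hits)


# Hypothetical 4-bit codes for four database items.
codes = [0b0000, 0b0001, 0b0011, 0b1111]
table = build_hash_table(codes)
within_radius_1 = hamming_ball_lookup(table, 0b0000, num_bits=4, radius=1)
within_radius_2 = hamming_ball_lookup(table, 0b0000, num_bits=4, radius=2)
```

Items outside the ball are never touched, which is why dissimilar items that fall inside the ball directly hurt precision.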

Abstract:
It has become clear that AI will profoundly transform society. AI will dramatically change the socio-technological landscape, produce seismic economic shifts, and fundamentally reshape the workforce in ways that we are only beginning to grasp. With its imminent arrival, it is critically important to deeply engage with questions around how we should design education in the Age of AI. Fortunately, while we must address the significant challenges posed by AI, we can also leverage AI itself to address these challenges. In this talk we will consider how (and at what rate) AI technologies for education will evolve, discuss emerging innovations in AI-augmented learning environments for formal and informal contexts, and explore what competencies will be elevated in an AI-pervasive workforce. We will discuss near-future AI technologies that leverage advances in natural language processing, computer vision, and machine learning to create narrative-centered learning environments, embodied conversational agents for learning, and multimodal learning analytics. We will conclude by considering what all of these developments suggest for K-12 education and the future of human learning.

Abstract:
Many multimedia developers are exploring the adoption of Deep Reinforcement Learning (DRL) techniques in their applications. However, they often find such adoption challenging. Existing DRL libraries provide poor support for prototyping DRL agents (i.e., models), customising the agents, and comparing the performance of DRL agents. As a result, developers often report low efficiency in developing DRL agents. In this paper, we introduce RLzoo, a new DRL library that aims to make the development of DRL agents efficient. RLzoo provides developers with (i) high-level yet flexible APIs for prototyping DRL agents and further customising the agents for best performance, (ii) a model zoo where users can import a wide range of DRL agents and easily compare their performance, and (iii) an algorithm that can automatically construct DRL agents with custom components (which are critical to improving an agent's performance in custom applications). Evaluation results show that RLzoo can effectively reduce the development cost of DRL agents, while achieving performance comparable to existing DRL libraries.

Abstract:
Visual object localization is the key step in a series of object detection tasks. In the literature, high localization accuracy is achieved with the mainstream strongly supervised frameworks. However, such methods require object-level annotations and are unable to detect objects of unknown categories. Weakly supervised methods face similar difficulties. In this paper, a self-paced learning framework is proposed to achieve accurate object localization on the rank list returned by instance search. The proposed framework mines the target instance gradually from the queries and their corresponding top-ranked search results. Since a common instance is shared between the query and the images in the rank list, the target visual instance can be accurately localized even without knowing what the object category is. In addition to performing localization on instance search, the issue of few-shot object detection is also addressed under the same framework. Superior performance over state-of-the-art methods is observed on both tasks.

Abstract:
Cross-modal matching has attracted growing attention due to the rapid emergence of the multimedia data on the web and social applications. Recently, many re-weighting methods have been proposed for accelerating model training by designing a mapping function from similarity scores to weights. However, these re-weighting methods are difficult to be universally applied in practice since manually pre-set weighting functions inevitably involve hyper-parameters. In this paper, we propose a Meta Self-Paced Network (Meta-SPN) that automatically learns a weighting scheme from data for cross-modal matching. Specifically, a meta self-paced network composed of a fully connected neural network is designed to fit the weight function, which takes the similarity score of the sample pairs as input and outputs the corresponding weight value. Our meta self-paced network considers not only the self-similarity scores, but also their potential interactions (e.g., relative-similarity) when learning the weights. Motivated by the success of meta-learning, we use the validation set to update the meta self-paced network during the training of the matching network. Experiments on two image-text matching benchmarks and two video-text matching benchmarks demonstrate the generalization and effectiveness of our method.

Abstract:
Multimedia content is of predominance in the modern Web era. Investigating how users interact with multimodal items is a continuing concern within the rapid development of recommender systems. The majority of previous work focuses on modeling user-item interactions with multimodal features included as side information. However, this scheme is not well-designed for multimedia recommendation. Specifically, only collaborative item-item relationships are implicitly modeled through high-order item-user-item relations. Considering that items are associated with rich contents in multiple modalities, we argue that the latent semantic item-item structures underlying these multimodal contents could be beneficial for learning better item representations and further boosting recommendation. To this end, we propose a LATent sTructure mining method for multImodal reCommEndation, which we term LATTICE for brevity. To be specific, in the proposed LATTICE model, we devise a novel modality-aware structure learning layer, which learns item-item structures for each modality and aggregates multiple modalities to obtain latent item graphs. Based on the learned latent graphs, we perform graph convolutions to explicitly inject high-order item affinities into item representations. These enriched item representations can then be plugged into existing collaborative filtering methods to make more accurate recommendations. Extensive experiments on three real-world datasets demonstrate the superiority of our method over state-of-the-art multimedia recommendation methods and validate the efficacy of mining latent item-item relationships from multimodal features.

Abstract:
Recently, outfit compatibility modeling, which aims to evaluate the compatibility of a given outfit that comprises a set of fashion items, has gained growing research attention. Although existing studies have achieved prominent progress, most of them overlook the essential global outfit representation learning, and the hidden complementary factors behind the outfit compatibility uncovering. Towards this end, we propose an Outfit Compatibility Modeling scheme via Complementary Factorization, termed as OCM-CF. In particular, OCM-CF consists of two key components: context-aware outfit representation modeling and hidden complementary factors modeling. The former works on adaptively learning the global outfit representation with graph convolutional networks and the multi-head attention mechanism, where the item context is fully explored. The latter targets at uncovering the latent complementary factors with multiple parallel networks, each of which corresponds to a factor-oriented context-aware outfit representation modeling. In this part, a new orthogonality-based complementarity regularization is proposed to encourage the learned factors to complement each other and better characterize the outfit compatibility. Finally, the outfit compatibility is obtained by summing all the hidden complementary factor-oriented outfit compatibility scores, each of which is derived from the corresponding outfit representation. Extensive experiments on two real-world datasets demonstrate the superiority of our OCM-CF over the state-of-the-art methods.

Abstract:
Sign language translation, a technology with profound social significance, has attracted growing interest from researchers in recent years. However, existing sign language translation methods need to read the entire video before starting the translation, which leads to high inference latency and limits their application in real-life scenarios. To solve this problem, we propose SimulSLT, the first end-to-end simultaneous sign language translation model, which can translate sign language videos into target text concurrently. SimulSLT is composed of a text decoder, a boundary predictor, and a masked encoder. We 1) use the wait-k strategy for simultaneous translation; 2) design a novel boundary predictor based on the integrate-and-fire module to output the gloss boundary, which is used to model the correspondence between the sign language video and the gloss; and 3) propose an innovative re-encode method to help the model obtain more abundant contextual information, which allows the existing video features to interact fully. The experimental results on the RWTH-PHOENIX-Weather 2014T dataset show that SimulSLT achieves BLEU scores that exceed the latest end-to-end non-simultaneous sign language translation model while maintaining low latency, which proves the effectiveness of our method.
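
The wait-k strategy mentioned in this abstract can be sketched as a read/write schedule. The Python sketch below shows the generic wait-k policy from simultaneous translation, not SimulSLT's code. The function name is hypothetical, and it assumes that source units (for example, gloss segments) and target tokens can simply be counted: the decoder first reads k source units, then alternates between writing one target token and reading one more source unit until the source is exhausted.

```python
def wait_k_schedule(k, src_len, tgt_len):
    """Return the list of READ/WRITE actions of a wait-k policy.

    Before writing target token number `written` (0-based), the policy
    requires min(k + written, src_len) source units to have been read.
    """
    actions = []
    read = 0
    written = 0
    while written < tgt_len:
        if read < min(k + written, src_len):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions


# Example: wait-2 over 4 source units and 4 target tokens.
schedule = wait_k_schedule(k=2, src_len=4, tgt_len=4)
```

A smaller k starts output earlier at the cost of less source context per token, which is the latency and quality trade-off the abstract refers to.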

Abstract:
In this paper, we propose an invariant learning method for facial landmark mining in a self-supervised manner. Conventional methods mostly train with raw data of paired facial appearances and landmarks, assuming that they are evenly distributed. However, such assumptions usually do not hold in real-world scenarios, so these methods tend to fail in challenging cases even after costly training. To address this issue, our model becomes invariant to facial biases by learning through the landmark-anchored distributions. Specifically, we generate faces from these distributions, then group them based on the appearance sources and the probe facial landmarks into intra-identity and intra-landmark classes, respectively. We then construct intra-class invariance losses to disentangle the spatial structures from appearances. In addition, we adopt a reconstruction loss to produce more realistic faces with probe landmarks. Extensive experimental results on four standard facial landmark datasets demonstrate that our method achieves compelling performance compared with supervised and unsupervised methods.

Abstract:
Pansharpening aims to fuse a high spatial resolution panchromatic (PAN) image and a low resolution multispectral (LR-MS) image to obtain a multispectral image with the same spatial resolution as the PAN image. Thanks to their flexible structure, convolutional neural networks (CNNs) have been successfully applied to the problem of pansharpening. However, most existing methods simply feed the up-sampled LR-MS image into the CNN and ignore the spatial distortion caused by direct up-sampling. In this paper, we propose an explicit spectral-to-spatial convolution (SSconv) that aggregates spectral features into the spatial domain to perform the up-sampling operation, which achieves better performance than direct up-sampling. Furthermore, SSconv is embedded into a multiscale U-shaped convolutional neural network (MUCNN) to fully utilize the multispectral information of the involved images. In particular, a multiscale injection branch and a mixed loss on cross-scale levels are employed to fuse pixel-wise image information. Benefiting from the distortion-free property of SSconv, the proposed MUCNN achieves state-of-the-art performance with a simple structure, on both reduced-resolution and full-resolution datasets acquired from WorldView-3 and GaoFen-2. Please find the code from the project page.

Abstract:
Image de-distortion is very important because distortions degrade image quality significantly. It can benefit many computational visual media applications that are primarily designed for high-quality images. To address this challenging issue, we propose a stacked semantically-guided network, which is the first attempt at this task. With its stacked network architecture and semantically-guided scheme, it can effectively capture and restore the distortions around humans and the adjacent background. In addition, a discriminative restoration loss function is proposed to recover different distorted regions in the images discriminatively. As another important effort, we construct a large-scale dataset for image de-distortion. Extensive qualitative and quantitative experiments show that our proposed method achieves superior performance compared with state-of-the-art approaches.

Abstract:
Sharing short personalized videos to various social media networks has become quite popular in recent years. This raises the need for digital retouching of portraits in videos. However, applying portrait image editing directly on portrait video frames cannot generate smooth and stable video sequences. To this end, we present a robust and easy-to-use parametric method to reshape the portrait in a video to produce smooth retouched results. Given an input portrait video, our method consists of two main stages: stabilized face reconstruction, and continuous video reshaping. In the first stage, we start by estimating face rigid pose transformations across video frames. Then we jointly optimize multiple frames to reconstruct an accurate face identity, followed by recovering face expressions over the entire video. In the second stage, we first reshape the reconstructed 3D face using a parametric reshaping model reflecting the weight change of the face, and then utilize the reshaped 3D face to guide the warping of video frames. We develop a novel signed distance function based dense mapping method for the warping between face contours before and after reshaping, resulting in stable warped video frames with minimum distortions. In addition, we use the 3D structure of the face to correct the dense mapping to achieve temporal consistency. We generate the final result by minimizing the background distortion through optimizing a content-aware warping mesh. Extensive experiments show that our method is able to create visually pleasing results by adjusting a simple reshaping parameter, which facilitates portrait video editing for social media and visual effects.

Abstract:
Human attributes prediction in visual media is a well-researched topic with a major focus on human faces. However, face images are often of high privacy concern as they can reveal an individual's identity. How to balance this trade-off between privacy and utility is a key problem among researchers and practitioners. In this study, we make one of the first attempts to investigate the human attributes (emotion, age, and gender) prediction under the different de-identification (eyes, lower-face, face, and head obfuscation) privacy scenarios. We first constructed the Diversity in People and Context Dataset (DPaC). We then performed a human study with eye-tracking on how humans recognize facial attributes without the presence of face and context. Results show that in an image, situational context is informative of a target's attributes. Motivated by our human study, we proposed a multi-tasking deep learning model - Context-Guided Human Attributes Prediction (CHAPNet), for human attributes prediction under privacy-preserving conditions. Extensive experiments on DPaC and three commonly used benchmark datasets demonstrate the superiority of CHAPNet in leveraging the situational context for a better interpretation of a target's attributes without the full presence of the target's face. Our research demonstrates the feasibility of visual analytics under de-identification for privacy.

Abstract:
Image captioning is a challenging task that combines computer vision and natural language processing for generating a textual description of the content within an image. Recently, Transformer-based encoder-decoder architectures have shown great success in image captioning, where the multi-head attention mechanism is utilized to capture the contextual interactions between object regions. However, such methods regard region features as a bag of tokens without considering the directional relationships between them, making it hard to understand the relative position between objects in the image and generate correct captions effectively. In this paper, we propose a novel Direction Relation Transformer, termed DRT, to improve the orientation perception between visual features by incorporating the relative direction embedding into multi-head attention. We first generate the relative direction matrix according to the positional information of the object regions, and then explore three forms of direction-aware multi-head attention to integrate the direction embedding into the Transformer architecture. We conduct experiments on the challenging Microsoft COCO image captioning benchmark. The quantitative and qualitative results demonstrate that, by integrating the relative directional relation, our proposed approach achieves significant improvements over all evaluation metrics compared with the baseline model, e.g., DRT improves the task-specific CIDEr score from 129.7% to 133.2% on the offline "Karpathy" test split.

Abstract:
Cross-modal retrieval between texts and videos is important yet challenging. Previous works in this domain typically rely on learning a common space to match the text and video, but matching is difficult due to the semantic gap between videos and texts. Although some methods employ coarse-to-fine or multi-expert networks to encode one or more common spaces for easier matching, they almost directly optimize one matching space, which is challenging because of the huge semantic gap between different modalities. To address this issue, we aim to narrow the semantic gap through a progressive learning process with a coarse-to-fine architecture, and propose a novel Progressive Semantic Matching (PSM) method. We first construct a multilevel encoding network for videos and texts, and design some auxiliary common spaces, which are mapped from the outputs of encoders at different levels. Then all the common spaces are jointly trained end to end. In this way, the model can effectively encode videos and texts into a fusion common space in a progressive paradigm. Experimental results on three video-text datasets (i.e., MSR-VTT, TGIF and MSVD) demonstrate the advantages of our PSM, which achieves significant performance improvement compared with state-of-the-art approaches.

Abstract:
The well-known collaborative filtering (CF) models typically optimize a single objective summed over all historical user-item interactions. Due to inevitable imbalances and biases in real-world data, they may develop a policy that unfairly discriminates against certain subgroups with low sample frequencies. To balance overall recommendation performance and fairness, prevalent solutions apply fairness constraints or regularizations to enforce equality of certain performance across different subgroups. However, simply enforcing equality of performance may lead to large performance degradation of those advantaged subgroups. To address this issue, we formulate a constrained Multi-Objective Optimization (MOO) problem. In contrast to the single objective, we treat the performance of each subgroup equivalently as an objective. This ensures that the imbalanced subgroup sample frequency does not affect the gradient information. We further propose fairness constraints to limit the search space to obtain more balanced solutions. To solve the constrained MOO problem, a gradient-based constrained MOO algorithm is proposed to seek a proper Pareto optimal solution for the performance trade-off. Extensive experiments on synthetic and real-world datasets show that our approach could help improve the recommendation accuracy of disadvantaged groups, while not damaging the overall performance.

Abstract:
Non-maximum suppression (NMS) is widely used in object detection pipelines for removing duplicated bounding boxes. The inconsistency between the confidence for NMS and the real localization confidence seriously affects detection performance. Prior works propose to predict Intersection-over-Union (IoU) between bounding boxes and corresponding ground-truths to improve NMS, while accurately predicting IoU is still a challenging problem. We argue that the complex definition of IoU and feature misalignment make it difficult to predict IoU accurately. In this paper, we propose a novel Decoupled IoU Regression (DIR) model to handle these problems. The proposed DIR decouples the traditional localization confidence metric IoU into two new metrics, Purity and Integrity. Purity reflects the proportion of the object area in the detected bounding box, and Integrity refers to the completeness of the detected object area. Separately predicting Purity and Integrity can divide the complex mapping between the bounding box and its IoU into two clearer mappings and model them independently. In addition, a simple but effective feature realignment approach is also introduced to make the IoU regressor work in a hindsight manner, which can make the target mapping more stable. The proposed DIR can be conveniently integrated with existing two-stage detectors and significantly improve their performance. Through a simple implementation of DIR with HTC, we obtain 51.3% AP on MS COCO benchmark, which outperforms previous methods and achieves state-of-the-art.
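
The decoupling described in this abstract can be checked numerically. The Python sketch below is not the DIR model, and its function names are hypothetical. It reads the abstract's definitions as Purity = |B ∩ G| / |B| and Integrity = |B ∩ G| / |G| for a detected box B and ground-truth box G, which is an assumption. Under that reading, IoU can be recovered from the two quantities as 1 / (1/Purity + 1/Integrity - 1).

```python
def area(box):
    """Area of an axis-aligned box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection_area(a, b):
    """Area of the overlap between two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(overlap)


def purity_integrity_iou(det, gt):
    """Return (purity, integrity, iou) under the assumed definitions."""
    inter = intersection_area(det, gt)
    purity = inter / area(det)        # share of the detected box covered by the object
    integrity = inter / area(gt)      # share of the object covered by the detected box
    iou = inter / (area(det) + area(gt) - inter)
    return purity, integrity, iou


def iou_from_parts(purity, integrity):
    """Recombine Purity and Integrity into IoU; zero overlap gives IoU 0."""
    if purity == 0.0 or integrity == 0.0:
        return 0.0
    return 1.0 / (1.0 / purity + 1.0 / integrity - 1.0)


# Two 2x2 boxes overlapping in a 1x1 square.
purity, integrity, iou = purity_integrity_iou((0, 0, 2, 2), (1, 1, 3, 3))
recombined = iou_from_parts(purity, integrity)
```

Each ratio depends on only one box area in its denominator, which is one way to see why the two mappings may be simpler to regress than IoU itself.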

Abstract:
AI music composition is one of the most attractive and important topics in artificial intelligence, music, and multimedia. The typical tasks in AI music composition include melody generation, song writing, accompaniment generation, arrangement, performance generation, timbre rendering, sound generation, and singing voice synthesis, which cover different modalities (e.g., symbolic music score, sound) and match the theme of ACM Multimedia well. With the rapid development of artificial intelligence techniques such as content creation and deep learning, AI-based music composition has achieved rapid progress, but still faces many challenges. A thorough introduction to and review of the basics, the research progress, and how to address the challenges in AI music composition is timely and necessary for a broad audience working on artificial intelligence, music, and multimedia. In this tutorial, we will first introduce the background of AI music composition, including music basics and deep learning techniques for music composition. Then we will introduce AI music composition from two perspectives: 1) key components, which include music score generation, music performance generation, and music sound generation; 2) advanced topics, which include music structure/form/style/emotion modeling, timbre synthesis/transfer/mixing, etc. Finally, we will point out some research challenges and future directions in AI music composition. This tutorial can serve both academic researchers and industry practitioners working on AI music composition.

Abstract:
In this paper, we will introduce the recent progress in deep learning based visual data compression, including image compression, video compression and point cloud compression. In the past few years, deep learning techniques have been successfully applied to various computer vision and image processing applications. However, for the data compression task, the traditional approaches (e.g., block-based motion estimation and motion compensation) are still widely employed in the mainstream codecs. Considering the powerful representation capability of neural networks, it is feasible to improve the data compression performance by employing advanced deep learning technologies. To this end, deep learning based compression approaches have recently received increasing attention from both academia and industry in the fields of computer vision and signal processing.

Abstract:
Clustering is the task of instance grouping so that similar ones are grouped into the same cluster, while dissimilar ones are in different clusters. However, such similarity is a local concept with regard to different clusters and their relevant feature space. This work aims to discover clusters by exploring feature association and instance similarity concurrently. We propose a deep clustering framework that can localize the search for relevant features appertaining to different clusters. In turn, this allows for measuring instance similarity that exists in multiple, possibly overlapping, feature subsets, which contributes to more accurate clustering of instances. Additionally, the relevant features of each cluster lend interpretability to the clustering results. Experiments on text and image datasets show that our method outperforms existing state-of-the-art baselines.

Abstract:
Currently, video semantic segmentation mainly faces two challenges: 1) the demand of temporal consistency; 2) the balance between segmentation accuracy and inference efficiency. For the first challenge, existing methods usually use optical flow to capture the temporal relation in consecutive frames and maintain the temporal consistency, but the low inference speed by means of optical flow limits the real-time applications. For the second challenge, flow based key frame warping is one mainstream solution. However, the unbalanced inference latency of flow-based key frame warping makes it unsatisfactory for real-time applications. Considering the segmentation accuracy and inference efficiency, we propose a novel Sparse Temporal Transformer (STT) to bridge temporal relation among video frames adaptively, which is also equipped with query selection and key selection. The key selection and query selection strategies are separately applied to filter out temporal and spatial redundancy in our temporal transformer. Specifically, our STT can reduce the time complexity of temporal transformer by a large margin without harming the segmentation accuracy and temporal consistency. Experiments on two benchmark datasets, Cityscapes and Camvid, demonstrate that our method achieves the state-of-the-art segmentation accuracy and temporal consistency with comparable inference speed.

Abstract:
Multi-task pixel perception is one of the most important topics in the field of machine intelligence. Inspired by the observation of cross-task interdependencies of visual patterns, we propose a multi-task vision pattern transformation (VPT) method to adaptively correlate and transfer cross-task visual patterns by leveraging the powerful transformer mechanism. To better transfer visual patterns, specifically, we build two types of pattern transformation based on the statistical prior that the affinity relations across tasks are correlated. One aims to transfer feature patterns for the integration of different task features; the other aims to exchange structure patterns for mining and leveraging the latent interaction cues. These two types of transformations are encapsulated into two VPT units, which provide universal matching interfaces for multi-task learning, complement each other to guide the transmission of feature/structure patterns, and finally realize an adaptive selection of important patterns across tasks. Extensive experiments on the joint learning of semantic segmentation, depth prediction and surface normal estimation demonstrate that our proposed method is more effective than the baselines and achieves state-of-the-art performance in three pixel-level visual tasks.

Abstract:
Given a set of multiple view videos, which records the motion trajectory of an object, we propose to find out the objects' kinematic formulas with neural rendering techniques. For example, if the input multiple view videos record the free fall motion of an object with different initial speed v, the network aims to learn its kinematics: $\Delta = vt - \frac{1}{2}gt^2$, where Δ, g and t are displacement, gravitational acceleration and time. To achieve this goal, we design a novel framework consisting of a motion network and a differentiable renderer. For the differentiable renderer, we employ Neural Radiance Field (NeRF) since the geometry is implicitly modeled by querying coordinates in the space. The motion network is composed of a series of blending functions and linear weights, enabling us to analytically derive the kinematic formulas after training. The proposed framework is trained end to end and only requires knowledge of cameras' intrinsic and extrinsic parameters. To validate the proposed framework, we design three experiments to demonstrate its effectiveness and extensibility. The first experiment is the video of free fall and the framework can be easily combined with the principle of parsimony, resulting in the correct free fall kinematics. The second experiment is on the large angle pendulum which does not have analytical kinematics. We use the differential equation controlling pendulum dynamics as a physical prior in the framework and demonstrate that the convergence speed becomes much faster. Finally, we study the explosion animation and demonstrate that our framework can well handle such black-box-generated motions.
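
To illustrate how linear weights over fixed basis functions can be read back as a kinematic formula, the sketch below fits displacement samples to the two basis functions t and t². It is a deliberately simplified stand-in: it uses ordinary least squares on clean synthetic displacements, whereas the paper learns from rendered multi-view video through NeRF. The function name `fit_free_fall` and the choice of basis are assumptions for illustration.

```python
def fit_free_fall(times, displacements):
    """Least-squares fit of displacement = a*t + b*t**2 (no constant term)."""
    # Entries of the 2x2 normal equations for the basis functions t and t**2.
    s_t2 = sum(t ** 2 for t in times)
    s_t3 = sum(t ** 3 for t in times)
    s_t4 = sum(t ** 4 for t in times)
    s_ty = sum(t * y for t, y in zip(times, displacements))
    s_t2y = sum(t ** 2 * y for t, y in zip(times, displacements))
    # Solve with Cramer's rule.
    det = s_t2 * s_t4 - s_t3 * s_t3
    a = (s_ty * s_t4 - s_t3 * s_t2y) / det
    b = (s_t2 * s_t2y - s_t3 * s_ty) / det
    return a, b


# Synthetic observations generated with v = 3.0 and g = 9.8 (assumed values).
times = [0.1 * i for i in range(1, 11)]
observed = [3.0 * t - 0.5 * 9.8 * t ** 2 for t in times]

a, b = fit_free_fall(times, observed)
v_est, g_est = a, -2.0 * b  # read the physical constants off the weights
print(v_est, g_est)
```

Because the model is linear in its weights, the fitted coefficients map directly onto v and g, which is the sense in which a formula can be derived analytically after training.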

Abstract:
Depression detection research has increased over the last few decades, but limited data availability and representation learning remain a major bottleneck. Recently, self-supervised learning has seen success in pretraining text embeddings and has been applied broadly to related tasks with sparse data, while pretrained audio embeddings based on self-supervised learning are rarely investigated. This paper proposes DEPA, a self-supervised, pretrained depression audio embedding method for depression detection. An encoder-decoder network is used to extract DEPA on in-domain depressed datasets (DAIC and MDD) and out-domain (Switchboard, Alzheimer's) datasets. With DEPA as the audio embedding extracted at response-level, a significant performance gain is achieved on downstream tasks, evaluated on both sparse datasets like DAIC and the large major depression disorder dataset (MDD). This paper not only presents a novel embedding extraction method that captures response-level representations for depression detection but, more significantly, explores self-supervised learning in a specific task within audio processing.

Abstract:
Depth estimation is a structure learning problem. The affinity among neighbouring pixels plays an important role in inferring depth values. In this paper, we propose to learn structure affinity in both the spatial and temporal domains for accurate depth estimation from monocular videos. Specifically, we first propose a convolutional spatial temporal propagation network (CSTPN) that learns affinity among neighbouring video frames. Secondly, we employ a structure knowledge distillation scheme that transfers the spatial temporal affinity learned by a cumbersome network to a compact network. By calculating pixel-wise similarities between neighbouring frames and neighbouring sequences, our knowledge distillation scheme efficiently captures both short-term and long-term spatial temporal affinity. Finally, we apply a warping loss based on optical flow between video frames to further enforce the temporal affinity. Experimental results show that our proposed depth estimation approach outperforms the state-of-the-art methods on both indoor and outdoor benchmark datasets.
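
The distillation idea of matching pixel-wise similarities between neighbouring frames can be sketched as follows. This is a minimal illustration under assumptions: cosine similarity as the affinity measure, a mean squared error between teacher and student affinity matrices as the loss, and per-pixel feature vectors given as plain lists. The abstract does not state which similarity or loss the authors use, and all function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def affinity_matrix(frame_a, frame_b):
    """Pairwise similarity between every pixel of frame_a and of frame_b."""
    return [[cosine(u, v) for v in frame_b] for u in frame_a]


def affinity_distillation_loss(teacher_a, teacher_b, student_a, student_b):
    """Mean squared difference between teacher and student affinities."""
    teacher = affinity_matrix(teacher_a, teacher_b)
    student = affinity_matrix(student_a, student_b)
    count = len(teacher) * len(teacher[0])
    return sum(
        (t - s) ** 2
        for t_row, s_row in zip(teacher, student)
        for t, s in zip(t_row, s_row)
    ) / count


# Two pixels per frame; teacher features have 3 channels, student has 2.
teacher_t0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher_t1 = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
student_t0 = [[1.0, 0.0], [0.5, 0.5]]
student_t1 = [[0.9, 0.1], [0.2, 1.0]]
print(affinity_distillation_loss(teacher_t0, teacher_t1, student_t0, student_t1))
```

The affinity matrix has one entry per pixel pair regardless of channel count, so a wide teacher and a narrow student can be compared without any projection layer.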

Abstract:
In this paper we focus on landscape animation, which aims to generate time-lapse videos from a single landscape image. Motion is crucial for landscape animation as it determines how objects move in videos. Existing methods are able to generate appealing videos by learning motion from real time-lapse videos. However, current methods suffer from inaccurate motion generation, which leads to unrealistic video results. To tackle this problem, we propose a model named FGLA to generate high-quality and realistic videos by learning Fine-Grained motion embedding for Landscape Animation. Our model consists of two parts: (1) a motion encoder which embeds time-lapse motion in a fine-grained way, and (2) a motion generator which generates realistic motion to animate input images. To train and evaluate on diverse time-lapse videos, we build the largest high-resolution Time-lapse video dataset with Diverse scenes, namely Time-lapse-D, which includes 16,874 video clips with over 10 million frames. Quantitative and qualitative experimental results demonstrate the superiority of our method. In particular, our method achieves relative improvements of 19% on LPIPS and 5.6% on FVD compared with state-of-the-art methods on our dataset. A user study carried out with 700 human subjects shows that our approach visually outperforms existing methods by a large margin.

Abstract:
Stereo Video Super-Resolution (StereoVSR) aims to generate high-resolution video streams from two low-resolution videos under stereo settings. Existing video super-resolution and stereo image super-resolution techniques can be extended to tackle the StereoVSR task, yet they cannot make full use of the multi-view and temporal information to achieve satisfactory performance. In this paper, we propose a novel Stereo Video Super-Resolution Network (SVSRNet) to fulfill the StereoVSR task by exploiting view-temporal correlations. First, we devise a view-temporal attention module (VTAM) to integrate cross-time, cross-view information for constructing high-resolution stereo videos. Second, we propose a spatial-temporal fusion module (STFM), which aggregates information across time within each view to emphasize important features for subsequent restoration. In addition, we design a view-temporal consistency loss function to enforce a consistency constraint on the super-resolved stereo videos. Comprehensive experimental results demonstrate that our method generates superior results.

Abstract:
Food image segmentation is a critical and indispensable task for developing health-related applications such as estimating food calories and nutrients. Existing food image segmentation models underperform for two reasons: (1) there is a lack of high-quality food image datasets with fine-grained ingredient labels and pixel-wise location masks, since the existing datasets either carry coarse ingredient labels or are small in size; and (2) the complex appearance of food makes it difficult to localize and recognize ingredients in food images, e.g., the ingredients may overlap one another in the same image, and the same ingredient may look different across food images.

Abstract:
Natural image matting estimates the alpha values of unknown regions in the trimap. Recently, deep learning based methods propagate the alpha values from the known regions to unknown regions according to the similarity between them. However, we find that more than 50% of the pixels in the unknown regions cannot be correlated to pixels in known regions due to the small effective receptive fields of common convolutional neural networks, which leads to inaccurate estimation when the pixels in the unknown regions cannot be inferred from pixels in the receptive fields alone. To solve this problem, we propose the Long-Range Feature Propagating Network (LFPNet), which learns long-range context features outside the receptive fields for alpha matte estimation. Specifically, we first design the propagating module, which extracts the context features from the downsampled image. Then, we present Center-Surround Pyramid Pooling (CSPP), which explicitly propagates the context features from the surrounding context image patch to the inner center image patch. Finally, we use the matting module, which takes the image, trimap and context features to estimate the alpha matte. Experimental results demonstrate that the proposed method performs favorably against the state-of-the-art methods on the AlphaMatting and Adobe Image Matting datasets.

Abstract:
Adaptive and flexible image editing is a desirable function of modern generative models. In this work, we present a generative model with an auto-encoder architecture for per-region style manipulation. We apply a code consistency loss to enforce an explicit disentanglement between content and style latent representations, making the content and style of generated samples consistent with their corresponding content and style references. The model is also constrained by a content alignment loss to ensure that foreground editing does not interfere with background content. As a result, given region-of-interest masks provided by users, our model supports foreground region-wise style transfer. Notably, our model relies only on self-supervision and requires no extra annotations such as semantic labels. Extensive experiments show the effectiveness of the proposed method and exhibit the flexibility of the proposed model for various applications, including region-wise style editing, latent space interpolation, and cross-domain style transfer.

Abstract:
The image virtual try-on task has abundant applications and has recently become a hot research topic. Existing 2D image-based virtual try-on methods aim to transfer a target clothing image onto a reference person, which has two main disadvantages: they cannot precisely control the size and length of the clothing, and they cannot accurately estimate the user's figure when the user is wearing thick clothing, resulting in an inaccurate dressing effect. In this paper, we put forward a related task that aims to dress underwear models in clothing. To address the above drawbacks, we propose a Shape Controllable Virtual Try-On Network (SC-VTON), where a graph attention network integrates the information of model and clothing to generate the warped clothing image. In addition, control points are incorporated into SC-VTON to obtain the desired clothing shape. Furthermore, by adding a Splitting Network and a Synthesis Network, we can use in-shop clothing/model pair data to help optimize the deformation module and generalize the task to the typical virtual try-on task. Extensive experiments show that the proposed method can achieve accurate shape control. Meanwhile, compared with other methods, our method can generate high-resolution results with detailed textures, which can be applied in real applications.

Abstract:
Existing few-shot learning (FSL) methods usually assume base classes and novel classes are from the same domain (in-domain setting). However, in practice, it may be infeasible to collect sufficient training samples for some special domains to construct base classes. To solve this problem, cross-domain FSL (CDFSL) was proposed very recently to transfer knowledge from general-domain base classes to special-domain novel classes. Existing CDFSL works mostly focus on transferring between near domains and rarely consider transferring between distant domains, which is needed in practice because any novel classes could appear in real-world applications, and which is even more challenging. In this paper, we study a challenging subset of CDFSL where the novel classes are in distant domains from base classes, by revisiting the mid-level features, which are more transferable yet under-explored in mainstream FSL work. To boost the discriminability of mid-level features, we propose a residual-prediction task to encourage mid-level features to learn discriminative information of each sample. Notably, such a mechanism also benefits in-domain FSL and CDFSL in near domains. Therefore, we provide two types of features for cross-domain and in-domain FSL respectively, under the same training framework. Experiments under both settings on six public datasets, including two challenging medical datasets, validate our rationale and demonstrate state-of-the-art performance. Code will be released.

Abstract:
Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information to recognize seen and unseen samples, where unseen classes are not observable during training. It is natural to derive generative models and hallucinate training samples for unseen classes based on the knowledge learned from the seen samples. However, most of these models suffer from generation shifts, where the synthesized samples may drift from the real distribution of unseen data. In this paper, we propose a novel generative flow framework that consists of multiple conditional affine coupling layers for learning unseen data generation. In particular, we identify three potential problems that trigger the generation shifts, i.e., semantic inconsistency, variance collapse, and structure disorder, and address each of them in turn. First, to reinforce the correlations between the generated samples and their corresponding attributes, we explicitly embed the semantic information into the transformations in each coupling layer. Second, to recover the intrinsic variance of the real unseen features, we introduce a visual perturbation strategy to diversify the generated data and thereby help adjust the decision boundary of the classifiers. Third, a relative positioning strategy is proposed to revise the attribute embeddings, guiding them to fully preserve the inter-class geometric structure and further avoid structure disorder in the semantic space. Experimental results demonstrate that GSMFlow achieves state-of-the-art performance on GZSL.
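
A conditional affine coupling layer, the building block named above, can be sketched in a few lines. The version below follows the standard coupling construction: half of the input passes through unchanged and parameterizes a scale and shift for the other half, with the condition (here standing in for class attributes) fed into that parameterization. The toy conditioner `scale_and_shift` is a hypothetical placeholder for a learned network and is not taken from the paper.

```python
import math


def scale_and_shift(x1, cond):
    """Toy conditioner (assumption): a real model would use a learned network."""
    h = sum(x1) + sum(cond)
    log_scale = math.tanh(h)  # bounded log-scale keeps exp() well behaved
    shift = 0.5 * h
    return log_scale, shift


def coupling_forward(x, cond):
    """Map x to y; the first half is copied, the second half is transformed."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    log_scale, shift = scale_and_shift(x1, cond)
    y2 = [v * math.exp(log_scale) + shift for v in x2]
    log_det = log_scale * len(x2)  # log |det Jacobian| of the transform
    return x1 + y2, log_det


def coupling_inverse(y, cond):
    """Exact inverse: recompute scale and shift from the untouched half."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    log_scale, shift = scale_and_shift(y1, cond)
    x2 = [(v - shift) * math.exp(-log_scale) for v in y2]
    return y1 + x2


x = [0.2, -1.0, 0.7, 1.5]
attributes = [1.0, 0.0, 0.5]
y, log_det = coupling_forward(x, attributes)
print(y, log_det, coupling_inverse(y, attributes))
```

Since the first half is left untouched, the inverse can recompute exactly the same scale and shift, which makes the layer invertible with a cheap log-determinant.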

Abstract:
This work proposes a graph search based method for human motion sequence synthesis, complementing the modern generative model (e.g., variational auto-encoder or Gaussian process) based solutions that currently dominate this task and showing strong advantages in several aspects. The cornerstone of our method is a novel representation which we dub the motion graph. Each motion graph is scaffolded by a set of realistic human motion sequences (e.g., all training data in the Human3.6M benchmark). We devise a scheme that adds transition edges across different motion sequences, enabling longer and more diverse routes in the motion graph. Crucially, the proposed motion graph bridges the problem of human motion synthesis with graph-oriented combinatorial optimization, by naturally treating the pre-specified starting or ending pose in human pose synthesis as end-points of the retrieved graph path. Based on a jump-sensitive graph path search algorithm proposed in this paper, our model can efficiently solve human motion completion over the motion graphs. In contrast, existing methods are mainly effective for human motion prediction and inadequate for imputing missing sequences while jointly satisfying the two constraints of pre-specified starting and ending poses. For the case of only specifying the starting pose (i.e., human motion prediction), a forward graph walk from the starting node is first performed to sample a diverse set of ending nodes on the motion graph, each of which defines a motion completion problem. We conduct comprehensive experiments on two large-scale benchmarks (Human3.6M and HumanEva-I). The proposed method clearly proves to be superior in terms of several metrics, including the diversity of generated human motion sequences, affinity to real poses, and cross-scenario generalization.
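
Motion completion as a path search between a start pose and an end pose can be illustrated with a small example. The abstract does not describe the jump-sensitive algorithm itself, so the sketch below substitutes plain Dijkstra search in which transition edges (jumps between sequences) carry an extra penalty. The graph layout, node names, and `jump_penalty` value are hypothetical.

```python
import heapq


def jump_penalized_path(graph, start, goal, jump_penalty=5.0):
    """Cheapest path where each step costs 1 and each jump costs extra."""
    best = {start: 0.0}
    heap = [(0.0, start, [start])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry, a cheaper route was already found
        for neighbor, is_jump in graph.get(node, []):
            new_cost = cost + 1.0 + (jump_penalty if is_jump else 0.0)
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor, path + [neighbor]))
    return None  # goal is not reachable from start


# Two recorded sequences (a0..a3 and b1..b2) joined by one transition edge.
# Each edge is (next_pose, is_transition_edge).
motion_graph = {
    "a0": [("a1", False)],
    "a1": [("a2", False), ("b1", True)],
    "a2": [("a3", False)],
    "a3": [],
    "b1": [("b2", False)],
    "b2": [],
}
print(jump_penalized_path(motion_graph, "a0", "b2"))
```

Penalizing jumps favours routes that stay within one recorded sequence as long as possible, which is one simple way a search could be made sensitive to transitions.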

Abstract:
While recent research on computational 3D scene synthesis has achieved impressive results, automatically synthesized scenes do not guarantee the satisfaction of end users. On the other hand, manual scene modelling can always ensure high quality, but requires a cumbersome trial-and-error process. In this paper, we bridge this gap by presenting a data-driven 3D scene synthesis framework that can intelligently infer objects to add to the scene by incorporating and simulating user preferences with minimum input. While the cursor is moved and clicked in the scene, our framework automatically selects and transforms suitable objects into scenes in real time. This is based on priors learnt from the dataset for placing different types of objects, which are updated according to the current scene context. Through extensive experiments we demonstrate that our framework outperforms the state-of-the-art on result aesthetics, and enables effective and efficient user interactions.

Abstract:
In video transmission applications, video signals are transmitted over lossy channels, resulting in low-quality received signals. To restore videos on recipient edge devices in real-time, we introduce an efficient video restoration network, EVRNet. EVRNet efficiently allocates parameters inside the network using alignment, differential, and fusion modules. With extensive experiments on different video restoration tasks (deblocking, denoising, and super-resolution), we demonstrate that EVRNet delivers competitive performance to existing methods with significantly fewer parameters and MACs. For example, EVRNet has 260× fewer parameters and 958× fewer MACs than enhanced deformable convolution-based video restoration network (EDVR) for 4× video super-resolution while its SSIM score is 0.018 less than EDVR. We also evaluated the performance of EVRNet under multiple distortions on unseen dataset to demonstrate its ability in modeling variable-length sequences under both camera and object motion.

Abstract:
We propose AI-Lyricist: a system to generate novel yet meaningful lyrics given a required vocabulary and a MIDI file as inputs. This task involves multiple challenges, including automatically identifying the melody and extracting a syllable template from multi-channel music, generating creative lyrics that match the input music's style and syllable alignment, and satisfying vocabulary constraints. To address these challenges, we propose an automatic lyrics generation system consisting of four modules: (1) a music structure analyzer to derive the musical structure and syllable template from a given MIDI file, utilizing the concept of expected syllable number to better identify the melody, (2) a SeqGAN-based lyrics generator optimized by multi-adversarial training through policy gradients with twin discriminators for text quality and syllable alignment, (3) a deep coupled music-lyrics embedding model to project music and lyrics into a joint space to allow fair comparison of both melody and lyric constraints, and (4) a module called Polisher, to satisfy vocabulary constraints by applying a mask to the generator and substituting the words to be learned. We trained our model on a dataset of over 7,000 music-lyrics pairs, enhanced with manually annotated labels in terms of theme, sentiment and genre. Both objective and subjective evaluations show AI-Lyricist's superior performance against the state-of-the-art for the proposed tasks.

Abstract:
Text recognition is a key pillar of many real-world multimedia applications. Existing text recognition approaches focus on recognizing isolated instances, whose text fields are visually separated and do not interfere with each other. Moreover, these approaches cannot handle overlapped instances that often appear in sheets like invoices, receipts and math exercises, where printed templates are generated beforehand and extra contents are added afterward on top of existing texts. In this paper, we aim to tackle this problem by proposing RecycleNet, which automatically extracts and reconstructs overlapped instances by fully recycling the intersecting pixels that used to be obstacles for recognition. RecycleNet runs in parallel with existing recognition systems and serves as a plug-and-play module to boost recognition performance with zero effort. We also release the OverlapText-500 dataset, which helps to advance the design of better overlapped text recovery and recognition solutions.

Abstract:
Weakly-supervised video grounding has been investigated to ground textual phrases in video content with only video-sentence pairs provided during training, because bounding box annotations are prohibitively costly. Existing methods cast this task as a frame-level multiple instance learning (MIL) problem with a ranking loss. However, an object might appear only sparsely across multiple frames, which introduces uncertain false-positive frames. Thus, directly computing the average loss over all frames is inadequate in the video domain. Moreover, the positive and negative pairs are tightly coupled in the ranking loss, so it is impossible to handle false-positive frames individually. Additionally, a naive inner product is suboptimal as a cross-domain similarity measure. To solve these issues, we propose a novel AsyNCE loss to flexibly disentangle the positive pairs from negative ones in frame-level MIL, which allows for mitigating the uncertainty of false-positive frames effectively. Besides, a cross-modal transformer block is introduced to purify the text feature with image frame context, generating a visual-guided text feature for a better similarity measure. Extensive experiments on the YouCook2, RoboWatch and WAB datasets demonstrate the superiority and robustness of our method over state-of-the-art methods.

Abstract:
Current video retrieval systems on mobile devices cannot process complex natural language queries, especially if they contain personalized concepts, such as proper names. To address these shortcomings, we propose an efficient and privacy-preserving video retrieval system that works well with personalized queries containing proper names, without re-training using personalized labelled data from users. Our system first computes an initial ranking of a video collection by using a generic attention-based video-text matching model (i.e., a model designed for non-personalized queries), and then uses a face detector to conduct personalized adjustments to these initial rankings. These adjustments are done by reasoning over the face information from the detector and the attention information provided by the generic model. We show that our system significantly outperforms existing keyword-based retrieval systems, and achieves comparable performance to the generic matching model fine-tuned on plenty of labelled data. Our results suggest that the proposed system can effectively capture both semantic context and personalized information in queries.

Abstract:
The low-latency streams captured by event cameras have shown impressive potential in addressing vision tasks such as video reconstruction and optical flow estimation. However, these tasks often require massive training event streams, which are expensive to collect and largely bypassed by recently proposed event camera simulators. To align the statistics of synthetic events with those of target event cameras, existing simulators often need to be heuristically tuned with elaborate manual effort and thus cannot automatically adapt to various domains. To address this issue, this work proposes one of the first learning-based, domain-adaptive event simulators. Given a specific domain, the proposed simulator learns pixel-wise distributions of event contrast thresholds that, after stochastic sampling and parallel rendering, can generate event representations well aligned with those from real event cameras. To achieve such domain-specific alignment, we design a novel divide-and-conquer discrimination scheme that adaptively evaluates the synthetic-to-real consistency of event representations according to the local statistics of images and events. Trained with the data synthesized by the proposed simulator, the performance of state-of-the-art event-based video reconstruction and optical flow estimation approaches is boosted by up to 22.9% and 2.8%, respectively. In addition, we show significantly improved domain adaptation capability over existing event simulators and tuning strategies, consistently on three real event datasets.

Abstract:
Deep learning has achieved great success in a wide spectrum of multimedia applications such as image classification, natural language processing and multimodal data analysis. Recent years have seen the development of many deep learning frameworks that provide a high-level programming interface for users to design models, conduct training and deploy inference. However, it remains challenging to build an efficient end-to-end multimedia application with most existing frameworks. Specifically, in terms of usability, it is demanding for non-experts to implement deep learning models, obtain the right settings for the entire machine learning pipeline, manage models and datasets, and exploit external data sources all together. Further, in terms of adaptability, elastic computation solutions are much needed as the actual serving workload fluctuates constantly, and scaling the hardware resources to handle the fluctuating workload is typically infeasible. To address these challenges, we introduce SINGA-Easy, a new deep learning framework that provides distributed hyper-parameter tuning at the training stage, dynamic computational cost control at the inference stage, and intuitive user interactions with multimedia contents facilitated by model explanation. Our experiments on the training and deployment of multi-modality data analysis applications show that the framework is both usable and adaptable to dynamic inference loads. We implement SINGA-Easy on top of Apache SINGA and demonstrate our system with the entire machine learning life cycle.

Abstract:
Multi-view clustering and multi-view dimension reduction explore ubiquitous and complementary information between multiple features to enhance clustering and recognition performance. However, multi-view clustering and multi-view dimension reduction are treated independently, ignoring the underlying correlations between them. In addition, previous methods mainly focus on using the tensor nuclear norm for low-rank representation to explore the high correlation of multi-view features, which often causes estimation bias of the tensor rank. To overcome these limitations, we propose the partial tubal nuclear norm regularized multi-view learning (PTN2ML) method, in which the partial tubal nuclear norm, as a non-convex surrogate of the tensor tubal multi-rank, minimizes only the partial sum of the smaller tubal singular values to preserve the low-rank property of the self-representation tensor. PTN2ML pursues the latent representation from the projection space rather than from the input space to reveal the structural consensus and suppress the disturbance of noisy data. The proposed method can be efficiently optimized by the alternating direction method of multipliers. Extensive experiments, including multi-view clustering and multi-view dimension reduction, substantiate the superiority of the proposed method over state-of-the-art methods.

Abstract:
Cloud gaming enables users to play games on virtually any device. This is achieved by offloading the game rendering and encoding to cloud datacenters. As game resolutions and frame rates increase, cloud gaming platforms face a major challenge to stream high quality games due to the high bandwidth and low latency requirements. In this paper, we propose a new video encoding pipeline, called DeepGame, for cloud gaming platforms to reduce the bandwidth requirements with limited to no impact on the player quality of experience. DeepGame learns the player's contextual interest in the game and the temporal correlation of that interest using a spatio-temporal deep neural network. Then, it encodes various areas in the video frames with different quality levels proportional to their contextual importance. DeepGame does not change the source code of the video encoder or the video game, and it does not require any additional hardware or software at the client side. We implemented DeepGame in an open-source cloud gaming platform and evaluated its performance using multiple popular games. We also conducted a subjective study with real players to demonstrate the potential gains achieved by DeepGame and its practicality. Our results show that DeepGame can reduce the bandwidth requirements by up to 36% compared to the baseline encoder, while maintaining the same level of perceived quality for players and running in real time.
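
The step of encoding areas at quality levels tied to their importance can be pictured as turning a per-block importance map into per-block quantization parameters (QPs), where a lower QP means higher quality. The linear mapping, the `base_qp` and `max_offset` values, and the function name below are assumptions for illustration; the abstract does not state how DeepGame converts importance into encoder settings.

```python
def importance_to_qp(importance, base_qp=22, max_offset=12, qp_max=51):
    """Map importance weights in [0, 1] to per-block QP values.

    Important blocks keep base_qp (best quality); unimportant blocks are
    pushed toward base_qp + max_offset (coarser quantization).
    """
    qp_rows = []
    for row in importance:
        qp_row = []
        for weight in row:
            clamped = min(1.0, max(0.0, weight))  # guard against bad inputs
            qp = base_qp + round(max_offset * (1.0 - clamped))
            qp_row.append(min(qp_max, qp))  # 51 is the top of the H.264/HEVC 8-bit QP range
        qp_rows.append(qp_row)
    return qp_rows


# 2x2 grid of blocks: top-left is where the player is assumed to look.
importance = [[1.0, 0.5],
              [0.0, 0.5]]
print(importance_to_qp(importance))
```

A map like this could be handed to an encoder that accepts per-region quality hints, which would be consistent with the abstract's claim that the encoder source code is left unchanged.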

Abstract:
Noisy labels, resulting from mistakes in manual labeling or from web data collection for supervised learning, can cause neural networks to overfit the misleading information and degrade generalization performance. Self-supervised learning works in the absence of labels and thus eliminates the negative impact of noisy labels. Motivated by co-training with both a supervised learning view and a self-supervised learning view, we propose a simple yet effective method called Co-learning for learning with noisy labels. Co-learning performs supervised learning and self-supervised learning in a cooperative way. The constraints of intrinsic similarity with the self-supervised module and structural similarity with the noisily-supervised module are imposed on a shared common feature encoder to regularize the network to maximize the agreement between the two constraints. Co-learning is fairly compared with peer methods on corrupted data from benchmark datasets, and extensive results demonstrate that Co-learning is superior to many state-of-the-art approaches.

Abstract:
A point cloud serves as a representation of the surface of a three-dimensional (3D) shape. Deep generative models have been adapted to model their variations, typically using a map from a ball-like set of latent variables. However, previous approaches did not pay much attention to the topological structure of a point cloud, even though a continuous map cannot express the varying numbers of holes and intersections. Moreover, a point cloud is often composed of multiple subparts, which are also difficult to express. In this study, we propose ChartPointFlow, a flow-based generative model with multiple latent labels for 3D point clouds. Each label is assigned to points in an unsupervised manner. Then, a map conditioned on a label is assigned to a continuous subset of a point cloud, similar to a chart of a manifold. This enables our proposed model to preserve the topological structure with clear boundaries, whereas previous approaches tend to generate blurry point clouds and fail to generate holes. The experimental results demonstrate that ChartPointFlow achieves state-of-the-art performance in terms of generation and reconstruction compared with other point cloud generators. Moreover, ChartPointFlow divides an object into semantic subparts using charts, and it demonstrates superior performance in unsupervised segmentation.

Abstract:
Apercevoir is an artwork that can perceive its environmental perturbations and convert them into a spatial sound field with location information. It consists of multiple plant cyborgs, each comprising a Mimosa Pudica (sensitive plant) connected to a bioamplifier, and can sense human movements by analyzing the biosignals with a machine learning model. Through sharing multiple cyborgs' biosignals, this network portrays the concept of multiple beings transcending an individual's physical confines to form a Bio Internet of Things (IoT) system capable of perception, feedback, and group decision-making within a wider scope. A particular feature of this system is its interactive bone conduction headphones, where the audience can listen to a sound field including 'vibrations' of nearby human activities detected by plant cyborgs, and even warnings among the cyborg network responding to foreign disturbance and damage. This sound field invites audiences to close their eyes and listen attentively to plants while the biosignals and changes in sound reveal the presence of other entities in the space.

Abstract:
Recently, DETR and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing as well as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture. The goal of this paper is to streamline the pipeline of VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow, recurrent neural networks, and relation networks. Besides, benefiting from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of three components: a Temporal Deformable Transformer Encoder (TDTE) to encode the multi-frame spatial details, a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the current-frame detection results. These designs boost the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We hope TransVOD can provide a new perspective for video object detection.

Abstract:
The rise of deep learning has facilitated the development of single image super-resolution (SISR). However, increasingly burdensome model complexity and memory occupation severely hinder its practical deployment on resource-limited devices. In this paper, we propose a novel joint-distillation (JDSR) framework to boost the representation of various off-the-shelf lightweight SR models. The framework includes two stages: superior LR generation and joint-distillation learning. The superior LR is obtained from the HR image itself. With fewer than 300K parameters, the peer network using superior LR as input can achieve SR performance comparable to that of large models, e.g., RCAN with 15M parameters, which makes superior LR suitable as the peer network's input and saves training expense. The joint-distillation learning consists of internal self-distillation and external mutual learning. The internal self-distillation aims to achieve model self-boosting by transferring knowledge from the deeper SR output to the shallower one. Specifically, each intermediate SR output is supervised by the HR image and the soft label from subsequent deeper outputs. To shrink the capacity gap between shallow and deep layers, a soft label generator is designed in a progressive backward fusion way with meta-learning for adaptive weight fine-tuning. The external mutual learning focuses on obtaining interaction information from a peer network in the process. Moreover, a curriculum learning strategy and a performance gap threshold are introduced to balance the convergence rates of the original SR model and its peer network. Comprehensive experiments on benchmark datasets demonstrate that our proposal improves the performance of recent lightweight SR models by a large margin, with the same model architecture and inference expense.
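The internal self-distillation step (each intermediate SR output supervised by the HR image and by a soft label from deeper outputs) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the L1 loss, the fixed weight `alpha`, the flat-list image layout, and the use of a plain average of deeper outputs as the soft label are all assumptions standing in for the meta-learned progressive backward fusion described in the abstract.

```python
def l1(a, b):
    """Mean absolute error between two equally sized flat pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def self_distillation_loss(outputs, hr, alpha=0.5):
    """Total loss over all intermediate SR outputs.

    outputs: list of flat pixel lists, ordered shallow -> deep (assumed layout).
    hr:      flat pixel list of the ground-truth HR image.
    alpha:   assumed weight of the soft-label term.
    """
    total = 0.0
    for i, out in enumerate(outputs):
        # Every intermediate output is supervised by the HR image.
        total += l1(out, hr)
        deeper = outputs[i + 1:]
        if deeper:
            # Assumed soft label: plain average of all deeper outputs.
            # (In a real framework this target would be treated as a constant.)
            soft = [sum(px) / len(deeper) for px in zip(*deeper)]
            total += alpha * l1(out, soft)
    return total
```

The deepest output has no deeper teacher, so it only receives the HR supervision term.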

Abstract:
Face anti-spoofing is an important step for secure face recognition. One of the main challenges is how to learn and build a general classifier that is able to resist various presentation attacks. Recently, patch-based face anti-spoofing schemes have been shown to improve the robustness of the classifier. These schemes extract subtle liveness cues from small local patches independently and therefore do not fully exploit the correlations among the patches. In this paper, we propose a Patch-based Compact Graph Network (PCGN) to diffuse the subtle liveness cues from all the patches. First, the image is encoded into a compact graph by connecting each node with its backward neighbors. We then propose an asymmetrical updating strategy to update the compact graph. Such a strategy aggregates the node based on whether it is a sender or receiver, which leads to better message passing. The updated graph is eventually decoded to make the final decision. We conduct experiments on four public databases with four intra-database protocols and eight cross-database protocols, the results of which demonstrate the effectiveness of our PCGN for face anti-spoofing.

Abstract:
Visual grounding has attracted much attention with the growing popularity of vision-language research. Existing one-stage methods are far ahead of two-stage methods in speed. However, these methods fuse the textual feature and the visual feature map by simple concatenation, which ignores the textual semantics and limits these models' ability in cross-modal understanding. To overcome this weakness, we propose a semantic-aware framework that utilizes both the queries' structured knowledge and context-sensitive representations to filter the visual feature maps and localize the referents more accurately. Our framework contains an entity filter, an attribute filter, and a location filter. These three filters process the input visual feature map step by step, each according to the corresponding aspect of the query. A grounding module further regresses the bounding boxes to localize the referential object. Experiments on various commonly used datasets show that our framework achieves real-time inference speed and outperforms all state-of-the-art methods.

Abstract:
The task of few-shot visual dubbing focuses on synchronizing the lip movements with arbitrary speech input for any talking head video. Despite moderate improvements, current approaches commonly require high-quality homologous video and audio data sources, and thus fail to leverage heterogeneous data sufficiently. In practice, it may be intractable to collect perfect homologous data in some cases, for example, audio-corrupted or picture-blurry videos. To explore this kind of data and support high-fidelity few-shot visual dubbing, in this paper, we propose a novel, simple yet efficient two-stage framework with higher flexibility in mining heterogeneous data. Specifically, our two-stage paradigm employs facial landmarks as an intermediate prior of latent representations and disentangles lip movement prediction from the core task of realistic talking head generation. In this way, our method makes it possible to train the two sub-networks independently, using more readily available heterogeneous data. Besides, thanks to the disentanglement, our framework allows further fine-tuning for a given talking head, thereby better preserving speaker identity in the final synthesized results. Moreover, the proposed method can also transfer appearance features from others to the target speaker. Extensive experimental results demonstrate the superiority of our proposed method over the state of the art in generating highly realistic videos synchronized with the speech.

Abstract:
Existing multi-modal subspace clustering methods, aiming to exploit the correlation information between different modalities, have achieved promising preliminary results. However, these methods might be incapable of handling real problems with complex heterogeneous structures between different modalities, since the large heterogeneous structure makes it difficult to directly learn a discriminative shared self-representation for multi-modal clustering. To tackle this problem, in this paper, we propose a deep Self-supervised t-SNE method (StSNE) for multi-modal subspace clustering, which learns soft label features by multi-modal encoders and utilizes the common label feature to supervise the soft label feature of each modality by adversarial training and reconstruction networks. Specifically, the proposed StSNE consists of four components: 1) multi-modal convolutional encoders; 2) a self-supervised t-SNE module; 3) a self-expressive layer; 4) multi-modal convolutional decoders. Multi-modal data are fed to encoders to obtain soft label features, for which the self-supervised t-SNE module is added to make full use of the label information among different modalities. Simultaneously, the latent representations given by the encoders are constrained by a self-expressive layer to capture the hierarchical information of each modality, followed by decoders reconstructing the encoded features to preserve the structure of the original data. Experimental results on several public datasets demonstrate the superior clustering performance of the proposed method over state-of-the-art methods.

Abstract:
Given a question about an image, a Visual Commonsense Reasoning (VCR) model needs to provide not only a correct answer, but also a rationale to justify the answer. It is a challenging task due to the requirements of diverse visual content understanding, abstract language comprehension, and complicated inter-modality relationship reasoning. To solve the above challenges, previous methods either resort to a holistic attention mechanism or explore transformer-based models with pre-training, which, however, cannot perform comprehensive understanding and usually suffer from a heavy computing burden. In this paper, we propose a novel multi-level counterfactual contrastive learning network for VCR by jointly modeling the hierarchical visual contents and the inter-modality relationships between the visual and linguistic domains. The proposed method enjoys several merits. First, with sufficient instance-level, image-level, and semantic-level contrastive learning, our model can extract discriminative features and perform comprehensive understanding of the image and linguistic expressions. Second, taking advantage of counterfactual thinking, we can generate informative factual and counterfactual samples for contrastive learning, resulting in a stronger perception ability of our model. Third, an auxiliary contrast module is incorporated into our method to directly optimize the answer prediction in VCR, which further facilitates the representation learning. Extensive experiments on the VCR dataset demonstrate that our approach performs favorably against state-of-the-art methods.

Abstract:
Chinese character inpainting is a challenging task where large missing regions have to be filled with both visually and semantically realistic content. Existing methods generally produce pseudo or ambiguous characters due to a lack of semantic information. Given the key observation that Chinese characters contain both visual glyph representations and intrinsic contextual semantics, we tackle the challenge of similar Chinese characters by modeling the underlying regularities among glyph and semantic information. We propose a semantics-enhanced generative framework for Chinese character inpainting, where a global semantic supervising module (GSSM) is introduced to constrain contextual semantics. In particular, sentence embedding is used to guide the encoding of continuous contextual characters. The method can not only generate realistic Chinese characters, but also explicitly utilize context as a reference during network training to eliminate ambiguity. The proposed method is evaluated on both handwritten and printed Chinese characters with various masks. The experiments show that the method successfully predicts missing character information without any mask input, and achieves significant sentence-level results, benefiting from global semantic supervision, in a wide variety of scenes.

Abstract:
In this paper, we address video instance segmentation using a new generative model that learns effective representations of the target and background appearance. We propose to exploit hierarchical structural embedding over spatio-temporal space, which is compact, powerful, and flexible in contrast to current tracking-by-detection methods. Specifically, our model segments and tracks instances across space and time in a single forward pass, which is formulated as hierarchical embedding learning. The model is trained to locate the pixels belonging to specific instances over a video clip. We first take advantage of a novel mixing function to better fuse spatio-temporal embeddings. Moreover, we introduce normalizing flows to further improve the robustness of the learned appearance embedding, which theoretically extends conventional generative flows to a factorized conditional scheme. Comprehensive experiments on the video instance segmentation benchmark, i.e., YouTube-VIS, demonstrate the effectiveness of the proposed approach. Furthermore, we evaluate our method on an unsupervised video object segmentation dataset to demonstrate its generalizability.

Abstract:
Structured text understanding on Visually Rich Documents (VRDs) is a crucial part of Document Intelligence. Due to the complexity of content and layout in VRDs, structured text understanding has been a challenging task. Most existing studies decouple this problem into two sub-tasks, entity labeling and entity linking, which require a full understanding of the document context at both the token and segment levels. However, little work has addressed solutions that efficiently extract structured data from different levels. This paper proposes a unified framework named StrucTexT, which is flexible and effective for handling both sub-tasks. Specifically, based on the transformer, we introduce a segment-token aligned encoder to deal with the entity labeling and entity linking tasks at different levels of granularity. Moreover, we design a novel pre-training strategy with three self-supervised tasks to learn a richer representation. StrucTexT uses the existing Masked Visual Language Modeling task and the new Sentence Length Prediction and Paired Boxes Direction tasks to incorporate multi-modal information across text, image, and layout. We evaluate our method for structured text understanding at the segment level and token level and show that it significantly outperforms state-of-the-art counterparts on the FUNSD, SROIE, and EPHOIE datasets.

Abstract:
Cross-modal hashing aims to map the data of different modalities into a common binary space to accelerate retrieval. Recently, deep cross-modal hashing methods have shown promising performance by applying deep neural networks to facilitate feature learning. However, known supervised deep methods mainly rely on the label information of datasets, which is insufficient to characterize the latent structures that exist among different modalities. To mitigate this problem, in this paper, we propose to use Graph Convolutional Networks (GCNs) to exploit the local structure information of datasets for cross-modal hash learning. Specifically, a local graph is constructed according to the neighborhood relationships between samples in deep feature spaces and fed into GCNs to generate graph embeddings. Then, a within-modality loss is designed to measure the inner products between deep features and graph embeddings so that hashing networks and GCNs can be jointly optimized. By taking advantage of GCNs to assist the model's training, the performance of hashing networks can be improved. Extensive experiments on benchmarks verify the effectiveness of the proposed method.
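To make the idea of retrieval in a common binary space concrete, the sketch below binarizes real-valued features with a sign threshold and ranks database items by Hamming distance. It is a generic illustration of hashing-based retrieval, not the GCN-assisted method of the paper; the function names and the toy feature values are hypothetical.

```python
def to_hash_code(features):
    """Binarize a real-valued feature vector with a sign threshold."""
    return [1 if v >= 0 else 0 for v in features]


def hamming(a, b):
    """Number of bit positions in which two equal-length codes differ."""
    return sum(x != y for x, y in zip(a, b))


def retrieve(query_code, database_codes, top_k=1):
    """Return indices of the top_k database codes closest to the query."""
    order = sorted(
        range(len(database_codes)),
        key=lambda i: hamming(query_code, database_codes[i]),
    )
    return order[:top_k]


# Hypothetical example: a text query searched against image hash codes.
text_code = to_hash_code([0.3, -1.2, 0.8])
image_codes = [[0, 1, 0], [1, 0, 1], [1, 1, 1]]
nearest = retrieve(text_code, image_codes, top_k=2)
```

Because both modalities share one binary space, a query from one modality can be compared against codes of the other with cheap bit operations, which is where the retrieval speed-up comes from.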

Abstract:
Detecting facial forgery images and videos is an increasingly important topic in multimedia forensics. As forgery images and videos are usually compressed into different formats such as JPEG and H.264 when circulating on the Internet, existing forgery-detection methods trained on uncompressed data often suffer from significant performance degradation in identifying them. To solve this problem, we propose a novel anti-compression facial forgery detection framework, which learns a compression-insensitive embedding feature space utilizing both original and compressed forgeries. Specifically, our approach consists of three ideas: (i) extracting compression-insensitive features from both uncompressed and compressed forgeries using an adversarial learning strategy; (ii) learning a robust partition by constructing a metric loss that can reduce the distance between paired original and compressed images in the embedding space; (iii) improving the accuracy of tampering localization with an attention-transfer module. Experimental results demonstrate that the proposed method is highly effective in handling both compressed and uncompressed facial forgery images.

Abstract:
Tracking with Natural-Language Specification (TNL) is a joint topic of vision and natural language understanding with a wide range of applications. In previous works, the communication between the two heterogeneous features of vision and language is mainly through a simple dynamic convolution. However, the performance of prior works is capped by the difficulty that the linguistic variation of natural language poses for modeling the dynamically changing target and its surroundings. Meanwhile, natural language and vision are first fused and then utilized for tracking, which makes it hard to model the query-focused context. Query-focused tracking should pay more attention to context modeling to promote the correlation between these two features. To address these issues, we propose a capsule-based network, referred to as CapsuleTNL, which performs regression tracking with a natural language query. First, the visual and textual inputs are encoded with capsules, which can establish not only the relationships between entities but also the relationships between the parts of an entity itself. Then, we devise two interaction routing modules: a visual-textual routing module to reduce the linguistic variation of the input query, and a textual-visual routing module to precisely incorporate query-based visual cues simultaneously. To validate the potential of the proposed network for visual object tracking, we evaluate our method on two large tracking benchmarks. The experimental evaluation demonstrates the effectiveness of our capsule-based network.

Abstract:
Human brains are known to be capable of speeding up visual recognition of repeatedly presented objects through faster memory encoding and accessing procedures on activated neurons. For the first time, we borrow and distill such a capability into a semantic memory design, namely SMTM, to improve on-device CNN inference. SMTM employs a hierarchical memory architecture to leverage the long-tail distribution of objects of interest, and further incorporates several novel techniques to put it into effect: (1) it encodes high-dimensional feature maps into low-dimensional, semantic vectors for low-cost yet accurate cache and lookup; (2) it uses a novel metric in determining the exit timing considering different layers' inherent characteristics; (3) it adaptively adjusts the cache size and semantic vectors to fit the scene dynamics. SMTM is prototyped on a commodity CNN engine and runs on both mobile CPU and GPU. Extensive experiments on large-scale datasets and models show that SMTM can significantly speed up model inference over the standard approach (up to 2×) and prior cache designs (up to 1.5×), with acceptable accuracy loss.
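The cache-and-lookup idea in item (1) can be illustrated with a small sketch: a low-dimensional semantic vector is compared against cached per-class vectors, and a sufficiently close match allows inference to exit early. This is not SMTM's actual design; the cosine similarity, the fixed `threshold`, and the dictionary cache layout are assumptions chosen for illustration, whereas the abstract states that the real exit criterion is a novel layer-aware metric.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cache_lookup(vector, cache, threshold=0.9):
    """Return the cached label if the best match is similar enough, else None.

    cache:     dict mapping label -> semantic vector (assumed layout).
    threshold: assumed fixed exit criterion; a hit means the remaining
               layers could be skipped for this input.
    """
    best_label, best_sim = None, -1.0
    for label, cached in cache.items():
        sim = cosine(vector, cached)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label if best_sim >= threshold else None
```

A miss (`None`) simply means the network continues to the next layer, so the cache can only trade accuracy for speed on inputs it claims to recognize.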

Abstract:
Weakly supervised semantic segmentation (WSSS) faces the challenge of acquiring pixel-level object locations with only image-level annotations. In particular, single-stage methods learn image- and pixel-level labels simultaneously to avoid complicated multi-stage computations and sophisticated training procedures. In this paper, we argue that using a single model to accomplish both image- and pixel-level classification forces a trade-off between the two objectives and consequently weakens the recognition capability, because the image-level task tends to learn position-independent features whereas the pixel-level task tends to be position-sensitive. Hence, we propose an effective encoder-decoder framework to explore object boundaries and resolve this dilemma. The encoder and decoder learn position-independent and position-sensitive features independently during end-to-end training. In addition, a global soft pooling is proposed to suppress the activation of background pixels during encoder training and further improve the class activation map (CAM) performance. The edge annotations for decoder training are synthesized from high-confidence CAMs, which does not require extra supervision. Extensive experiments on the Pascal VOC12 dataset demonstrate that our method achieves state-of-the-art performance among end-to-end approaches. It achieves mIoU scores of 63.6% and 65.7% on the val and test sets, respectively.
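For readers unfamiliar with the mIoU metric reported above, the following sketch computes mean intersection-over-union on flat label lists. It is a simplified illustration: benchmark evaluations typically accumulate a confusion matrix over the whole dataset, and the handling of classes absent from both prediction and ground truth (skipped here) is an assumption.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes present in pred or target.

    pred, target: flat lists of integer class labels, one per pixel.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union:  # skip classes that appear in neither list
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```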

Abstract:
This paper attacks an emerging challenge of multi-modal retinal disease recognition. Given a multi-modal case consisting of a color fundus photo (CFP) and an array of OCT B-scan images acquired during an eye examination, we aim to build a deep neural network that recognizes multiple vision-threatening diseases for the given case. As the diagnostic efficacy of CFP and OCT is disease-dependent, the network's ability to be both selective and interpretable is important. Moreover, as both data acquisition and manual labeling are extremely expensive in the medical domain, the network has to be relatively lightweight for learning from a limited set of labeled multi-modal samples. Prior art on retinal disease recognition focuses either on a single disease or on a single modality, leaving multi-modal fusion largely underexplored. We propose in this paper Multi-Modal Multi-Instance Learning (MM-MIL) for selectively fusing CFP and OCT modalities. Its lightweight architecture (as compared to current multi-head attention modules) makes it suited for learning from relatively small datasets. For an effective use of MM-MIL, we propose to generate a pseudo sequence of CFPs by oversampling a given CFP. The benefits of this tactic include balancing instances across modalities, increasing the resolution of the CFP input, and identifying the regions of the CFP most relevant to the final diagnosis. Extensive experiments on a real-world dataset consisting of 1,206 multi-modal cases from 1,193 eyes of 836 subjects demonstrate the viability of the proposed model.

Abstract:
Lipreading, which aims to interpret speech by watching the lip movements of the speaker, has great significance in human communication and speech understanding. Despite having reached a feasible performance, lipreading still faces two crucial challenges: 1) the considerable lip movement variations across different speakers when they utter the same words; 2) the similar lip movements of people when they utter easily confused phonemes. To tackle these two problems, we propose a novel lipreading framework, CALLip, which employs attribute learning and contrastive learning. The attribute learning extracts speaker identity-aware features through a speaker recognition branch, which are able to normalize the lip shapes to eliminate cross-speaker variations. Considering that audio signals are intrinsically more distinguishable than visual signals, the contrastive learning is devised between visual and audio signals to enhance the discrimination of visual features and alleviate the viseme confusion problem. Experimental results show that CALLip does learn better features of lip movements. The comparisons on both English and Chinese benchmark datasets, GRID and CMLR, demonstrate that CALLip outperforms six state-of-the-art lipreading methods without using any additional data.
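Contrastive learning between visual and audio signals can be sketched with an InfoNCE-style objective, in which the visual and audio embeddings of the same clip form a positive pair and all other pairings in the batch act as negatives. The abstract does not specify the loss, so InfoNCE, the `temperature` value, and the assumption of L2-normalized embeddings are illustrative choices rather than CALLip's actual formulation.

```python
import math


def info_nce(visual, audio, temperature=0.1):
    """Average InfoNCE loss where visual[i] and audio[i] form the positive pair.

    visual, audio: equal-length lists of embedding vectors, assumed L2-normalized.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    loss = 0.0
    for i, v in enumerate(visual):
        logits = [dot(v, a) / temperature for a in audio]
        # Numerically stable log-sum-exp over all audio candidates.
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching audio clip as the target.
        loss += log_sum - logits[i]
    return loss / len(visual)
```

The loss is small when each visual embedding is closest to its own audio embedding and grows when visually similar clips are matched to the wrong audio, which is the mechanism intended to separate confusable visemes.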

Abstract:
Fully Convolutional Networks with attention modules have been proven effective for learning-based image inpainting. While many existing approaches can produce visually reasonable results, the generated images often show blurry textures or distorted structures around corrupted areas. The main reason is that convolutional neural networks have limited capacity for modeling contextual information with long-range dependencies. Although the attention mechanism can alleviate this problem to some extent, existing attention modules tend to emphasize similarities between the corrupted and the uncorrupted regions while ignoring the dependencies within each of them. Hence, this paper proposes the Contextual Transformer Network (CTN), which not only learns relationships between the corrupted and the uncorrupted regions but also exploits their respective internal closeness. Besides, instead of a fully convolutional network, in our CTN, we stack several transformer blocks to replace convolution layers to better model the long-range dependencies. Finally, by dividing the image into patches of different sizes, we propose a multi-scale multi-head attention module to better model the affinity among various image regions. Experiments on several benchmark datasets demonstrate the superior performance of our proposed approach.

Abstract:
In addition to visual components, many images contain valuable text information, which is essential for understanding the scene. Thus, we study the TextVQA task, which requires reading text in images to answer corresponding questions. However, most previous works utilize sophisticated graph structures and manually crafted features to model the positional relationships between visual entities and text in images. Moreover, traditional multimodal transformers cannot effectively capture relative position information and original image features. To address these issues in an intuitive but effective way, we propose a novel model, position-augmented transformers with entity-aligned mesh, for the TextVQA task. Different from the traditional attention mechanism in transformers, we explicitly introduce continuous relative position information of objects and OCR tokens without complex rules. Furthermore, we replace the complicated graph structure with an intuitive entity-aligned mesh according to perspective mapping. In this mesh, the information of discrete entities and image patches at different positions can interact with each other. Extensive experiments on two benchmark datasets (TextVQA and ST-VQA) show that our proposed model is superior to several state-of-the-art methods.

Abstract:
In this talk, we present our experiences and applications of large-scale multi-modality pretrained models, developed at Alibaba and Ant Group. We first present a cross-modal pretraining method called M6 (Multi-Modality to Multi-Modality Multitask Mega-transformer) [1], for unified pretraining on the data of multiple modalities. We scale the model size up to 1 trillion parameters [2], and build the largest pretrained model in Chinese. We apply the model to a series of downstream applications, and demonstrate its outstanding performance in comparison with strong baselines. Furthermore, we specifically design a downstream task of text-guided image generation [3], and show that the finetuned M6 can create high-quality images with high resolution and fidelity.

Abstract:
To fully exploit the rich information in the topological structure and node features of attributed graphs, we introduce a self-supervised learning mechanism to graph representation learning and propose a novel Self-supervised Consensus Representation Learning (SCRL) framework. In contrast to most existing works that only explore one graph, our proposed SCRL method treats a graph from two perspectives: the topology graph and the feature graph. We argue that their embeddings should share some common information, which could serve as a supervisory signal. Specifically, we construct the feature graph from node features via the k-nearest neighbour algorithm. Then graph convolutional network (GCN) encoders extract features from the two graphs respectively. A self-supervised loss is designed to maximize the agreement between the embeddings of the same node in the topology graph and the feature graph. Extensive experiments on real citation networks and social networks demonstrate the superiority of our proposed SCRL over state-of-the-art methods on the semi-supervised node classification task. Meanwhile, compared with its main competitors, SCRL is rather efficient.
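The feature-graph construction step can be sketched as a simple k-nearest-neighbour graph over node feature vectors. This is only an illustration under stated assumptions: Euclidean distance and symmetrized edges are choices made here, and the paper may use a different similarity measure or edge convention.

```python
import math


def knn_graph(features, k):
    """Build a symmetric k-nearest-neighbour adjacency dict from node features.

    features: list of equal-length feature vectors, one per node.
    Returns a dict mapping each node index to the set of its neighbours.
    """
    n = len(features)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        # Sort all other nodes by Euclidean distance to node i (assumed metric).
        dists = sorted(
            (math.dist(features[i], features[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            adj[i].add(j)
            adj[j].add(i)  # symmetrize so the graph is undirected
    return adj
```

The resulting adjacency structure plays the role of the feature graph, which would then be encoded alongside the original topology graph.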

Abstract:
As a representative of multi-view clustering (MVC), the late fusion MVC (LF-MVC) algorithm has attracted intensive attention due to its superior clustering accuracy and high computational efficiency. One common assumption adopted by existing LF-MVC algorithms is that all views of each sample are available. However, it is widely observed in practice that some samples have incomplete views. In this paper, we propose One-Stage Late Fusion Incomplete Multi-view Clustering (OS-LF-IMVC) to address this issue. Specifically, we propose to unify the imputation of incomplete views and the clustering task into a single optimization procedure, so that the learning of the consensus partition matrix can directly assist the final clustering task. To solve the resultant optimization problem, we develop a five-step alternating strategy with theoretically proven convergence. Comprehensive experiments on multiple benchmark datasets demonstrate the efficiency and effectiveness of the proposed OS-LF-IMVC algorithm.

Abstract:
Incomplete multi-view clustering is an important research topic in multimedia where partial data entries of one or more views are missing. Current subspace clustering approaches mostly employ matrix factorization on the observed feature matrices to address this issue. Meanwhile, the self-representation technique is left unexplored, since it explicitly relies on full data entries to construct the coefficient matrix, which contradicts the incomplete data setting. However, it is widely observed that self-representation subspace methods enjoy better clustering performance than factorization-based ones. Therefore, we adapt self-representation to incomplete data by jointly performing data imputation and self-representation learning. To the best of our knowledge, this is the first such attempt in the incomplete multi-view clustering literature. Besides, the proposed method is carefully compared with current advances in experiments across different missing ratios, verifying its effectiveness.

Abstract:
The social experience is an important part of art exhibitions. This demo introduces an eye-gaze based generative art prototype for virtual reality (VR) art exhibitions. Our work extends the visitors' experience from individual art exploration to content co-creation. The design generates live community artworks based on all visitors' visual interactions with VR paintings. During our VR exhibition at a public gallery, over 100 visitors participated in the new creative process for community-generated artworks.

Abstract:
Nowadays, almost everyone can shoot photos using smartphones. However, not everyone can take good photos. We propose to use computational aesthetics to automatically teach people without photography training to take excellent photos. We present Aesthetic Dashboard: a system of rich aesthetic evaluation and guidance for mobile photography. We consider the two most common types of photos: landscapes and portraits. When people take photos in the preview mode, for landscapes, we show the overall aesthetic score and the scores of three basic attributes: light, composition, and color usage. Meanwhile, the matching scores between the three basic attributes of the current preview and typical templates are shown, which can help users adjust these attributes accordingly. For portraits, besides the above basic attributes, the facial appearance and guidance on face light, body pose, and garment color are also shown to the users. This is the first system that can teach mobile users to shoot good photos in the form of an aesthetic dashboard, through which users can adjust several aesthetic attributes to take good photos easily.

Abstract:
In this paper, we demonstrate Post2Story, which aims to detect events and generate storylines on microblog posts. Post2Story has several new features: (1) It proposes to employ social influence to extract events from microblogs. (2) It presents a new Event Graph Convolutional Network (E-GCN) model to learn the latent relationships among events, which can help predict the story branch of an event and link events. (3) It offers a user-friendly interface to extract and visualize the development of events. After an introduction to the system architecture and key technologies of Post2Story, we demonstrate the functionalities of Post2Story on a real dataset.

Abstract:
Human action detection is a very important yet difficult task for various multimedia applications such as safety surveillance, sports video analysis and video editing in the media industry. Most existing methods proposed for action detection are machine learning based approaches; however, preparing annotated training data for them is highly time-consuming and costly. Thus, it is still very difficult to apply these methods in industrial applications where the actions of interest, such as criminal or suspicious behaviors, rarely happen in real scenarios, because it is impossible to collect a large amount of training data for such target actions. In this paper, we abandon these conventional methods and instead adopt an on-demand retrieval approach using pose information to handle the action detection task. We introduce a demo system that can detect similar actions immediately from a sample video of a few seconds, without any training process. The system demonstrates the usability and efficacy of our on-demand approach for human action detection. The experimental results show that our approach outperforms the state-of-the-art method in precision and recall, with improvements of up to 11% and 6.1%, respectively.

Abstract:
Deep learning methods have achieved great success on semantic segmentation in recent years. But the training typically relies on large-scale fully-annotated ground truth masks, which are difficult to obtain in practice. In this research, we study the problem of reducing the annotation cost of segmentation network training with a focus on exploring the shape prior knowledge of objects. Under the context of three applications, we study three types of shape priors. Specifically, we first exploit the implicit shape prior of curve structures to propose a weakly supervised curve structure segmentation method, and then explicitly formulate the shape prior of anatomical structures as loss functions to propose a one-shot anatomical structures segmentation network. Last, we try to generalize the shape constraint to arbitrary objects to propose a class-agnostic few-shot segmentation framework. Experiment results show that our methods could achieve comparable or better performance than fully supervised segmentation methods with less annotation costs on the studied applications.

Abstract:
Image style transfer is a recently popular research field, which aims to learn the mapping between different domains and involves different computer vision techniques. Recently, Generative Adversarial Networks (GANs) have demonstrated their potential for translating images from source domain X to target domain Y in the absence of paired examples. However, such a translation cannot guarantee results of high perceptual quality. Although existing style transfer methods work well with relatively uniform content, they often fail to capture geometric or structural patterns that reflect the quality of generated images. The goal of this doctoral research is to investigate image style transfer approaches and design advanced and useful methods to solve existing problems. Through preliminary experiments conducted so far, we demonstrate our insights on image style translation approaches and present the directions to be pursued in the future.

Abstract:
Supervised learning for vision tasks has achieved great success because of the advances of deep learning research in many areas, such as high quality datasets, network architectures and regularization methods. In the vanilla deep learning paradigm, training a model for visual tasks is mainly based on the provided training images and annotations. Inspired by human learning with knowledge transfer, where information from multiple modalities is considered, we propose to improve visual tasks' performance by introducing explicit knowledge extracted from other modalities. As the first step, we propose to improve image classification performance by introducing linguistic knowledge as additional constraints in model learning. This knowledge is represented as a set of constraints to be jointly utilized with visual knowledge. To coordinate the training dynamics, we propose to imbue our model with the ability to dynamically distill from multiple knowledge sources. This is done via a model-agnostic knowledge weighting module which guides the learning process and is updated via meta-steps during training. Preliminary experiments on various benchmark datasets validate the efficacy of our method. Our code will be made publicly available to ensure reproducibility.

Abstract:
Contrastive-based self-supervised learning for image representations has significantly closed the gap with supervised learning. A natural extension of image-based contrastive learning methods to the video domain is to fully exploit the temporal structure present in videos. We propose a novel contrastive self-supervised video representation learning framework, termed Graph Contrastive Augmentation (GCA), which constructs a video temporal graph and devises a graph augmentation designed to enhance the correlation across video frames, providing a new view for exploring temporal structure in videos. Specifically, we construct the temporal graph in the video by leveraging the relational knowledge behind the correlated sequence of video features. Afterwards, we apply the proposed graph augmentation to generate another graph view by incorporating random corruption of the original graph, which enhances the diversity of the intrinsic structure of the temporal graph. To this end, we provide two different kinds of contrastive learning methods to train our framework, using the temporal relationships concealed in videos as self-supervised signals. We perform empirical experiments on two downstream tasks, action recognition and video retrieval, using the learned video representation, and the results demonstrate that, with the graph view of temporal structure, our proposed GCA achieves performance that is better than or on par with recent methods.
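
The abstract does not say what form the random corruption of the temporal graph takes. As a minimal sketch, assuming the simplest variant (dropping each edge independently with a fixed probability), a second graph view could be produced as follows. The function name `drop_edges`, the edge-list representation, and the drop probability are illustrative assumptions, not the authors' implementation.

```python
import random
from typing import List, Optional, Tuple

Edge = Tuple[int, int]  # (frame_i, frame_j): hypothetical temporal-graph edge


def drop_edges(edges: List[Edge], drop_prob: float,
               seed: Optional[int] = None) -> List[Edge]:
    """Return a corrupted view of the graph by randomly removing edges.

    Each edge is kept independently with probability 1 - drop_prob.
    Edge dropping is an assumed corruption; the paper may use another one.
    """
    if not 0.0 <= drop_prob <= 1.0:
        raise ValueError("drop_prob must lie in [0, 1]")
    rng = random.Random(seed)
    return [e for e in edges if rng.random() >= drop_prob]


# A toy temporal graph over five frames: a chain plus one skip connection.
original = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2)]
augmented = drop_edges(original, drop_prob=0.3, seed=0)
print(len(original), len(augmented))
```

The original graph and the corrupted copy would then serve as the two views fed to a contrastive objective.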

Abstract:
Social media have become a popular platform for brands to allocate marketing budget and build their relationship with customers. Posting images with a consistent concept on social media helps customers recognize, remember, and consider brands. This strategy is known as brand concept consistency in marketing literature. Consequently, brands spend immense manpower and financial resources in choosing which images to post or repost. Therefore, automatically recommending images with a consistent brand concept is a necessary task for social media marketing. In this paper, we propose a content-based recommendation system that learns the concept of brands and recommends images that are coherent with the brand. Specifically, brand representation is performed from the brand posts on social media. Existing methods rely on visual features extracted by pre-trained neural networks, which can represent objects in the image but not the style of the image. To bridge this gap, a framework using both object and style vectors as input is proposed to learn the brand representation. In addition, we show that the proposed method can not only be applied to brands but also be applied to influencers. We collected a new Instagram influencer dataset, consisting of 616 influencers and about 1 million images, which can greatly benefit future research in this area. The experimental results on two large-scale Instagram datasets show the superiority of the proposed method over state-of-the-art methods.

Abstract:
Existing Few-Shot Learning (FSL) methods predominantly focus on developing different types of sophisticated models to extract the transferable prior knowledge for recognizing novel classes, while paying less attention to the feature learning part of FSL, which often simply leverages some well-known CNN as the feature learner. However, the feature is the core medium for encoding such transferable knowledge. Feature learning is easily trapped in over-fitting, particularly when training data is scarce, which degrades the performance of FSL. Handcrafted features, such as Histogram of Oriented Gradient (HOG) and Local Binary Pattern (LBP), place no requirement on the amount of training data and used to perform quite well in many small-scale data scenarios, since their extraction involves no learning process and is mainly based on empirically observed and summarized prior feature engineering knowledge. In this paper, we intend to develop a general and simple approach for boosting FSL by exploiting such prior knowledge in the feature learning phase. To this end, we introduce two novel handcrafted feature regression modules, namely HOG and LBP regression, into the feature learning parts of deep learning-based FSL models. These two modules are separately plugged into different convolutional layers of the backbone, based on the characteristics of the corresponding handcrafted features, to guide the backbone optimization at different feature granularities. They also ensure that the learned feature encodes the handcrafted feature knowledge, which improves the generalization ability of the feature and alleviates the over-fitting of the models. Three recent state-of-the-art FSL approaches are leveraged to examine the effectiveness of our method. Extensive experiments on the miniImageNet, CIFAR-FS and FC100 datasets show that the performance of all these FSL approaches is well boosted by applying our method on all three datasets. Our codes and models have been released.

Abstract:
Image-based Vehicle ReID methods have suffered from limited information caused by viewpoints, illumination, and occlusion, as they usually use a single image as input. Graph convolutional network (GCN) methods can alleviate the aforementioned problem by aggregating neighbor samples' information to enhance the feature representation. However, the inference process of GCN-based methods is computationally expensive, since they need to iterate over all samples to search for the neighbor nodes. In this paper, we propose the first Pseudo-GCN Vehicle ReID method (PGVR), which enables a CNN-based module to perform competitively with GCN-based methods while having a faster and more lightweight inference process. To enable the Pseudo-GCN mechanism, a two-branch network and a graph-based knowledge distillation are proposed. The two-branch network consists of a CNN-based student branch and a GCN-based teacher branch. The GCN-based teacher branch adopts a ReID-based GCN to learn the topological optimization ability under the supervision of ReID tasks during training time. Moreover, the graph-based knowledge distillation explicitly transfers the topological optimization ability from the teacher branch to the student branch which acknowledges all nodes. We evaluate our proposed method PGVR on three mainstream Vehicle ReID benchmarks and demonstrate that PGVR achieves state-of-the-art performance.

Abstract:
Sign Language Production (SLP) aims to automatically translate a spoken language description to its corresponding sign language video. The core procedure of SLP is to transform sign gloss intermediaries into sign pose sequences (G2P). Most existing methods for G2P are based on sequential autoregression or sequence-to-sequence encoder-decoder learning. However, by generating target pose frames conditioned on the previously generated ones, these models are prone to issues such as error accumulation and high inference latency. In this paper, we argue that such issues are mainly caused by adopting an autoregressive manner. Hence, we propose a novel Non-AuToregressive (NAT) model with a parallel decoding scheme, as well as an External Aligner for sequence alignment learning. Specifically, we extract alignments from the external aligner by monotonic alignment search for gloss duration prediction, which is used by a length regulator to expand the source gloss sequence to match the length of the target sign pose sequence for parallel sign pose generation. Furthermore, we devise a spatial-temporal graph convolutional pose generator in the NAT model to generate smoother and more natural sign pose sequences. Extensive experiments conducted on the PHOENIX14T dataset show that our proposed model outperforms state-of-the-art autoregressive models in terms of speed and quality.
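
The length-regulation step described above can be illustrated with a minimal sketch. It assumes the durations are already available as integer frame counts per gloss (in the paper they are derived from the external aligner); the function name `length_regulate` and the toy glosses are hypothetical and not taken from the authors' code.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def length_regulate(glosses: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Repeat each gloss representation by its duration in frames.

    The expanded sequence has one entry per target pose frame, so all
    frames can be decoded in parallel rather than one after another.
    """
    if len(glosses) != len(durations):
        raise ValueError("need exactly one duration per gloss")
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")
    expanded: List[T] = []
    for gloss, duration in zip(glosses, durations):
        expanded.extend([gloss] * duration)
    return expanded


# Three glosses with assumed durations of 2, 3 and 1 frames.
frames = length_regulate(["HELLO", "WORLD", "TODAY"], [2, 3, 1])
print(frames)
```

In a real model the list entries would be gloss embeddings rather than strings, but the expansion logic is the same.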

Abstract:
Compositional action recognition is a novel challenge in the computer vision community and focuses on revealing the different combinations of verbs and nouns instead of treating subject-object interactions in videos as individual instances only. Existing methods tackle this challenging task by simply ignoring appearance information or fusing object appearances with dynamic instance tracklets. However, those strategies usually do not perform well for unseen action instances. To that end, we propose a novel learning framework called Counterfactual Debiasing Network (CDN) to improve the model generalization ability by removing the interference introduced by visual appearances of objects/subjects. It explicitly learns the appearance information in action representations and later removes the effect of such information in a causal inference manner. Specifically, we use tracklets and video content to model the factual inference by considering both appearance information and structure information. In contrast, only video content with appearance information is leveraged in the counterfactual inference. With the two inferences, we construct a causal graph which captures and removes the bias introduced by the appearance information by subtracting the result of the counterfactual inference from that of the factual inference. By doing that, our proposed CDN method can better recognize unseen action instances by debiasing the effect of appearances. Extensive experiments on the Something-Else dataset clearly show the effectiveness of our proposed CDN over existing state-of-the-art methods.
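
The subtraction of the counterfactual result from the factual one can be illustrated on per-class scores. This is a minimal sketch, assuming both inference branches output one logit per action class; the function names and the toy numbers are illustrative, and the paper may combine the two branches differently in detail.

```python
from typing import List


def debias_logits(factual: List[float], counterfactual: List[float]) -> List[float]:
    """Subtract appearance-only (counterfactual) scores from factual scores."""
    if len(factual) != len(counterfactual):
        raise ValueError("both branches must score the same classes")
    return [f - c for f, c in zip(factual, counterfactual)]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Toy scores for three action classes.
factual = [2.0, 1.5, 0.5]          # appearance + structure
counterfactual = [1.8, 0.2, 0.1]   # appearance only
debiased = debias_logits(factual, counterfactual)

print(argmax(factual), argmax(debiased))
```

In this toy case class 0 wins the factual branch mainly because of appearance; after removing the appearance-only score, class 1 is ranked first.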

Abstract:
Self-supervised learning (SSL) has recently become the favorite among feature learning methodologies. It is therefore appealing for domain adaptation approaches to consider incorporating SSL. The intuition is to enforce instance-level feature consistency such that the predictor becomes somewhat invariant across domains. However, most existing SSL methods in the regime of domain adaptation are usually treated as standalone auxiliary components, leaving the signatures of domain adaptation unattended. Actually, the optimal region where the domain gap vanishes and the instance-level constraint that SSL pursues may not coincide at all. From this point, we present a particular paradigm of self-supervised learning tailored for domain adaptation, i.e., Transferrable Contrastive Learning (TCL), which links the SSL and the desired cross-domain transferability congruently. We find contrastive learning intrinsically a suitable candidate for domain adaptation, as its instance invariance assumption can be conveniently promoted to cross-domain class-level invariance favored by domain adaptation tasks. Based on particular memory bank constructions and pseudo label strategies, TCL then penalizes cross-domain intra-class domain discrepancy between source and target through a clean and novel contrastive loss. The free lunch is that, thanks to the incorporation of contrastive learning, TCL relies on a moving-averaged key encoder that naturally achieves a temporally ensembled version of pseudo labels for target data, which avoids pseudo label error propagation at no extra cost. TCL therefore efficiently reduces cross-domain gaps. Through extensive experiments on benchmarks (Office-Home, VisDA-2017, Digits-five, PACS and DomainNet) for both single-source and multi-source domain adaptation tasks, TCL has demonstrated state-of-the-art performance.
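
A moving-averaged key encoder is commonly implemented as an exponential moving average of the query encoder's parameters, as in momentum-contrast methods. The sketch below assumes that form; the function name `momentum_update`, the flat parameter lists, and the momentum value are illustrative assumptions rather than details given in the abstract.

```python
from typing import List


def momentum_update(key_params: List[float], query_params: List[float],
                    momentum: float = 0.999) -> List[float]:
    """Exponential moving average: key <- m * key + (1 - m) * query."""
    if len(key_params) != len(query_params):
        raise ValueError("encoders must have the same number of parameters")
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    return [momentum * k + (1.0 - momentum) * q
            for k, q in zip(key_params, query_params)]


# Toy parameters; a momentum of 0.9 is used so the change is visible.
key = [1.0, 0.0]
query = [0.0, 1.0]
key = momentum_update(key, query, momentum=0.9)
print(key)
```

Because the key encoder changes slowly, its predictions on target data behave like an average over recent training steps, which is the temporal ensembling effect the abstract refers to.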

Abstract:
Action detection plays an important role in high-level video understanding and media interpretation. Many existing studies fulfill this spatio-temporal localization by modeling the context, capturing the relationship of actors, objects, and scenes conveyed in the video. However, they often universally treat all the actors without considering the consistency and distinctness between individuals, leaving much room for improvement. In this paper, we explicitly highlight the identity information of the actors in terms of both long-term and short-term context through a graph memory network, namely identity-aware graph memory network (IGMN). Specifically, we propose the hierarchical graph neural network (HGNN) to comprehensively conduct long-term relation modeling within the same identity as well as between different ones. Regarding short-term context, we develop a dual attention module (DAM) to generate identity-aware constraint to reduce the influence of interference by the actors of different identities. Extensive experiments on the challenging AVA dataset demonstrate the effectiveness of our method, which achieves state-of-the-art results on AVA v2.1 and v2.2.

Abstract:
The rapid development of facial manipulation techniques has aroused public concerns in recent years. Following the success of deep learning, existing methods always formulate DeepFake video detection as a binary classification problem and develop frame-based and video-based solutions. However, little attention has been paid to capturing the spatial-temporal inconsistency in forged videos. To address this issue, we term this task a Spatial-Temporal Inconsistency Learning (STIL) process and instantiate it as a novel STIL block, which consists of a Spatial Inconsistency Module (SIM), a Temporal Inconsistency Module (TIM), and an Information Supplement Module (ISM). Specifically, we present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along both horizontal and vertical directions. The ISM simultaneously utilizes the spatial information from SIM and the temporal information from TIM to establish a more comprehensive spatial-temporal representation. Moreover, our STIL block is flexible and can be plugged into existing 2D CNNs. Extensive experiments and visualizations are presented to demonstrate the effectiveness of our method against state-of-the-art competitors.
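
The basic signal behind the temporal module, the difference between adjacent frames, can be sketched in a few lines. This simplified example treats each frame as a flat list of feature values and omits the horizontal and vertical processing that the abstract mentions; the function name `temporal_difference` is hypothetical.

```python
from typing import List


def temporal_difference(frames: List[List[float]]) -> List[List[float]]:
    """Element-wise difference between each frame and its predecessor.

    For T input frames the result has T - 1 difference maps.
    """
    if any(len(f) != len(frames[0]) for f in frames):
        raise ValueError("all frames must have the same size")
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])]


# Three toy frames with two feature values each.
frames = [[0.0, 1.0], [2.0, 1.0], [2.0, 4.0]]
diffs = temporal_difference(frames)
print(diffs)
```

Large values in a difference map mark positions that change abruptly between consecutive frames, which is the kind of temporal inconsistency a detector could look for.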

Abstract:
We introduce Imuge, an image tamper resilient generative scheme for image self-recovery. Traditional approaches to concealing image content within the image are inflexible and fragile under diverse digital attacks, e.g., image cropping and JPEG compression. To address this issue, we jointly train a U-Net backboned encoder, a tamper localization network and a decoder for image recovery. Given an original image, the encoder produces a visually indistinguishable immunized image. At the recipient's side, the verifying network localizes the malicious modifications, and the original content can be approximately recovered by the decoder, despite the presence of the attacks. Several strategies are proposed to boost the training efficiency. We demonstrate that our method can recover the details of the tampered regions with high quality despite the presence of various kinds of attacks. Comprehensive ablation studies are conducted to validate our network designs.

Abstract:
The core problem of video visual relation detection (VidVRD) lies in accurately classifying the relation triplets, which comprise the classes of subject and object entities and the predicate classes of various relationships between them. Existing VidVRD approaches classify these three relation components in either an independent or a cascaded manner, and thus fail to fully exploit the inter-dependency among them. In order to utilize this inter-dependency in tackling the challenges of visual relation recognition in videos, we propose a novel iterative relation inference approach for VidVRD. We derive our model from the viewpoint of joint relation classification, which is light-weight yet effective, and propose a training approach to better learn the dependency knowledge from the likely correct triplet combinations. As such, the proposed inference approach is able to gradually refine each component based on its learnt dependency and the other two's predictions. Our ablation studies show that this iterative relation inference can empirically converge in a few steps and consistently boost the performance over baselines. Further, we incorporate it into a newly designed VidVRD architecture, named VidVRD-II (Iterative Inference), which generalizes well across different datasets. Experiments show that VidVRD-II achieves state-of-the-art performance on both the ImageNet-VidVRD and VidOR benchmark datasets.

Abstract:
ICECUBE LED Display [ILDm^3] is a cubic-meter, 1/1000th scale model of the IceCube Neutrino Observatory, a novel telescope that looks for nearly invisible cosmic messengers, neutrinos, using a cubic-kilometer of instrumented ice starting 1450 meters below the surface at the South Pole. The display uses art methodologies as a means for expressing imperceptible astrophysical events as sound, light and colour in the domain of the human sensorium. The experience is as aesthetically critical as it is facilitatory to an intuitive understanding of subatomic astrophysical data, leading to new ways of knowing about our Universe and its processes.

Abstract:
Music and dance have always co-existed as pillars of human activities, contributing immensely to the cultural, social, and entertainment functions in virtually all societies. Notwithstanding the gradual systematization of music and dance into two independent disciplines, their intimate connection is undeniable and one art-form often appears incomplete without the other. Recent research works have studied generative models for dance sequences conditioned on music. The dual task of composing music for given dances, however, has been largely overlooked. In this paper, we propose a novel extension, where we jointly model both tasks in a dual learning approach. To leverage the duality of the two modalities, we introduce an optimal transport objective to align feature embeddings, as well as a cycle consistency loss to foster overall consistency. Experimental results demonstrate that our dual learning framework improves individual task performance, delivering generated music compositions and dance choreographs that are realistic and faithful to the conditioned inputs.

Abstract:
We have seen a dramatic increase in the adoption of teleconferencing systems such as Zoom for remote teaching and working. Although designed primarily for traditional video conferencing scenarios, these platforms are actually being deployed in many diverse contexts. As such, Zoom offers little to aid hosts' understanding of attendee participation and often hinders participant agency. We introduce ZoomSense: an open-source, scalable infrastructure built upon 'virtual meeting participants', which exposes real-time meta-data, meeting content and host controls through an easy-to-use abstraction, so that developers can rapidly and sustainably augment Zoom.

Abstract:
We propose an approach that enhances arbitrary existing cross-modal image retrieval performance. Most of the cross-modal image retrieval methods mainly focus on direct computation of similarities between a text query and candidate images in an accurate way. However, their retrieval performance is affected by the ambiguity of text queries and the bias of target databases (DBs). Dealing with ambiguous text queries and DBs with bias will lead to accurate cross-modal image retrieval in real-world applications. A DB-adaptive re-ranking method using modality-driven spaces, which can extend arbitrary cross-modal image retrieval methods for enhancing their performance, is proposed in this paper. The proposed method includes two approaches: "DB-adaptive re-ranking" and "modality-driven clue information extraction". Our method estimates clue information that can effectively clarify the desired image from the whole set of a target DB and then receives user's feedback for the estimated information. Furthermore, our method extracts more detailed information of a query text and a target DB by focusing on modality-driven spaces, and it enables more accurate re-ranking. Our method allows users to reach their desired single image by just answering questions. Experimental results using MSCOCO, Visual Genome and newly introduced datasets including images with a particular object show that the proposed method can enhance the performance of state-of-the-art cross-modal image retrieval methods.

Abstract:
Despite the recent progress of cross-modal text-to-video retrieval techniques, their performance is still unsatisfactory. Most existing works follow a trend of learning a joint embedding space to measure the distance between global-level or local-level textual and video representation. The fine-grained interactions between video segments and phrases are usually neglected in cross-modal learning, which results in suboptimal retrieval performances. To tackle the problem, we propose a novel Fine-grained Cross-modal Alignment Network (FCA-Net), which considers the interactions between visual semantic units (i.e., sub-actions/sub-events) in videos and phrases in sentences for cross-modal alignment. Specifically, the interactions between visual semantic units and phrases are formulated as a link prediction problem optimized by a graph auto-encoder to obtain the explicit relations between them and enhance the aligned feature representation for fine-grained cross-modal alignment. Experimental results on MSR-VTT, YouCook2, and VATEX datasets demonstrate the superiority of our model as compared to the state-of-the-art method.

Abstract:
Music recommendation has been one of the most used information retrieval services on the internet. Finding suitable music for users' demands from tens of millions of music tracks relies on the understanding of music content. Traditional studies usually focus on music representation based on massive user behavioral data and music meta-data, which ignores the audio characteristics of music. However, it is found that the melodic characteristics of music themselves can be further used to understand music. Moreover, how to utilize large-scale audio data to learn music representation is not well explored. To this end, we propose a self-supervised learning model for music representation. We first utilize a beat-level music pre-training model to learn the structure of music. Then, we use a multi-task learning framework to model music self-representation and co-relations between music concurrently. Besides, we propose several downstream tasks to evaluate music representation, including music genre classification, music highlight, and music similarity retrieval. Extensive experiments on multiple music datasets demonstrate our model's superiority over baselines on learning music representation.

Abstract:
In an intelligent society, image compression needs to serve both human vision and machine vision. Traditional image compression schemes only consider visual quality for humans. In addition, the bitstream needs to be fully decoded to images before performing semantic analysis (e.g., by deep neural networks). These two factors make traditional image compression schemes semantically inefficient. To better serve the needs of both human vision and machine vision, it is more reasonable to compress and transmit image signals and features simultaneously. In this paper, we propose a novel end-to-end semantic scalable image compression method, which progressively compresses coarse-grained semantic features, fine-grained semantic features, and image signals. To utilize the cross-layer correlation between features and image signals, we propose a cross-layer context model to reduce the information redundancy, which takes higher-layer features as cross-layer priors to predict the probability distribution parameters for the entropy model of lower-layer features or images. Furthermore, we adopt a Region of Interest (ROI) compression scheme. The objects with rich semantic information and the background are compressed separately, to further improve the compression efficiency. Experimental results on the CUB-200-2011 and FGVC-Aircraft datasets demonstrate the effectiveness of our proposed scheme compared to separate compression of image signals and features.

Abstract:
In this paper, we propose an end-to-end open set face anti-spoofing (OSFA) approach for unseen attack recognition. Previous domain generalization approaches aim to align multiple domains beyond one common subspace, leading to performance degradation due to the discrepancy of different domains. To address this issue, our approach formulates face anti-spoofing (FAS) in an open set recognition framework, which learns a compact representation for each known class in parallel with recognizing unseen attack examples. To this end, we incorporate statistical extreme value theory into our objective under a multi-task framework. Moreover, we develop an identity-aware contrastive learning method, which prevents unseen attack examples from being confused with hard examples. Experimental results on four datasets demonstrate the robustness of our proposed OSFA, especially under diverse categories of unseen attacks.

Abstract:
This paper proposes an attention-guided feature disentangling framework (AgFD) to eliminate the large cross-modality discrepancy for Heterogeneous Face Recognition (HFR). Existing HFR methods either focus only on extracting identity features or impose linear/no independence constraints on the decomposed components. Instead, our AgFD disentangles the facial representation and forces intrinsic independence between identity features and identity-irrelevant variations. To this end, an Attention-based Residual Decomposition Module (AbRDM) and an Adversarial Decorrelation Module (ADM) are presented. AbRDM provides hierarchical complementary feature disentanglement, while ADM is introduced for decorrelation learning. Extensive experiments on the challenging CASIA NIR-VIS 2.0 Database, Oulu-CASIA NIR&VIS Database, BUAA-VisNir Database, and IIIT-D Viewed Sketch Database demonstrate the generalization ability and competitive performance of the proposed method.

Abstract:
Few-shot learning (FSL) aims to address the data-scarce problem. A standard FSL framework is composed of two components: (1) Pre-train. Employ the base data to generate a CNN-based feature extraction model (FEM). (2) Meta-test. Apply the trained FEM to acquire the novel data's features and recognize them. FSL relies heavily on the design of the FEM. However, various FEMs have distinct emphases. For example, several may focus more attention on the contour information, whereas others may lay particular emphasis on the texture information. The single-head feature is only a one-sided representation of the sample. Besides the negative influence of cross-domain (e.g., the trained FEM cannot adapt to the novel class flawlessly), the distribution of novel data may have a certain degree of deviation compared with the ground truth distribution, which is dubbed the distribution-shift-problem (DSP). To address the DSP, we propose the Multi-Head Feature Collaboration (MHFC) algorithm, which attempts to project the multi-head features (e.g., multiple features extracted from a variety of FEMs) to a unified space and fuse them to capture more discriminative information. Specifically, we first introduce a subspace learning method to transform the multi-head features to aligned low-dimensional representations. It corrects the DSP via learning the feature with more powerful discrimination and overcomes the problem of inconsistent measurement scales from different head features. Then, we design an attention block to update combination weights for each head feature automatically. It comprehensively considers the contribution of various perspectives and further improves the discrimination of features. We evaluate the proposed method on five benchmark datasets (including cross-domain experiments) and achieve significant improvements of 2.1%-7.8% compared with state-of-the-art methods.
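
The idea of combining head features with attention-style weights can be sketched as a softmax-weighted sum. This is a minimal illustration, assuming the head features have already been projected to a common dimension and that one relevance score per head is given; in the paper the weights are learned by the attention block, and the names `softmax` and `fuse_heads` are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_heads(features: List[List[float]], scores: List[float]) -> List[float]:
    """Weighted sum of aligned head features, with softmax(scores) as weights."""
    if len(features) != len(scores):
        raise ValueError("need one score per head feature")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("head features must share one dimension")
    weights = softmax(scores)
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]


# Two aligned head features with equal scores: the result is their mean.
heads = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_heads(heads, scores=[0.0, 0.0])
print(fused)
```

Raising the score of one head shifts the fused feature toward that head, which is how an attention block can emphasize the more discriminative view of a sample.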

Abstract:
Traditional image/video compression aims to reduce the transmission/storage cost with signal fidelity as high as possible. However, with the increasing demand for machine analysis and semantic monitoring in recent years, semantic fidelity rather than signal fidelity is becoming another emerging concern in image/video compression. Building on recent advances in cross-modal translation and generation, we propose cross modal compression (CMC), a semantic compression framework for visual data that transforms highly redundant visual data (such as images and videos) into a compact, human-comprehensible domain (such as text, sketches, semantic maps, and attributes) while preserving the semantics. Specifically, we first formulate the CMC problem as a rate-distortion optimization problem. Secondly, we investigate the relationship with traditional image/video compression and the recent feature compression frameworks, showing the difference between our CMC and these prior frameworks. Then we propose a novel paradigm for CMC to demonstrate its effectiveness. The qualitative and quantitative results show that our proposed CMC can achieve encouraging reconstructed results with an ultrahigh compression ratio, showing better compression performance than the widely used JPEG baseline.
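
For orientation, a rate-distortion problem is commonly written in the generic Lagrangian form below. All symbols are assumptions introduced for illustration and are not the paper's notation: $E$ maps the visual data $x$ to the compact representation, $G$ reconstructs from it, $R$ is the bit cost of the representation, $D_s$ is a semantic distortion measure, and $\lambda$ trades one against the other.

```latex
\min_{E,\,G} \; R\big(E(x)\big) + \lambda \, D_s\big(x,\, G(E(x))\big), \qquad \lambda > 0
```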

Abstract:
Automatically describing video, or video captioning, has been widely studied in the multimedia field. This paper proposes a new task of sensor-augmented egocentric-video captioning, a newly constructed dataset for it called MMAC Captions, and a method for the newly proposed task that effectively utilizes multi-modal data of video and motion sensors, or inertial measurement units (IMUs). While conventional video captioning tasks have difficulty in dealing with detailed descriptions of human activities due to the limited view of a fixed camera, egocentric vision has greater potential to be used for generating the finer-grained descriptions of human activities on the basis of a much closer view. In addition, we utilize wearable-sensor data as auxiliary information to mitigate the inherent problems in egocentric vision: motion blur, self-occlusion, and out-of-camera-range activities. We propose a method for effectively utilizing the sensor data in combination with the video data on the basis of an attention mechanism that dynamically determines the modality that requires more attention, taking the contextual information into account. We compared the proposed sensor-fusion method with strong baselines on the MMAC Captions dataset and found that using sensor data as supplementary information to the egocentric-video data was beneficial, and that our proposed method outperformed the strong baselines, demonstrating the effectiveness of the proposed method.

Abstract:
With the continuous exploration of marine resources, underwater artificial intelligent robots play an increasingly important role in the fish industry. However, the detection of underwater objects is a very challenging problem due to the irregular movement of underwater objects, the occlusion of sand and rocks, the diversity of water illumination, and the poor visibility and low color contrast in the underwater environment. In this article, we first propose a real-world underwater object detection dataset (UODD), which covers more than 3K images of the most common aquatic products. Then we propose Channel Sharpening Attention Module (CSAM) as a plug-and-play module to further fuse high-level image information, providing the network with the privilege of selecting feature maps. Fusion of original images through CSAM can improve the accuracy of detecting small and medium objects, thereby improving the overall detection accuracy. We also use Water-Net as a preprocessing method to remove the haze and color cast in complex underwater scenes, which shows a satisfactory detection result on small-sized objects. In addition, we use the class weighted loss as the training loss, which can accurately describe the relationship between classification and precision of bounding boxes of targets, and the loss function converges faster during the training process. Experimental results show that the proposed method reaches a maximum AP of 50.1%, outperforming other traditional and state-of-the-art detectors. In addition, our model only needs an average inference time of 25.4 ms per image, which is quite fast and might suit the real-time scenario.

Abstract:
Training machine learning (ML) models is expensive in terms of computational power, amounts of labeled data and human expertise. Thus, ML models constitute business value for their owners. Embedding digital watermarks during model training allows a model owner to later identify their models in case of theft or misuse. However, model functionality can also be stolen via model extraction, where an adversary trains a surrogate model using results returned from a prediction API of the original model. Recent work has shown that model extraction is a realistic threat. Existing watermarking schemes are ineffective against model extraction since it is the adversary who trains the surrogate model. In this paper, we introduce DAWN (Dynamic Adversarial Watermarking of Neural Networks), the first approach to use watermarking to deter model extraction theft. Unlike prior watermarking schemes, DAWN does not impose changes to the training process but operates at the prediction API of the protected model, by dynamically changing the responses for a small subset of queries (e.g., 0.5%) from API clients. This set is a watermark that will be embedded in case a client uses its queries to train a surrogate model. We show that DAWN is resilient against two state-of-the-art model extraction attacks, effectively watermarking all extracted surrogate models, allowing model owners to reliably demonstrate ownership (with confidence greater than 1 - 2^{-64}), incurring negligible loss of prediction accuracy (0.03-0.5%).
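
The idea of altering a small, fixed subset of API responses can be sketched as follows. This is a minimal illustration under stated assumptions, not DAWN's actual implementation: it assumes a keyed hash (HMAC-SHA256) decides which queries are altered, and the function names, the secret key, and the rule for picking the altered label are all hypothetical.

```python
import hashlib
import hmac

def is_watermarked(query: bytes, key: bytes, rate: float = 0.005) -> bool:
    """Deterministically select roughly a `rate` fraction of queries with a keyed hash."""
    digest = hmac.new(key, query, hashlib.sha256).digest()
    value = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return value < int(rate * 2**64)

def answer(query: bytes, true_label: int, num_classes: int,
           key: bytes, rate: float = 0.005) -> int:
    """Return the model's label, or a different label for watermarked queries."""
    if not is_watermarked(query, key, rate):
        return true_label
    # Derive a non-zero offset from the query so the altered label is
    # reproducible and never equal to the true label (hypothetical rule).
    digest = hmac.new(key, b"label" + query, hashlib.sha256).digest()
    offset = 1 + int.from_bytes(digest[:8], "big") % (num_classes - 1)
    return (true_label + offset) % num_classes
```

Because the decision depends only on the query and a secret key, the same query always receives the same answer, so a client cannot detect altered responses simply by repeating a request.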

Abstract:
Identifying the language of the text in scene images is crucial for various applications. Studies that focus on identifying the script, which is a set of letters used for writing in a given language, in scene text images already exist. However, these works do not distinguish between different languages written in the same script and are thus unable to meet the needs of many applications. To address this challenge, we study a novel task: fine-grained language identification in scene text images, which aims to distinguish languages that share the same script. The datasets that include samples in seven languages, which are Dutch, English, French, Italian, German, Spanish, and Portuguese, are constructed. Furthermore, well-designed end-to-end trainable neural networks are proposed for fine-grained language identification, where semantic information concerning the text is mined and utilized to assist the language identification. We train the networks on the synthetic dataset and evaluate them with the collected real dataset. The experimental results demonstrate that the proposed frameworks are effective.

Abstract:
The rising popularity of Artificial Intelligence (AI) has brought considerable public interest as well as faster and more direct transfer of research ideas into practice. One of the aspects of AI that still trails behind considerably is the role of machines in interpreting, enhancing, modeling, generating, and influencing social behavior. Such behavior is captured as social signals, usually by sensors recording multiple modalities, making it classic multimedia data. Such behavior can also be generated by an AI system when interacting with humans. AI techniques in combination with multimedia data can be used to pursue multiple goals, two of which are highlighted here. First, supporting people during social interactions and helping them to fulfil their social needs either actively or passively. Second, improving our understanding of how people collaborate, build relationships, and process self identity. Despite the rise of fields such as Social Signal Processing, a similar panel organised at ACM Multimedia 2014, and an area on social and emotional signals at ACM MM since 2014, we argue that we have yet to truly fulfil the potential of combining social signals and multimedia. This panel asks where we have come far enough and what remaining challenges there are in light of recent global events.

Abstract:
Multi-modal Ads Video Understanding Challenge is the first grand challenge aiming to comprehensively understand ads videos. Our challenge includes two tasks: video structuring and multi-label classification. Video structuring asks the participants to accurately predict both the scene boundaries and the multi-label categories of each scene based on a fine-grained and ads-related category hierarchy. This task will advance the foundation of comprehensive ads video understanding, which has a significant impact on many applications in ads, such as video recommendation and user behavior analysis. This paper presents an overview of the video structuring task in our grand challenge, including the background of ads videos, an elaborate description of this task, our proposed dataset, the evaluation protocol, and our baseline model. By ablating the key components of our baseline, we would like to reveal the main challenges of this task and provide useful guidance for future research of this area.

Abstract:
Shot boundary detection (SBD) plays an important role in video understanding, since most recent works take the shot as minimal granularity instead of frames for upstream tasks. However, the large variations of hard-cut and gradual-change transitions within shots significantly limit the performance of SBD. To deal with the variations, we propose a multi-task architecture called Transnet++. Transnet++ disentangles the two types of transition and adopts two separate branches to predict them respectively. The two branches share the same video knowledge space and their results are fused for the final prediction. Moreover, we propose a spatial attention module (SAM) to enhance the feature representations, which suffer from redundant padding regions. Meanwhile, a temporal attention module (TAM) is applied to capture the long-term information of the video and alleviate the over-segmentation problem. Experimental results (91.16% F1-score) on the Tencent AVS Dataset demonstrate the effectiveness and superiority of Transnet++ for SBD.

Abstract:
Video advertising is one of the most effective forms of advertisement because videos are more attractive, more persuasive, and more informative than images or texts. The increasing volume of video advertisements requires faster and more intelligent technologies for video generation. We have developed a multi-modal video editing approach that can automatically generate advertising video clips from any source videos. We conduct chronological boundary segmentation of the original video and construct a weighted directed graph to assemble different segments. Experiments on our video editing datasets validate the success of the proposed method in producing compelling and consistent advertising videos.

Abstract:
There is a growing trend in placing video advertisements on social platforms for online marketing, which demands automatic approaches to understand the contents of advertisements effectively. Taking the 2021 TAAC competition as an opportunity, we developed a multimodal system to improve the ability of structured analysis of advertising video content. In our framework, we break down the video structuring analysis problem into two tasks, i.e., scene segmentation and multi-modal tagging. In scene segmentation, we build upon a temporal convolution module for temporal modeling to predict whether adjacent frames belong to the same scene. In multi-modal tagging, we first compute clip-level visual features by aggregating frame-level features with NeXt-SoftDBoF. The visual features are further complemented with textual features that are derived using a global-local attention mechanism to extract useful information from OCR (Optical Character Recognition) and ASR (Audio Speech Recognition) outputs. Our solution achieved a score of 0.2470 measured in consideration of localization and prediction accuracy, ranking fourth in the 2021 TAAC final leaderboard.

Abstract:
Visual storytelling aims to automatically generate a human-like short story given an image stream. Most existing works utilize either scene-level or object-level representations, neglecting the interaction among objects in each image and the sequential dependency between consecutive images. In this paper, we present a novel Latent Memory-augmented Graph Transformer (LMGT), a Transformer based framework for visual story generation. LMGT directly inherits the merits from the Transformer, which is further enhanced with two carefully designed components, i.e., a graph encoding module and a latent memory unit. Specifically, the graph encoding module exploits the semantic relationships among image regions and attentively aggregates critical visual features based on the parsed scene graphs. Furthermore, to better preserve inter-sentence coherence and topic consistency, we introduce an augmented latent memory unit that learns and records highly summarized latent information as the story line from the image stream and the sentence history. Experimental results on three widely-used datasets demonstrate the superior performance of LMGT over the state-of-the-art methods.

Abstract:
Temporal action detection aims to locate specific segments of action instances in an untrimmed video. Most existing approaches extract the features of all candidate video segments and then classify them separately. However, they may overlook the underlying relationship among candidates. In this paper, we propose a novel model termed Candidate-Aware Aggregation (CAA) to tackle this problem. In CAA, we design the Global Awareness (GA) module to exploit long-range relations among all candidates from a global perspective, which enhances the features of action instances. The GA module is then embedded into a multi-level hierarchical network named FENet, to aggregate local features in adjacent candidates to suppress background noise. As a result, the relationship among candidates is explicitly captured from both local and global perspectives, which ensures more accurate prediction results for the candidates. Extensive experiments conducted on two popular benchmarks, ActivityNet-1.3 and THUMOS-14, demonstrate the superiority of CAA compared with the state-of-the-art methods.

Abstract:
Describing images using natural language is widely known as image captioning, which has made consistent progress due to the development of computer vision and natural language generation techniques. Though conventional captioning models achieve high accuracy based on popular metrics, i.e., BLEU, CIDEr, and SPICE, the ability of captions to distinguish the target image from other similar images is under-explored. To generate distinctive captions, a few pioneers employ contrastive learning or re-weight the ground-truth captions, both of which focus on one single input image. However, the relationships between objects in a similar image group (e.g., items or properties within the same album or fine-grained events) are neglected. In this paper, we improve the distinctiveness of image captions using a Group-based Distinctive Captioning Model (GdisCap), which compares each image with other images in one similar group and highlights the uniqueness of each image. In particular, we propose a group-based memory attention (GMA) module, which stores object features that are unique among the image group (i.e., with low similarity to objects in other images). These unique object features are highlighted when generating captions, resulting in more distinctive captions. Furthermore, the distinctive words in the ground-truth captions are selected to supervise the language decoder and GMA. Finally, we propose a new evaluation metric, distinctive word rate (DisWordRate), to measure the distinctiveness of captions. Quantitative results indicate that the proposed method significantly improves the distinctiveness of several baseline models, and achieves the state-of-the-art performance on both accuracy and distinctiveness. Results of a user study agree with the quantitative evaluation and demonstrate the rationality of the new metric DisWordRate.

Abstract:
Mathematical expression recognition (MER) aims to convert an image of mathematical expressions into a LaTeX sequence. In practice, the task of MER is challenging, since 1) the images of mathematical expressions often contain complex structure relationships, e.g., fractions, matrices, and subscripts; 2) the generated LaTeX sequences can be very complex and they have to satisfy strict syntax rules. Existing methods, however, often ignore the complex dependence among image regions, resulting in poor feature representation. In addition, they may fail to capture the rigorous relations among different formula symbols as they consider MER as a common language generation task. To address these issues, we propose a Structure-Aware Sequence-Level (SASL) model for MER. First, to better represent and recognize the visual content of formula images, we propose a structure-aware module to capture the relationship among different symbols. Meanwhile, the sequence-level modeling helps the model to concentrate on the generation of entire sequences. To make the problem feasible, we cast the generation problem into a Markov decision process (MDP) and seek to learn a LaTeX sequence generating policy. Based on MDP, we learn SASL by maximizing the matching score of each image-sequence pair to obtain the generation policy. Extensive experiments on the IM2LATEX-100K dataset verify the effectiveness and superiority of the proposed method.

Abstract:
Referring expression comprehension aims to localize the target object in an image referred by a natural language expression. Most existing approaches neglect the implicit logical correlations among fine-grained cues, e.g., categories, attributes, which are beneficial for distinguishing objects. In this paper, we propose a logic-guided approach to explore logical knowledge for referring expression comprehension in a hierarchical modular-based framework. Specifically, we propose to extract fine-grained cues in visual and textual domains and perform logical reasoning over them with explicit logical expressions to regularize the matching process without extra parameters. Besides, we propose to improve existing modular-based methods by introducing context information of objects in the relationship module. Extensive experiments are conducted on three referring expression datasets, and the results demonstrate that our model can produce more consistent predictions and further achieve superior performance compared with previous methods.

Abstract:
Recently, deep neural networks (DNNs) have achieved significant success in real-world image super-resolution (SR). However, adversarial image samples with quasi-imperceptible noises could threaten deep learning SR models. In this paper, we propose a robust deep learning framework for real-world SR that randomly erases potential adversarial noises in the frequency domain of input images or features. The rationale is that on the SR task clean images or features have a different pattern from the attacked ones in the frequency domain. Observing that existing adversarial attacks usually add high-frequency noises to input images, we introduce a novel random frequency mask module that blocks out high-frequency components possibly containing the harmful perturbations in a stochastic manner. Since the frequency masking may not only destroy the adversarial perturbations but also affect the sharp details in a clean image, we further develop an adversarial sample classifier based on the frequency domain of images to determine whether to apply the proposed mask module. Based on the above ideas, we devise a novel real-world image SR framework that combines the proposed frequency mask modules and the proposed adversarial classifier with an existing super-resolution backbone network. Experiments show that our proposed method is less sensitive to adversarial attacks and presents more stable SR results than existing models and defenses.
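
Stochastic blocking of high-frequency components can be illustrated with a minimal NumPy sketch. It is not the paper's module: it assumes a single-channel image, a circular low-frequency region that is always kept, and independent Bernoulli dropping of the remaining coefficients. The parameter names `radius_frac` and `drop_prob` are hypothetical.

```python
import numpy as np

def random_frequency_mask(image, radius_frac=0.25, drop_prob=0.5, rng=None):
    """Randomly zero high-frequency FFT coefficients of a 2-D grayscale image."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape
    # Centre the spectrum so the zero frequency sits at (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)
    # Coefficients outside the kept radius count as high-frequency.
    high = dist > radius_frac * min(h, w)
    # Drop each high-frequency coefficient independently with drop_prob.
    drop = high & (rng.random((h, w)) < drop_prob)
    spectrum[drop] = 0
    # The random mask is not conjugate-symmetric, so keep only the real part.
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))
```

Setting `drop_prob=0.0` returns the input unchanged, while `radius_frac=0.0` with `drop_prob=1.0` keeps only the mean intensity, which shows why masking too aggressively also removes sharp detail from clean images.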

Abstract:
The task of video-based commonsense captioning aims to generate event-wise captions and meanwhile provide multiple commonsense descriptions (e.g., attribute, effect and intention) about the underlying event in the video. Prior works explore the commonsense captions by using separate networks for different commonsense types, which is time-consuming and lacks mining the interaction of different commonsense. In this paper, we propose a Hybrid Reasoning Network (HybridNet) to endow the neural networks with the capability of semantic-level reasoning and word-level reasoning. Firstly, we develop multi-commonsense learning for semantic-level reasoning by jointly training different commonsense types in a unified network, which encourages the interaction between the clues of multiple commonsense descriptions, event-wise captions and videos. Then, there are two steps to achieve the word-level reasoning: (1) a memory module records the history predicted sequence from the previous generation processes; (2) a memory-routed multi-head attention (MMHA) module updates the word-level attention maps by incorporating the history information from the memory module into the transformer decoder for word-level reasoning. Moreover, the multimodal features are used to make full use of diverse knowledge for commonsense reasoning. Experiments and abundant analysis on the large-scale Video-to-Commonsense benchmark show that our HybridNet achieves state-of-the-art performance compared with other methods.

Abstract:
Graph convolutional networks (GCNs) are widely used to handle irregular data since they update node features by using the structure information of the graph. With the help of iterated GCNs, high-order information can be obtained to further enhance the representation of nodes. However, how to apply GCNs to structured data (such as pictures) has not been deeply studied. In this paper, we explore the application of graph attention networks (GAT) in image feature extraction. First of all, we propose a novel graph generation algorithm to convert images into graphs through matrix transformation. It is an order of magnitude faster than the algorithm based on K Nearest Neighbors (KNN). Then, GAT is used on the generated graph to update the node features. Thus, a more robust representation is obtained. These two steps are combined into a module called the pixel-wise graph attention module (PGA). Since the graph obtained by our graph generation algorithm can still be transformed into a picture after processing, PGA can be well combined with CNNs. Based on these two modules, we draw on ResNet and design a pixel-wise graph attention network (PGANet). PGANet is applied to the task of person re-identification on the datasets Market1501, DukeMTMC-reID and Occluded-DukeMTMC (outperforming the state of the art by 0.8%, 1.1% and 11% respectively, in mAP scores). Experimental results show that it achieves state-of-the-art performance.

Abstract:
Few-shot Semantic Segmentation (FSS) is a challenging problem in computer vision. It aims at segmenting objects of the unseen categories given only one or several annotated samples. The essence of FSS is to disseminate information from support images to query images for segmenting the mutual object categories. In this paper, we propose a Dynamic Reasoning Network (DRNet) to adaptively generate the parameters of predicting layers and infer the segmentation mask for each unseen category. More specifically, an Attentional Feature Integration Sub-network (AFIS) is first proposed to extract consistent features from support images and query images. With shared weights, it stimulates the category consistency of different data streams. Then a Pooling-based Guidance Module (PGM) is used to correlate support features with query features progressively. To disseminate information from support images to various query images, we further propose a Dynamic Prediction Module (DPM) for generating the parameters of predicting layers. The proposed modules are unified for the dynamic reasoning of each query image segmentation. Experiments on two public benchmarks have demonstrated that our approach achieves superior performance and outperforms the very recent state-of-the-art methods.

Abstract:
In outdoor crimes such as robbery and kidnapping, suspects generally secretly follow their victims in public places and then look for opportunities to commit crimes. Video anomaly detection (VAD) has achieved fruitful results through deep neural networks (DNN). However, as an abnormal behavior without obvious abnormal physical features, hidden following is highly similar to ordinary walking and accompanying behaviors, so it is difficult to effectively detect hidden dangerous followers using video anomaly detection methods or traditional trajectory analysis methods. We propose the "hidden follower" detection (HFD) task and an HFD model based on gaze pattern extraction. It extracts gaze pattern features of pedestrians from gaze-interval-series and introduces a time series classification model to classify pedestrians with or without hidden following purposes. Based on this model, we propose a hidden follower detection framework (HFDF) to distinguish hidden followers from normal pedestrians, which utilizes the trajectories and gaze patterns extracted from videos. To cope with the lack of test data, we construct a dataset of 1200 pedestrians from a crowd simulation model to simulate scenes including hidden followers, and we also collect a surveillance video dataset including hidden following behaviors. The experiments conducted on these two datasets show that HFDF consistently outperforms the state-of-the-art method by a notable margin in the HFD task on the commonly-used F1 benchmark.

Abstract:
Because a large-scale dataset must be annotated to train modern deep learning based object detection models, zero-shot object detection, which aims to simultaneously localize and recognize unseen objects that are not observed during training, has become an important research field. In order to improve the performance of zero-shot object detection, recent state-of-the-art methods tend to make complicated modifications to modern object detectors in terms of the model structure, loss function and training process. They typically treat the simple modification as a baseline and assume it is worse than more complicated methods. In contrast, we find that a simple modification can achieve better performance. Considering that redundant modification may increase the risk of over-fitting on seen classes and reduce generalization performance on unseen classes, we propose a visual language based succinct zero-shot object detection framework, which only replaces the classification branch in the modern object detector with a lightweight visual-language network. Since zero-shot object detection is a classic multi-modal learning protocol which consists of a visual feature space and a language space, our visual-language network learns the visual language alignment from the image and language data of seen classes and transfers this alignment to detect unseen objects. Following the Occam's razor principle that "Entities should not be multiplied unnecessarily", extensive experimental results show that our succinct framework can surpass all existing zero-shot object detection methods on several benchmarks and sets a new state of the art.

Abstract:
This tutorial provides the audience with the basic theories, methodologies, and current progress of image quality assessment (IQA). From an actionable perspective, we will first revisit several subjective quality assessment methodologies, with emphasis on how to properly select visual stimuli. We will then present in detail the design principles of objective quality assessment models, supplemented by an in-depth analysis of their advantages and disadvantages. Both hand-engineered and (deep) learning-based methods will be covered. Moreover, the limitations of the conventional model comparison methodology for objective quality models will be pointed out, and novel comparison methodologies such as those based on the theory of "analysis by synthesis" will be introduced. Finally, we will discuss the real-world multimedia applications of IQA and give a list of open challenging problems, in the hope of encouraging more talented researchers and engineers to devote themselves to this exciting and rewarding research field.

Abstract:
Plenoptic representations, like light fields, point clouds or digital holography, provide the means for 3D representations suitable for multiple immersive and computer vision applications. JPEG has been standardizing coding tools for these types of plenoptic data in its project JPEG Pleno. This standardization effort has been developing quality assessment models suitable for the quality evaluation of the coding technologies. This tutorial explains the quality assessment methodologies defined for evaluating the different proposals for the three plenoptic modalities. The tutorial also covers possible alternatives considered in the definition of the quality assessment models and the selection of appropriate anchors decided during the JPEG Pleno development process.

Abstract:
Out-of-distribution generalization is becoming a hot research topic in both academia and industry. This tutorial is to disseminate and promote the recent research achievements on out-of-distribution generalization as well as their applications on multimedia, which is an exciting and fast-growing research direction in the general field of machine learning and multimedia. We will advocate novel, high-quality research findings, as well as innovative solutions to the challenging problems in out-of-distribution generalization and its applications for multimedia. This topic is at the core of the scope of ACM Multimedia, and is attractive to MM audience from both academia and industry.

Abstract:
Stereo matching is a fundamental and challenging task which has various applications in autonomous driving, dense reconstruction and other depth related tasks. Contextual information with discriminative features is crucial for accurate stereo matching in the ill-posed regions (textureless, occlusion, etc.). In this paper, we propose an efficient horizontal attention module to adaptively capture the global correspondence clues. Compared with the popular non-local attention, our horizontal attention is more effective for stereo matching, with better performance and lower consumption of computation and memory. We further introduce a deformable module to refine the contextual information in the disparity discontinuous areas such as the boundaries of objects. A learning-based method is adopted to construct the cost volume by concatenating the features of the two branches. To offer an explicit similarity measure that guides the learning-based volume toward a more reasonable unimodal matching cost distribution, we additionally combine the learning-based volume with the improved zero-centered group-wise correlation volume. Finally, we regularize the 4D joint cost volume by a 3D CNN module and generate the final output by disparity regression. The experimental results show that our proposed HDA-Net achieves the state-of-the-art performance on the Scene Flow dataset and obtains competitive performance on the KITTI datasets compared with the relevant networks.
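
Disparity regression is commonly implemented as a "soft argmin": the expected disparity under a softmax over the negated matching costs. The sketch below shows that common form for a single pixel; whether HDA-Net uses exactly this formulation is an assumption, and the function name is hypothetical.

```python
import math

def disparity_regression(costs):
    """Expected disparity under softmax(-cost) for one pixel (soft argmin).

    costs[d] is the regularized matching cost of disparity d (lower is better).
    """
    # Shift by the minimum cost so the exponentials stay numerically stable.
    lowest = min(costs)
    weights = [math.exp(-(c - lowest)) for c in costs]
    total = sum(weights)
    # Probability-weighted average of the candidate disparities 0..D-1.
    return sum(d * w for d, w in enumerate(weights)) / total
```

The estimate is only meaningful when the cost distribution has a single clear minimum; with several comparable minima the weighted average falls between them, which is one reason a unimodal matching cost distribution is desirable.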

Abstract:
Image inpainting is an underdetermined inverse problem, which naturally allows diverse contents to fill up the missing or corrupted regions realistically. Prevalent approaches using convolutional neural networks (CNNs) can synthesize visually pleasant contents, but CNNs suffer from limited perception fields for capturing global features. With image-level attention, transformers can model long-range dependencies and generate diverse contents with autoregressive modeling of pixel-sequence distributions. However, the unidirectional attention in autoregressive transformers is suboptimal as corrupted image regions may have arbitrary shapes with contexts from any direction. We propose BAT-Fill, an innovative image inpainting framework that introduces a novel bidirectional autoregressive transformer (BAT) for image inpainting. BAT utilizes transformers to learn autoregressive distributions, which naturally allows the diverse generation of missing contents. In addition, it incorporates a masked language model like BERT, which enables bidirectional modeling of the contextual information of missing regions for better image completion. Extensive experiments over multiple datasets show that BAT-Fill achieves superior diversity and fidelity in image inpainting qualitatively and quantitatively.

Abstract:
In recent years, the metaverse has attracted enormous attention from around the world with the development of related technologies. The expected metaverse should be a realistic society with more direct and physical interactions, while the concepts of race, gender, and even physical disability would be weakened, which would be highly beneficial for society. However, the development of the metaverse is still in its infancy, with great potential for improvement. Given the metaverse's huge potential, industry has already come forward with advance preparation, accompanied by feverish investment, but there are few discussions about the metaverse in academia to scientifically guide its development. In this paper, we highlight the representative applications for social good. Then we propose a three-layer metaverse architecture from a macro perspective, containing infrastructure, interaction, and ecosystem. Moreover, we journey toward both a historical and novel metaverse with a detailed timeline and table of specific attributes. Lastly, we illustrate our implemented blockchain-driven metaverse prototype of a university campus and discuss the prototype design and insights.

Abstract:
Designing visually appealing layouts for multimedia documents containing text, graphs and images requires a form of creative intelligence. Modelling the generation of layouts has recently gained attention due to its importance in aesthetics and communication style. In contrast to standard prediction tasks, there is a range of acceptable layouts which depend on user preferences. For example, one poster designer may prefer logos on the top-left while another prefers logos on the bottom-right. Both are correct choices, yet existing machine learning models treat layout generation as a single-choice prediction problem. In such situations, these models would simply average over all possible choices given the same input, forming a degenerate sample. In the above example, this would form an unacceptable layout with a logo in the centre.

Abstract:
Image enhancement aims to improve the aesthetic quality of images. Most enhancement methods are based on image decomposition techniques. For example, an entire image can be decomposed into a smooth base layer and a residual detail layer. Applying appropriate algorithms to different layers can solve most enhancement problems. Besides decomposing the entire image, the local decomposition approach in the local Laplacian filter can also achieve satisfactory enhancement results. Since a standard convolution is also a local operator whose output values are determined by neighborhood pixels, we observe that the standard convolution can be improved by integrating the local decomposition method to better solve image enhancement problems. Based on this analysis, we propose Windowing Decomposition Convolution (WDC), which decomposes the content of each convolution window by a windowing basic value before applying the convolution operation. Using different windowing basic values, the WDC can gather global information and locally separate the processing of different components of images. Moreover, combined with WDC, a new Windowing Decomposition Convolutional Neural Network (WDCNN) is presented. The experimental results show that our WDCNN achieves superior enhancement performance on the MIT-Adobe FiveK and sRGB-SID datasets for noise-free image retouching and low-light noisy image enhancement compared with state-of-the-art techniques.

Abstract:
Utilizing an arbitrary speech clip to edit the mouth of the portrait in the target video is a novel yet challenging task. Although impressive results have been achieved, existing methods still have three limitations: 1) since the acoustic features are not completely decoupled from person identity, there is no global method for mapping speech to facial features (i.e., landmarks, expression blendshapes); 2) the audio-driven talking face sequences generated by a simple cascade structure usually lack temporal consistency and spatial correlation, which leads to inconsistent changes in details; 3) the forgery is always performed at the video level, without considering the forgery of the voice, especially the synchronization of the converted voice and the mouth. To address these distortion problems, we propose a novel deep learning framework, named Temporal-Refinement Autoregressive-Cascade Rendering Network (TACR-Net), for audio-driven dynamic talking face editing. The proposed TACR-Net encodes facial expression blendshapes based on the given acoustic features without separate training for a specific video. TACR-Net also involves a novel autoregressive cascade structure generator for video re-rendering. Finally, we transform the in-the-wild speech to the target portrait and obtain a photo-realistic and audio-realistic video.

Abstract:
Weakly supervised object localization (WSOL), which seeks to train localizers with only image-level labels, has recently gained popularity. However, because they rely heavily on the classification objective for training, prevailing WSOL methods localize only the discriminative parts of an object, ignoring other useful information, such as the wings of a bird, and suffer from severe rotation variations. Moreover, learning object localization forces CNNs to attend to non-salient regions under weak supervision, which may negatively influence image classification results. To address these challenges, this paper proposes a novel end-to-end Excitation-Expansion network, coined E^2Net, to localize entire objects with only image-level labels, which serves as the basis of most multimedia tasks. The proposed E^2Net consists of two key components: Maxout-Attention Excitation (MAE) and Orientation-Sensitive Expansion (OSE). Firstly, the MAE module aims to activate non-discriminative localization features while simultaneously recovering discriminative classification cues. To this end, we couple an erasing strategy with maxout learning efficiently to facilitate entire-object localization without hurting classification accuracy. Secondly, to address rotation variations, the proposed OSE module expands less salient object parts along all possible orientations. In particular, the OSE module dynamically combines selective attention banks from variously oriented expansions of the receptive field, which introduces additional multi-parallel localization heads. Extensive experiments on ILSVRC 2012 and CUB-200-2011 demonstrate that the proposed E^2Net outperforms the previous state-of-the-art WSOL methods and also significantly improves classification performance.

Abstract:
Zero-shot cross-domain crowd counting is a challenging task where a crowd counting model is trained on a source domain (i.e., training dataset) and no additional labeled or unlabeled data is available for fine-tuning the model when testing on an unseen target domain (i.e., a different testing dataset). The generalisation performance of existing crowd counting methods is typically limited due to the large gap between source and target domains. Here, we propose a novel Crowd Counting framework built upon an external Momentum Template, termed C2MoT, which enables the encoding of domain-specific information via an external template representation. Specifically, the Momentum Template (MoT) is learned through momentum updates during offline training, and is then dynamically updated for each test image in online cross-dataset evaluation. Thanks to the dynamically updated MoT, our C2MoT effectively generates dense target correspondences that explicitly account for head regions, and then predicts the density map based on the normalized correspondence map. Experiments on large-scale datasets show that our proposed C2MoT achieves leading zero-shot cross-domain crowd counting performance without model fine-tuning, while also outperforming domain adaptation methods that use fine-tuning on target domain data. Moreover, C2MoT also obtains state-of-the-art counting performance on the source domain.
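A momentum update of a template can be sketched as an exponential moving average. This is an assumed form: the abstract only says the template is learned "in a momentum updating way", and the function name, the vector representation, and the momentum value of 0.9 are all hypothetical.

```python
def momentum_update(template, feature, momentum=0.9):
    """Blend a new feature vector into a running template.

    Assumption: an exponential moving average,
    template <- momentum * template + (1 - momentum) * feature.
    """
    if len(template) != len(feature):
        raise ValueError("template and feature must have the same length")
    return [momentum * t + (1.0 - momentum) * f
            for t, f in zip(template, feature)]


# Offline training would apply this once per batch; at test time the same
# rule could be applied once per image to adapt the template online.
template = [1.0, 0.0]
for feature in ([0.0, 1.0], [0.0, 1.0]):
    template = momentum_update(template, feature)
print(template)
```

A momentum close to 1 keeps the template stable against individual noisy images, while a smaller value lets it adapt faster to the target domain.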

Abstract:
Multimodal dialog systems have attracted increasing attention from both academia and industry over recent years. Although existing methods have achieved some progress, they still face challenges in question understanding (i.e., user intention comprehension). In this paper, we present a relational graph-based context-aware question understanding scheme, which enhances user intention comprehension from local to global. Specifically, we first utilize multiple attribute matrices as guidance information to fully exploit the product-related keywords in each textual sentence, strengthening the local representation of user intentions. Afterwards, we design a sparse graph attention network to adaptively aggregate effective context information for each utterance, capturing the user intentions from a global perspective. Moreover, extensive experiments over a benchmark dataset show the superiority of our model compared with several state-of-the-art baselines.

Abstract:
We investigate the problem of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are provided for training. It aims at localizing the queried objects described in the sentence to visual regions in the video. Despite recent progress, existing approaches have not fully exploited the potential of the description sentences for cross-modal alignment in two aspects: (1) Most of them extract objects from the description sentences and represent them with fixed textual representations. While achieving promising results, they do not make full use of the contextual information in the sentence. (2) A few works have attempted to utilize contextual information to learn object representations, but found a significant decrease in performance due to unstable training in cross-modal alignment. To address the above issues, in this paper, we propose a Stable Context Learning (SCL) framework for WSVOG which combines the merits of stable learning and rich contextual information. Specifically, we design two modules, named the Context-Aware Object Stabilizer module and the Cross-Modal Alignment Knowledge Transfer module, which cooperate to inject contextual information into stable object concepts in the text modality and transfer contextualized knowledge in cross-modal alignment. Our approach is finally optimized under a frame-level MIL paradigm. Extensive experiments on three popular benchmarks demonstrate its effectiveness.

Abstract:
Existing point cloud classifiers focus on handling irregular data structures to discover a global and discriminative configuration of local geometries. These classification methods design a number of effective permutation-invariant feature encoding kernels, but still suffer from the intrinsic challenge of large geometric feature variations caused by inconsistent point distributions along the object surface. In this paper, we address point cloud classification via deep graph representation learning that aggregates multiple convolutional feature kernels (namely, a poly convolutional operation) anchored on each point with its local neighbours. Inspired by the recent success of neural architecture search, we introduce a novel concept of poly-convolutional architecture search (PolyConv search for short) to model local geometric patterns in a more flexible manner.

Abstract:
Few-shot action recognition has drawn growing attention as it can recognize novel action classes by using only a few labeled samples. In this paper, we propose a novel semantic-guided relation propagation network (SRPN), which leverages semantic information together with visual information for few-shot action recognition. Different from most previous works that neglect semantic information in the labeled data, our SRPN directly utilizes the semantic label as an additional supervisory signal to improve the generalization ability of the network. Besides, we treat the relation of each visual-semantic pair as a relational node, and we use a graph convolutional network to model and propagate such sample relations across visual-semantic pairs, including both intra-class commonality and inter-class uniqueness, to guide the relation propagation in the graph. Moreover, since videos contain crucial sequential and ordering information, we propose a novel spatial-temporal difference module, which helps the network enhance its visual feature learning ability at both the feature level and the granular level for videos. Extensive experiments conducted on several challenging benchmarks demonstrate that our SRPN outperforms several state-of-the-art methods by a significant margin.

Abstract:
Weakly supervised temporal action localization (WTAL) is a challenging task, as only video-level category labels are available during the training stage. Without precise temporal annotations, most approaches rely on complementary RGB and optical flow features to predict the start and end frame of each action category in a video. However, existing approaches simply resort to either concatenation or weighted sum to learn how to take advantage of these two modalities for accurate action localization, which ignores the substantial variance between the two modalities. In this paper, we present Cross-Stream Collaborative Learning (CSCL) to address these issues. The proposed CSCL introduces a cross-stream weighting module to identify which modality is more robust during training and takes advantage of the robust modality to guide the weaker one. Furthermore, we suppress the snippets which have high action-ness scores in both modalities to further exploit the complementary property between the two modalities. In addition, we bring the concept of co-training to WTAL and take both modalities into account for pseudo label generation to help train a stronger model. Extensive experiments conducted on the THUMOS14 and ActivityNet datasets demonstrate that CSCL achieves favorable performance against state-of-the-art methods.

Abstract:
The world has become multimodal. In addition to text, we have been sharing a huge amount of multimedia information in the form of images and videos on the Internet. The widespread use of smart mobile devices has also changed the way we interact with the Internet. It is now natural for us to capture images and videos freely and use them as part of a query, in addition to traditional text and voice. These, along with the rapid advancements in multimedia, natural language processing, information retrieval, and conversation technologies, mean that it is time for us to explore multimodal conversation and its roles in search and recommendation. Multimodal conversation has the potential to help us uncover and digest the huge amount of multimedia information and knowledge hidden within many systems. It also enables natural two-way interactions between humans and machines, with mutual benefits in enriching their respective knowledge. Finally, it opens up the possibilities of disrupting many existing applications and launching new innovative applications. This panel is timely and aims to explore this emerging trend, and discuss its potential benefits and pitfalls to society. The panel will also explore the limitations of current technologies and highlight future research directions towards developing a multimedia conversational system.

Abstract:
This paper aims to advance the performance of industrial ASR systems by exploring a more effective method for N-best rescoring, a critical step that greatly affects the final recognition accuracy. Existing rescoring approaches suffer from the following issues: (i) limited performance, since they optimize an unnecessarily harder problem, namely predicting accurate grammatical legitimacy scores of the N-best hypotheses rather than directly predicting their partial orders regarding a specific acoustic input; (ii) difficulty in incorporating various information from advanced natural language processing (NLP) models such as BERT to achieve a comprehensive evaluation of each N-best candidate. To relieve the above drawbacks, we propose a simple yet effective mechanism, Learning-to-Rescore (L2RS), to empower ASR systems with state-of-the-art information retrieval (IR) techniques. Specifically, L2RS utilizes a wide range of textual information from state-of-the-art NLP models and automatically decides their weights to directly learn the ranking order of each N-best hypothesis with respect to a specific acoustic input. We incorporate various features, including BERT sentence embeddings, topic vectors, and perplexity scores produced by an n-gram language model (LM), a topic modeling LM, BERT, and an RNNLM, to train the rescoring model. Experimental results on a public dataset show that L2RS outperforms not only traditional rescoring methods but also its deep neural network counterparts by a substantial margin of 20.85% in terms of NDCG@10. The L2RS toolkit has been successfully deployed for many online commercial services in WeBank Co., Ltd, China's leading digital bank. The efficacy and applicability of L2RS are validated by real-life online customer datasets.
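For readers unfamiliar with the reported metric, NDCG@10 can be computed as in the sketch below. It assumes the linear-gain variant of DCG (an exponential-gain variant also exists), and the relevance grades are placeholders: the abstract does not state how relevance is assigned to N-best hypotheses.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """NDCG@k for relevance grades listed in the order the model ranked them."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant item at all
    return dcg_at_k(relevances, k) / ideal


# Placeholder relevance grades of hypotheses, in rescored order.
print(ndcg_at_k([3, 2, 1]))  # ideal ordering
print(ndcg_at_k([1, 2, 3]))  # reversed ordering scores lower
```

Because the metric depends only on the order of the candidates, it matches a rescoring model that predicts a ranking rather than calibrated scores.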

Abstract:
The goal of unsupervised domain adaptation is to learn a task classifier that performs well for the unlabeled target domain by borrowing rich knowledge from a well-labeled source domain. Although remarkable breakthroughs have been achieved in learning transferable representation across domains, two bottlenecks remain to be further explored. First, many existing approaches focus primarily on the adaptation of the entire image, ignoring the limitation that not all features are transferable and informative for the object classification task. Second, the features of the two domains are typically aligned without considering the class labels; this can lead the resulting representations to be domain-invariant but non-discriminative to the category. To overcome the two issues, we present a novel Informative Class-Conditioned Feature Alignment (IC2FA) approach for UDA, which utilizes a twofold method: informative feature disentanglement and class-conditioned feature alignment, designed to address the above two challenges, respectively. More specifically, to surmount the first drawback, we cooperatively disentangle the two domains to obtain informative transferable features; here, Variational Information Bottleneck (VIB) is employed to encourage the learning of task-related semantic representations and suppress task-unrelated information. With regard to the second bottleneck, we optimize a new metric, termed Conditional Sliced Wasserstein Distance (CSWD), which explicitly estimates the intra-class discrepancy and the inter-class margin. The intra-class and inter-class CSWDs are minimized and maximized, respectively, to yield the domain-invariant discriminative features. IC2FA equips class-conditioned feature alignment with informative feature disentanglement and causes the two procedures to work cooperatively, which facilitates informative discriminative features adaptation. Extensive experimental results on three domain adaptation datasets confirm the superiority of IC2FA.
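The sliced Wasserstein distance underlying the proposed metric can be sketched as follows. This is the plain, unconditional distance between two equally sized feature sets, not the authors' CSWD. A class-conditioned version would presumably apply it to per-class subsets of source and target features, which is an assumption here. All names and parameters are hypothetical.

```python
import math
import random


def wasserstein_1d(a, b):
    """1-Wasserstein distance between two equal-size 1D samples."""
    if len(a) != len(b):
        raise ValueError("samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def random_unit_vector(dim, rng):
    """Random direction obtained by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sliced_wasserstein(xs, ys, num_projections=64, seed=0):
    """Average 1D Wasserstein distance over random projections."""
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(num_projections):
        direction = random_unit_vector(dim, rng)
        px = [sum(a * d for a, d in zip(x, direction)) for x in xs]
        py = [sum(a * d for a, d in zip(y, direction)) for y in ys]
        total += wasserstein_1d(px, py)
    return total / num_projections


source = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
target = [[x + 2.0, y + 2.0] for x, y in source]  # shifted copy
print(sliced_wasserstein(source, source), sliced_wasserstein(source, target))
```

Projecting to one dimension makes each distance a sort followed by a mean absolute difference, which is what makes the sliced variant cheap to optimize.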

Abstract:
Self-supervised representation learning for visual pre-training has achieved remarkable success with sample (instance or pixel) discrimination and instance-level semantics discovery, yet there still exists a non-negligible gap between pre-trained models and downstream dense prediction tasks. Concretely, these downstream tasks require more accurate representations; in other words, the pixels from the same object must belong to a shared semantic category, which is lacking in previous methods. In this work, we present Dense Semantic Contrast (DSC) for modeling semantic category decision boundaries at a dense level to meet the requirement of these tasks. Furthermore, we propose a dense cross-image semantic contrastive learning framework for multi-granularity representation learning. Specifically, we explicitly explore the semantic structure of the dataset by mining relations among pixels from different perspectives. For intra-image relation modeling, we discover pixel neighbors from multiple views. For inter-image relations, we enforce pixel representations from the same semantic class to be more similar than representations from different classes in one mini-batch. Experimental results show that our DSC model outperforms state-of-the-art methods when transferring to downstream dense prediction tasks, including object detection, semantic segmentation, and instance segmentation. Code will be made available.

Abstract:
Multi-modal hashing makes an important contribution to multimedia retrieval, where a key challenge is to encode heterogeneous modalities into compact hash codes. To address this challenge, graph-based multi-modal hashing methods generally define an individual affinity matrix for each independent modality and apply a linear algorithm for heterogeneous modality fusion and compact hash learning. Several other methods construct a graph Laplacian matrix based on semantic information to help learn discriminative hash codes. However, these conventional methods largely ignore the structural similarity of the training set and the complex relations among multi-modal samples, which leads to unsatisfactory complementarity of fused hash codes. More notably, they face two other important problems: huge computing and storage costs caused by graph construction, and the loss of partial modality features when an incomplete query sample arrives. In this paper, we propose a Flexible Graph Convolutional Multi-modal Hashing (FGCMH) method that adopts GCNs with linear complexity to preserve both the modality-individual and modality-fused structural similarity for discriminative hash learning. As a result, accurate multimedia retrieval can be performed on both complete and incomplete datasets with our method. Specifically, multiple modality-individual GCNs under semantic guidance are proposed to act on each individual modality independently for intra-modality similarity preserving; the output representations are then fused into a fusion graph with an adaptive weighting scheme. A hash GCN and a semantic GCN, which share parameters in the first two layers, propagate fusion information and generate hash codes under high-level label space supervision. In the query stage, our method adaptively captures various multi-modal contents in a flexible and robust way, even if partial modality features are lost. Experimental results on three public datasets show the flexibility and effectiveness of our proposed method.

Abstract:
Heraclitus's Forest is an interactive artwork that utilizes birch trees as a metaphor for the life stories recorded in an oral history database. We design a day/night cycle system to present the forest experience as time elapses, multiple interaction modes to engage audiences in history exploration, and an evolving forest to prompt people's reflection on the nature of history, which is constantly being constructed but can never be returned to.

Abstract:
With various face presentation attacks arising under unseen scenarios, face anti-spoofing (FAS) based on domain generalization (DG) has drawn growing attention due to its robustness. Most existing methods utilize DG frameworks to align the features to seek a compact and generalized feature space. However, little attention has been paid to the feature extraction process for the FAS task, especially the influence of normalization, which also has a great impact on the generalization of the learned representation. To address this issue, we propose a novel perspective of face anti-spoofing that focuses on the normalization selection in the feature extraction process. Concretely, an Adaptive Normalized Representation Learning (ANRL) framework is devised, which adaptively selects feature normalization methods according to the inputs, aiming to learn domain-agnostic and discriminative representation. Moreover, to facilitate the representation learning, Dual Calibration Constraints are designed, including Inter-Domain Compatible loss and Inter-Class Separable loss, which provide a better optimization direction for generalizable representation. Extensive experiments and visualizations are presented to demonstrate the effectiveness of our method against the SOTA competitors.

Abstract:
Even the most impressive achievement in frontal face synthesis is challenged by large poses and low-quality data given one single side-view face. We propose a synthesizer called SuperFront GAN (SF-GAN) that accepts one or more low-resolution (LR) faces with various poses as input and outputs a high-resolution (HR) frontal face while preserving identity information. SF-GAN includes intra-class and inter-class constraints, which allow it to learn an identity-preserving representation from multiple LR faces in an improved, comprehensive manner. We adopt an orthogonal loss as the intra-class constraint that diversifies the learned feature space per subject. Hence, each sample is made to complement the others as much as possible. Additionally, a triplet loss is used as the inter-class constraint: it improves the discriminative power of the new representation, which, hence, maintains the identity information. Furthermore, we integrate a super-resolution (SR) side-view module as part of the SF-GAN to help preserve the finer details of HR side-views. This helps the model reconstruct the high-frequency parts of the face (i.e., the periocular, nose, and mouth regions). Quantitative and qualitative results demonstrate the superiority of SF-GAN. SF-GAN holds promise as a pre-processing step to normalize and align faces before passing them to a CV system for processing.
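The inter-class constraint mentioned above is a triplet loss, which can be sketched in its standard margin-based form. The use of plain Euclidean distance, the margin of 1.0, and the toy embeddings are assumptions for illustration; the abstract does not give SF-GAN's exact settings.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    anchor and positive share an identity; negative has a different one.
    """
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin)


anchor, positive = [0.0, 0.0], [0.0, 1.0]
print(triplet_loss(anchor, positive, [3.0, 4.0]))  # far negative: no penalty
print(triplet_loss(anchor, positive, [0.0, 1.5]))  # close negative: penalized
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only triplets that violate the margin contribute gradients.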

Abstract:
Face forgery by deepfake has spread widely over the internet, and this raises severe societal concerns. In this paper, we propose a novel video transformer with incremental learning for detecting deepfake videos. To better align the input face images, we use a 3D face reconstruction method to generate a UV texture from a single input face image. The aligned face image can also provide pose, eye-blink and mouth movement information that cannot be perceived in the UV texture image, so we use both face images and their UV texture maps to extract the image features. We present an incremental learning strategy to fine-tune the proposed model on a smaller amount of data and achieve better deepfake detection performance. Comprehensive experiments on various public deepfake datasets demonstrate that the proposed video transformer model with incremental learning achieves state-of-the-art performance in the deepfake video detection task with enhanced feature learning from the sequenced data.

Abstract:
The consensus of multiple views on the same data provides extra regularization, thereby improving accuracy. Based on this idea, we propose a novel Knowledge-Supervised Learning (KSL) method for person re-identification (Re-ID), which can improve performance without introducing extra inference cost. Firstly, we introduce an isomorphic auxiliary training strategy to construct basic multiple views by simultaneously training multiple classifier heads of the same network on the same training data. The consensus constraints aim to maximize the agreement among multiple views. To introduce this regularization constraint, we draw on knowledge distillation, in which paired branches can be trained collaboratively through mutual imitation learning. Three novel constraint losses are proposed to distill the knowledge that needs to be transferred across different branches: similarity of predicted classification probability for cosine space constraints, distance of embedding features for Euclidean space constraints, and hard sample mutual mining for hard sample space constraints. From different perspectives, these losses complement each other. Experiments on four mainstream Re-ID datasets show that a standard model with the KSL method trained from scratch outperforms its ImageNet pre-training results by a clear margin. With the KSL method, a lightweight model without ImageNet pre-training outperforms most large models. We expect that these discoveries can shift some attention from the current de facto paradigm of "pre-training and fine-tuning" in the Re-ID task to knowledge discovery during model training.

Abstract:
In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision community. The input to conditional image generation has evolved from image-only to multimodality. In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects. The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image. We propose a GAN-based method to tackle this problem. The key idea is to treat text as neural operators to locally modify the image feature. We show that the proposed model performs favorably against recent strong baselines on three public datasets. Specifically, it generates images of greater fidelity and semantic relevance, and when used as an image query, leads to better retrieval performance.

Abstract:
In this work, we address the task of video background music generation. Some previous works achieve effective music generation but are unable to generate melodious music specifically for a given video, and none of them considers the video-music rhythmic consistency. To generate the background music that matches the given video, we first establish the rhythmic relationships between video and background music. In particular, we connect timing, motion speed, and motion saliency from video with beat, simu-note density, and simu-note strength from music, respectively. We then propose CMT, a Controllable Music Transformer that enables the local control of the aforementioned rhythmic features, as well as the global control of the music genre and the used instrument specified by users. Objective and subjective evaluations show that the generated background music has achieved satisfactory compatibility with the input videos, and at the same time, impressive music quality.

Abstract:
Many factors affect speech intelligibility in face-to-face conversations. These factors lead conversation participants to speak louder and more distinctly, exposing the content to potential eavesdroppers. To address these issues, we introduce Theophany, a privacy-preserving framework for augmenting speech. Theophany establishes ad-hoc social networks between conversation participants to exchange contextual information, improving speech intelligibility in real-time. At the core of Theophany, we develop the first privacy perception model that assesses the privacy risk of a face-to-face conversation based on its topic, location, and participants. This framework allows the development of any privacy-preserving application for face-to-face conversation. We implement the framework within a prototype system that augments the speaker's speech with real-life subtitles to overcome the loss of contextual cues brought by mask-wearing and social distancing during the COVID-19 pandemic. We evaluate Theophany through a user survey and a user study with 53 and 17 participants, respectively. Theophany's privacy predictions match the participants' privacy preferences with an accuracy of 71.26%. Users considered Theophany to be useful to protect their privacy (3.88/5), easy to use (4.71/5), and enjoyable to use (4.24/5). We also raise the question of demographic and individual differences in the design of privacy-preserving solutions.

Abstract:
In this paper, we place the atomic action detection problem into a Long-Short Term Context (LSTC) to analyze how the temporal reliance among video signals affects the action detection results. To do this, we decompose the action recognition pipeline into short-term and long-term reliance, under the hypothesis that the two kinds of context are conditionally independent given the objective action instance. Within our design, a local aggregation branch is utilized to gather dense and informative short-term cues, while a high-order long-term inference branch is designed to infer the objective action class from high-order interactions between the actor and other persons or person pairs. Both branches independently predict the context-specific actions and the results are merged in the end. We demonstrate that both temporal grains are beneficial to atomic action recognition. On the mainstream benchmarks of atomic action detection, our design brings a significant performance gain over the existing state-of-the-art pipeline.

Abstract:
Point clouds obtained from 3D sensors are usually sparse. Existing methods mainly focus on upsampling sparse point clouds in a supervised manner by using dense ground truth point clouds. In this paper, we propose a self-supervised point cloud upsampling network (SSPU-Net) to generate dense point clouds without using ground truth. To achieve this, we exploit the consistency between the input sparse point cloud and generated dense point cloud for the shapes and rendered images. Specifically, we first propose a neighbor expansion unit (NEU) to upsample the sparse point clouds, where the local geometric structures of the sparse point clouds are exploited to learn weights for point interpolation. Then, we develop a differentiable point cloud rendering unit (DRU) as an end-to-end module in our network to render the point cloud into multi-view images. Finally, we formulate a shape-consistent loss and an image-consistent loss to train the network so that the shapes of the sparse and dense point clouds are as consistent as possible. Extensive results on the CAD and scanned datasets demonstrate that our method can achieve impressive results in a self-supervised manner.

Abstract:
Video object detection is the task of detecting objects in a sequence of frames, typically with a significant overlap in content among consecutive frames. Mean Average Precision (mAP) was originally proposed for evaluating object detection techniques in independent frames, but has been used for evaluating video-based object detectors as well. This is undesirable since the average precision over all frames masks the biases that a certain object detector might have against certain types of objects depending on the number of frames for which the object is present in a video sequence. In this paper we show several disadvantages of mAP as a metric for evaluating video-based object detection. Specifically, we show that: (a) some object detectors could be severely biased against specific kinds of objects, such as small, blurred, or low-contrast objects, and such differences may not be reflected in mAP-based evaluation; (b) operating a video-based object detector at the best frame-based precision/recall value (high F1 score) may lead to many false positives without a significant increase in the number of objects detected; and (c) mAP does not take into account that tracking can potentially be used to recover missed detections in the temporal neighborhood, although this can be accounted for when evaluating detectors. As an alternative, we suggest a novel evaluation metric (VmAP) which takes the focus away from evaluating detections on every frame. Unlike mAP, VmAP rewards a high recall of different object views throughout the video. We form sets of bounding boxes having similar views of an object in a temporal neighborhood and use a set-level recall for evaluation. We show that VmAP is able to address all the challenges with mAP listed above. Our experiments demonstrate hidden biases in object detectors, show up to 99% reduction in false positives while maintaining similar object recall, and show a 9% improvement in correlation with post-tracking performance.
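The set-level recall mentioned above can be illustrated with a small sketch. This is not the VmAP definition from the paper: the rule that a set counts as recalled when at least one of its boxes is detected, the function name `set_level_recall`, and the box identifiers are all assumptions made for illustration.

```python
def set_level_recall(view_sets, detected_ids):
    """Fraction of view sets with at least one detected box.

    view_sets:    list of sets of ground-truth box ids; each set groups boxes
                  showing a similar view of one object in nearby frames.
    detected_ids: set of ground-truth box ids matched by some detection.
    The "any box in the set" rule is an assumption of this sketch.
    """
    if not view_sets:
        return 0.0
    recalled = sum(1 for s in view_sets if any(box in detected_ids for box in s))
    return recalled / len(view_sets)


def frame_level_recall(view_sets, detected_ids):
    """Conventional recall over every individual ground-truth box."""
    all_boxes = [box for s in view_sets for box in s]
    if not all_boxes:
        return 0.0
    return sum(1 for box in all_boxes if box in detected_ids) / len(all_boxes)


# Hypothetical toy data: object "a" appears in two view sets, object "b" in one.
view_sets = [{"f1_a", "f2_a", "f3_a"}, {"f7_a", "f8_a"}, {"f1_b"}]
detected = {"f2_a", "f1_b"}

print(set_level_recall(view_sets, detected))    # 2 of 3 sets recalled
print(frame_level_recall(view_sets, detected))  # 2 of 6 boxes recalled
```

In this toy case the detector misses most individual boxes yet still covers two of the three view sets, which is the kind of difference a set-level measure is meant to expose.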

Abstract:
Deep learning-based semantic segmentation methods require a huge amount of training images with pixel-level annotations. Unsupervised domain adaptation (UDA) for semantic segmentation enables transferring knowledge learned from the synthetic data (source domain) with low-cost annotations to the real images (target domain). However, current UDA methods mostly require full access to the source domain data for feasible adaptation, which limits their applications in real-world scenarios with privacy, storage, or transmission issues. To this end, this paper identifies and addresses a more practical but challenging problem of UDA for semantic segmentation, where access to the original source domain data is forbidden. In other words, only the pre-trained source model and unlabelled target domain data are available for adaptation. To tackle the problem, we propose to construct a set of source domain virtual data to mimic the source domain distribution by identifying the target domain high-confidence samples predicted by the pre-trained source model. Then by analyzing the data properties in the cross-domain semantic segmentation tasks, we propose an uncertainty and prior distribution-aware domain adaptation method to align the virtual source domain and the target domain with both adversarial learning and self-training strategies. Extensive experiments on three cross-domain semantic segmentation datasets with in-depth analyses verify the effectiveness of the proposed method.

Abstract:
In this paper, we propose a novel pretext task namely Channel Aliasing Video Perception (CAVP) for self-supervised video representation learning. The main idea of our approach is to generate channel aliasing videos, which carry different motion cues simultaneously by assembling distinct channels from different videos. With the generated channel aliasing videos, we propose to recognize the number of different motion flows within a channel aliasing video for perception of discriminative motion cues. As a plug-and-play method, the proposed pretext task can be integrated into a co-training framework with other self-supervised learning methods to further improve the performance. Experimental results on publicly available action recognition benchmarks verify the effectiveness of our method for spatio-temporal representation learning.
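The idea of assembling channels from different videos can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the exact construction, so the video layout (a list of frames, each a list of three-channel pixel tuples), the function name `make_channel_aliasing_video`, and the choice of the number of distinct source videos as the pretext label are hypothetical.

```python
def make_channel_aliasing_video(videos, channel_sources):
    """Build a video whose channel c is copied from videos[channel_sources[c]].

    videos:          list of videos with identical shape; a video is a list of
                     frames and a frame is a list of (c0, c1, c2) pixel tuples.
    channel_sources: one source-video index per channel, e.g. [0, 1, 0].
    Returns the aliased video and the number of distinct source videos, used
    here as a stand-in for the "number of motion flows" label.
    """
    n_frames = len(videos[0])
    n_pixels = len(videos[0][0])
    aliased = []
    for t in range(n_frames):
        frame = []
        for p in range(n_pixels):
            # Take each channel value from its designated source video.
            pixel = tuple(videos[src][t][p][c] for c, src in enumerate(channel_sources))
            frame.append(pixel)
        aliased.append(frame)
    label = len(set(channel_sources))
    return aliased, label


# Two tiny hypothetical videos: 2 frames, 2 pixels per frame, 3 channels.
video_a = [[(1, 2, 3), (4, 5, 6)], [(7, 8, 9), (10, 11, 12)]]
video_b = [[(21, 22, 23), (24, 25, 26)], [(27, 28, 29), (30, 31, 32)]]

mixed, n_flows = make_channel_aliasing_video([video_a, video_b], [0, 1, 0])
print(mixed[0])  # [(1, 22, 3), (4, 25, 6)]
print(n_flows)   # 2
```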

Abstract:
The goal of multi-view subspace clustering is to explore a common latent space in which the multi-view data points lie. Myriad subspace learning algorithms have been investigated to boost the performance of multi-view clustering, but they seldom exploit either multi-view consistency or multi-view diversity, let alone take both into consideration simultaneously. To do so, we propose a novel multi-view subspace clustering method via cross-view diversity detection (CDD). CDD is able to integrate these two complementary criteria seamlessly into a holistic design of clustering algorithms. With the consistent part and diverse part being detected, a pure graph can be derived for each view. The consistent pure parts of different views are further fused into a consensus structured graph with exactly k connected components, where k is the number of clusters. Thus we can directly obtain the final clustering result without any postprocessing, as each connected component precisely corresponds to an individual cluster. We model the above concerns in a unified optimization framework. Our empirical studies validate that the proposed model outperforms several other state-of-the-art methods.
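The last step described above, reading clusters directly from the connected components of the consensus graph, can be sketched with a small union-find routine. The sketch assumes the consensus graph is already available as a dense symmetric affinity matrix; the function name, the tolerance `tol`, and the toy matrix are illustrative and not taken from the paper.

```python
def cluster_labels_from_graph(affinity, tol=1e-8):
    """Assign one label per node according to its connected component.

    affinity: square list-of-lists; entries with magnitude above `tol`
              are treated as edges (the threshold is an assumption).
    """
    n = len(affinity)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if abs(affinity[i][j]) > tol or abs(affinity[j][i]) > tol:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    root_to_label = {}
    return [root_to_label.setdefault(find(i), len(root_to_label)) for i in range(n)]


# Hypothetical consensus graph with exactly k = 2 connected components.
toy_graph = [
    [0.0, 0.9, 0.0, 0.0, 0.0],
    [0.9, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.4],
    [0.0, 0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.0, 0.0],
]
print(cluster_labels_from_graph(toy_graph))  # [0, 0, 1, 1, 1]
```

Because the graph already has exactly k components, no k-means or spectral post-processing step is needed to obtain the labels.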

Abstract:
More and more deep neural network models have been deployed in real-time video systems. However, it has been shown that deep models are susceptible to crafted adversarial examples. Such adversarial examples are imperceptible, yet they cause normal deep models to misclassify them. Although a few works target adversarial examples for video recognition in the black-box attack mode, most of them need large perturbations or hundreds of thousands of queries. There is still a lack of effective adversarial methods that produce adversarial videos with small perturbations and limited query numbers at the same time.

Abstract:
Fashion video synthesis has attracted increasing attention due to its huge potential in immersive media, virtual reality and online retail applications, yet traditional 3D graphic pipelines often require extensive manual labor on data capture and model rigging. In this paper, we investigate an image-based approach to this problem that generates a fashion video clip from a still source image of the desired outfit, which is then rigged in a framewise fashion under the guidance of a driving video. A key challenge for this task lies in the modeling of feature transformation across source and driving frames, where fine-grained transform helps promote visual details at garment regions, but often at the expense of intensified temporal flickering. To resolve this dilemma, we propose a novel framework with 1) a multi-scale transform estimation and feature fusion module to preserve fine-grained garment details, and 2) an intrinsic regularization loss to enforce temporal consistency of learned transform between adjacent frames. Our solution is capable of generating 512×512 fashion videos with rich garment details and smooth fabric movements beyond existing results. Extensive experiments over the FashionVideo benchmark dataset have demonstrated the superiority of the proposed framework over several competitive baselines.

Abstract:
Few-shot object detection (FSOD) aims at learning a detector that can quickly adapt to previously unseen objects with scarce annotated examples. Existing methods solve this problem by performing the subtasks of classification and localization with a shared component in the detector, yet few of them take into consideration the distinct preferences of the two subtasks towards feature embedding. In this paper, we carefully analyze the characteristics of FSOD and argue that a few-shot detector should explicitly decompose the two subtasks, as well as leverage information from both of them to enhance feature representations. To this end, we propose a simple yet effective Adaptive Fully-Dual Network (AFD-Net). Specifically, we extend Faster R-CNN by introducing a Dual Query Encoder and a Dual Attention Generator for separate feature extraction, and a Dual Aggregator for separate model reweighting. In this way, separate state estimation is achieved by the R-CNN detector. Furthermore, we introduce an Adaptive Fusion Mechanism to guide the design of encoders for efficient feature fusion in the specific subtask. Extensive experiments on PASCAL VOC and MS COCO show that our approach achieves state-of-the-art performance by a large margin, demonstrating its effectiveness and generalization ability.

Abstract:
Deep convolutional neural networks have achieved significant progress in single image rain streak removal. However, most data-driven learning methods are fully supervised or semi-supervised, and they suffer a significant performance drop when dealing with real rain. These data-driven learning methods have strong representation ability yet generalize poorly to real rain. The opposite holds true for the model-driven unsupervised optimization methods. To overcome these problems, we propose a unified unsupervised learning framework which inherits the generalization and representation merits for real rain removal. Specifically, we first discover a simple yet important piece of domain knowledge, that a directional rain streak is anisotropic while the natural clean image is isotropic, and formulate this structural discrepancy into the energy function of the optimization model. Consequently, we design an optimization-model-driven deep CNN in which the unsupervised loss function of the optimization model is enforced on the proposed network for better generalization. In addition, the architecture of the network mimics the main role of the optimization models with better feature representation. On one hand, we take advantage of the deep network to improve the representation. On the other hand, we utilize the unsupervised loss of the optimization model for better generalization. Overall, the unsupervised learning framework achieves good generalization and representation: unsupervised training (loss) with only a few real rainy images (input) and a physically meaningful network (architecture). Extensive experiments on synthetic and real-world rain datasets show the superiority of the proposed method.

Abstract:
To make video creation simpler, in this paper we present Text2Video, a novel system to automatically produce videos using only text-editing for novice users. Given an input text script, the director-like system can generate game-related engaging videos which illustrate the given narrative, provide diverse multi-modal content, and follow video editing guidelines. The system involves five modules: (1) A material manager extracts highlights from raw live game videos, and tags each video highlight, image and audio with labels. (2) A natural language processor extracts entities and semantics from the input text scripts. (3) A refined cross-modal retrieval searches for matching candidate shots from the material manager. (4) A text to speech speaker reads the processed text scripts with synthesized human voice. (5) The selected material shots and synthesized speech are assembled artistically through appropriate video editing techniques.

Abstract:
Telemarketing is a primary and mature method for enterprises to solicit prospective customers to buy products or services. However, training telesales representatives is always a pain point for enterprises since it is usually conducted manually and costs great effort and time. In this demonstration, we propose a telemarketing coaching system named SmartSales to help enterprises develop better salespeople. Powered by artificial intelligence (AI), SmartSales aims to accumulate experienced sales pitches from customer-sales dialogues and use them to coach junior salespersons. To the best of our knowledge, this is the first AI telemarketing coaching system in the domain of Chinese FinTech reported in the literature. SmartSales has been successfully deployed in WeBank's telemarketing team. We expect that SmartSales will inspire more research on AI assistant systems.

Abstract:
Question answering over tables is a very popular semantic parsing task in natural language processing (NLP). However, few existing methods focus on table images, even though there are usually large-scale unstructured tables in practice (e.g., table images). Table parsing from images is nontrivial since it is closely related to not only NLP but also computer vision (CV) to parse the tabular structure from an image. In this demo, we present a question answering system for unstructured table images. The proposed system mainly consists of 1) a table recognizer to recognize the tabular structure from an image and 2) a table parser to generate the answer to a natural language question over the table. In addition, to train the model, we further provide table images and structure annotations for two widely used semantic parsing datasets. Specifically, the test set is used for this demo, from where the users can either choose from default questions or enter a new custom question.

Abstract:
Diet management is usually conducted by recording the names of foods eaten, but in fact, the nutritional value of food with the same name varies greatly from recipe to recipe. To know the accurate nutritional values of foods, recording personal recipes is effective but time-consuming. Therefore, we are developing a mobile application, "RecipeLog", that helps users write their own recipes by modifying prepared ones. In our experiments, we show that with RecipeLog users create personal recipes with 45% less edit distance compared to writing from scratch.
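To make the edit-distance comparison above concrete, here is a minimal sketch using the standard Levenshtein distance. The abstract does not say which unit the distance is computed over, so treating each recipe as a list of ingredient lines is an assumption, and the two recipes are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical recipes, one ingredient line per list element.
prepared = ["flour 200g", "sugar 50g", "butter 100g"]
personal = ["flour 200g", "sugar 30g", "butter 100g", "vanilla 1tsp"]

from_prepared = edit_distance(prepared, personal)  # edits when modifying a prepared recipe
from_scratch = edit_distance([], personal)         # edits when writing from scratch
print(from_prepared, from_scratch)  # 2 4
```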

Abstract:
ArtiVisual is a platform for generating new art pieces based on an existing art style and comparing commonalities between paintings from different eras. We combine an image generative network with established state-of-the-art visualisation techniques to deepen the users' understanding of art in general. With ArtiVisual we can generate images based on art styles via an interactive timeline. Common features between art styles are reflected in the generated art piece produced by the network after learning the subspace of each artist's specific features. Visualisations are presented to provide insight into commonalities between existing and generated images. The combination of a trained network and our visualisation techniques provides a rigid framework for thorough exploration and understanding of art datasets.

Abstract:
Although pedestrian detection has developed a lot recently, there still exist some challenging scenarios, such as small scale, occlusion, and low light. Current works usually focus on one of these scenarios independently and propose specific methods. However, different challenges may occur simultaneously and change over time, making a scenario-specific method infeasible in practice. Therefore, we are motivated to design a method which is able to handle various challenges and obtain reasonable performance across different scenarios. In this paper, we first propose Instance Domain Compactness (IDC) to measure the difference of each instance in the feature space and handle hard cases from a novel long-tailed domain perspective. Specifically, we propose a Feature Augmentation Module (FAM) to augment the tail instances in the feature space, thereby increasing the number and diversity of tail samples. Besides, an IDC-guided loss weighting module (IDCW) is formulated to adaptively re-weight the loss of each sample so as to balance the optimization procedure. Extensive analysis and experiments illustrate that our method improves the generalization of the model without any extra parameters and achieves comparable results across different challenging scenarios on both the CityPersons and Caltech datasets.

Abstract:
Effective contexts for separating shadows from non-shadow objects can appear at different scales due to different object sizes. This paper introduces a new module, Effective-Context Augmentation (ECA), to utilize these contexts for robust shadow detection with deep structures. Taking regular deep features as global references, ECA enhances the discriminative features from the fine-scale features computed in parallel and, by boosting them, obtains robust features embedded with effective object contexts. We further propose a novel encoder-decoder style shadow detection method where ECA acts as the main building block of the encoder to extract strong feature representations and as the guidance for the classification process of the decoder. Moreover, the networks are optimized with only one loss, which makes them easy to train and avoids the instability caused by the extra losses superimposed on intermediate features in existing popular studies. Experimental results show that the proposed method can effectively eliminate false detections. In particular, our method outperforms state-of-the-art methods and improves by over 13.97% and 34.67% on the challenging SBU and UCF datasets, respectively, in balance error rate.
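For readers unfamiliar with the metric named above, the sketch below computes the balance error rate (BER) in the form commonly used in shadow detection, where lower is better. The abstract does not spell out the formula, so this common definition and the pixel counts in the example are assumptions.

```python
def balance_error_rate(tp, fn, tn, fp):
    """Balance error rate in percent, from pixel counts.

    tp / fn: shadow pixels labelled correctly / missed.
    tn / fp: non-shadow pixels labelled correctly / wrongly marked as shadow.
    BER = 100 * (1 - 0.5 * (TP / (TP + FN) + TN / (TN + FP))).
    """
    shadow_recall = tp / (tp + fn)
    non_shadow_recall = tn / (tn + fp)
    return 100.0 * (1.0 - 0.5 * (shadow_recall + non_shadow_recall))


# Hypothetical counts: 90% of shadow and 80% of non-shadow pixels are correct.
print(round(balance_error_rate(tp=90, fn=10, tn=80, fp=20), 2))  # 15.0
```

Averaging the two per-class rates keeps the score from being dominated by the non-shadow class, which usually covers far more pixels than the shadow class.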

Abstract:
Medical imaging has been critically important for the health and well-being of millions of patients. Although deep learning has been widely studied in the medical imaging area and its performance has exceeded human performance in certain medical diagnostic tasks, detecting and diagnosing lesions still depends on the visual system of human observers (radiologists), who have completed years of training to scrutinize anomalies. Routinely, radiologists sequentially read batches of medical images one after the other. A basic underlying assumption of radiologists' precise diagnosis is that their perceptions and decisions on a current medical image are completely independent from the previous reading history of medical images. However, recent research proposed that the human visual system has visual serial dependencies (VSDs) at many levels. VSD means that what was seen in the past influences (and captures) what is seen and reported at this moment. Our pilot data via naive artificial stimuli has shown that VSD has a disruptive effect in radiologic searches that impairs accurate detection and recognition of tumors or other structures. However, the naive artificial stimuli have been noted by both untrained observers and expert radiologists to be less authentic. In this project, we will generate authentic medical images via Generative Adversarial Networks (GANs) in order to replace the simple stimuli in future experiments. The rationale for the proposed research project is that once it is known how serial dependence arises and how it impacts visual search, we can understand how to control for it. Hence, the accuracy of diagnosis via medical imaging can significantly improve. The specific goals of this project are to establish, identify and mitigate the impact of VSD on visual search tasks in clinical settings.

Abstract:
Compared with tedious per-pixel mask annotation, it is much easier to annotate data by clicks, which costs only several seconds per image. However, applying clicks to learn a video semantic segmentation model has not been explored before. In this work, we propose an effective weakly-supervised video semantic segmentation pipeline with click annotations, called WeClick, which saves laborious annotation effort by segmenting an instance of the semantic class with only a single click. Since detailed semantic information is not captured by clicks, directly training with click labels leads to poor segmentation predictions. To mitigate this problem, we design a novel memory flow knowledge distillation strategy to exploit temporal information (named memory flow) in abundant unlabeled video frames, by distilling the neighboring predictions to the target frame via estimated motion. Moreover, we adopt vanilla knowledge distillation for model compression. In this way, WeClick learns compact video semantic segmentation models from low-cost click annotations during the training phase, yet yields real-time and accurate models at inference. Experimental results on Cityscapes and CamVid show that WeClick outperforms the state-of-the-art methods, improves over the baseline by 10.24% mIoU, and achieves real-time execution.

Abstract:
Multi-person pose estimation is an attractive and challenging task. Existing methods are mostly based on two-stage frameworks, which include top-down and bottom-up methods. Two-stage methods either suffer from high computational redundancy for additional person detectors or they need to group keypoints heuristically after predicting all the instance-agnostic keypoints. The single-stage paradigm aims to simplify the multi-person pose estimation pipeline and receives a lot of attention. However, recent single-stage methods have the limitation of low performance due to the difficulty of regressing various full-body poses from a single feature vector. Different from previous solutions that involve complex heuristic designs, we present a simple yet effective solution by employing instance-aware dynamic networks. Specifically, we propose an instance-aware module to adaptively adjust (part of) the network parameters for each instance. Our solution can significantly increase the capacity and adaptive-ability of the network for recognizing various poses, while maintaining a compact end-to-end trainable pipeline. Extensive experiments on the MS-COCO dataset demonstrate that our method achieves significant improvement over existing single-stage methods, and makes a better balance of accuracy and efficiency compared to the state-of-the-art two-stage approaches.

Abstract:
The growing application of face images and modern AI technology has raised another important concern in privacy protection. In many real scenarios like scientific research, social sharing and commercial application, lots of images are released without privacy processing to protect people's identity. In this paper, we develop a novel effective de-identification generative adversarial network (DeIdGAN) for face anonymization by seamlessly replacing a given face image with a different synthesized yet realistic one. Our approach consists of two steps. First, we anonymize the input face to obfuscate its original identity. Then, we use our designed de-identification generator to synthesize an anonymized face. During the training process, we leverage a pair of identity-adversarial discriminators to explicitly constrain identity protection by pushing the synthesized face away from the predefined sensitive faces to resist re-identification and identity invasion. Finally, we validate the effectiveness of our approach on public datasets. Compared with existing methods, our approach can not only achieve better identity protection rates but also preserve superior image quality and data reusability, which suggests the state-of-the-art performance.

Abstract:
Skeleton-based action recognition has been widely investigated considering its strong adaptability to dynamic circumstances and complicated backgrounds. To recognize different actions from skeleton sequences, it is essential to model the posture of the human represented by the skeleton and its changes in the temporal dimension. However, most existing works treat skeleton sequences in the temporal and spatial dimensions in the same way, ignoring the difference between the two dimensions in skeleton data, which is not an optimal way to model skeleton sequences. We propose to model the posture represented by the skeleton in each frame individually, while also capturing the movement of the entire skeleton in the temporal dimension. To this end, we design a Spatial Transformer Block and a Directional Temporal Transformer Block for modeling skeleton sequences in the spatial and temporal dimensions, respectively. Due to occlusion, sensor limitations, raw video quality, etc., there is noise in both the temporal and spatial dimensions of the extracted skeleton data, which reduces the recognition capabilities of models. To adapt to this imperfect information condition, we propose a multi-task self-supervised learning method that provides confusing samples in different situations to improve the robustness of our model. Combining the above designs, we propose our Spatial-Temporal Specialized Transformer (STST) and conduct experiments with our model on SHREC, NTU-RGB+D, and Kinetics-Skeleton. Extensive experimental results and analysis demonstrate the improved performance of the proposed method.

Abstract:
Point clouds are often noisy and incomplete. Existing completion methods usually generate the complete shapes for missing regions of 3D objects based on deterministic learning frameworks, which only predict a single reconstruction output. However, these methods ignore the ill-posed nature of the completion problem and do not fully account for multiple possible completion predictions corresponding to one incomplete input. To address this problem, we propose a flow-based network together with a multi-modal mapping strategy for 3D point cloud completion. Specifically, an encoder is first introduced to encode the input point cloud data into a rich latent representation suitable for conditioning in all flow layers. Then we design a conditional normalizing flow architecture to learn the exact distribution of the plausible completion shapes over the multi-modal latent space. Finally, in order to fully utilize additional shape information, we propose a tree-structured decoder to perform the inverse mapping for complete shape generation with high fidelity. The proposed flow network is trained using a single loss, the negative log-likelihood, to capture the distribution variations between input and output, without complex reconstruction loss and adversarial loss. Extensive experiments on the ShapeNet dataset, the KITTI dataset, and measured data demonstrate that our method outperforms the state-of-the-art point cloud completion methods in both qualitative and quantitative analysis.

Abstract:
Creating a cohesive, high-quality, relevant media story is a challenge that news media editors face on a daily basis. This challenge is aggravated by the flood of highly relevant information that is constantly pouring into the newsroom. To assist news media editors in this daunting task, this paper proposes a framework to organize news content into cohesive, high-quality, relevant visual storylines. First, we formalize, in a nonsubjective manner, the concept of visual story transition. Leveraging it, we propose four graph-based methods of storyline creation, aiming for global story cohesiveness. These were created and implemented to take full advantage of existing graph algorithms, ensuring their correctness and good computational performance. They leverage a strong ensemble-based estimator which was trained to predict story transition quality based on both the semantic and visual features present in the pair of images under scrutiny. A user study covered a total of 28 curated stories about sports and cultural events. Experiments showed that (i) visual transitions in storylines can be learned with a quality above 90%, and (ii) the proposed graph methods can produce cohesive storylines with a quality in the range of 88% to 96%.

Abstract:
The recently emerged weakly-supervised object localization (WSOL) methods can learn to localize an object in the image using only image-level labels. Previous works endeavor to perceive the interval objects from the small and sparse discriminative attention map, yet ignore the co-occurrence confounder (e.g., duck and water), which makes it hard for model inspection (e.g., CAM) to distinguish between the object and its context. In this paper, we make an early attempt to tackle this challenge via causal intervention (CI). Our proposed method, dubbed CI-CAM, explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps, thus improving the accuracy of object localization. Extensive experiments on several benchmarks demonstrate the effectiveness of CI-CAM in learning clear object boundaries from confounding contexts. In particular, on CUB-200-2011, which severely suffers from the co-occurrence confounder, CI-CAM significantly outperforms the traditional CAM-based baseline (58.39% vs 52.4% in Top-1 localization accuracy). In more general scenarios such as ILSVRC 2016, CI-CAM also performs on par with the state of the art.

Abstract:
Conventional machine learning models are often vulnerable to samples whose distributions differ from those of the training samples, which is known as domain shift. Domain Generalization (DG) tackles this issue by training a model on multiple source domains and generalizing it to arbitrary unseen target domains. In spite of the remarkable results achieved in DG, a majority of existing works lack a deep understanding of the feature representations learned in DG models, resulting in limited generalization ability when facing out-of-distribution domains. In this paper, we aim to learn a domain transformation space via a domain transformer network (DTN) which explicitly mines the relationship among multiple domains and constructs transferable feature representations for downstream tasks by interpreting each feature as a semantically weighted combination of multiple domain-specific features. Our DTN is encouraged to meta-learn the properties and characteristics of domains during the training process based on multiple seen domains, making the transformed feature representations more semantic and thus generalizing better to unseen domains. Once the model is constructed, the feature representations of unseen target domains can also be inferred adaptively by selectively combining the feature representations from the diverse set of seen domains. We conduct extensive experiments on five DG benchmarks and the results strongly demonstrate the effectiveness of our approach.

Abstract:
Despite recent progress on image-based virtual try-on, current methods are constrained by shared warping networks and thus fail to synthesize natural try-on results when faced with clothing categories that require different warping operations. In this paper, we address this problem by finding clothing category-specific warping networks for the virtual try-on task via Neural Architecture Search (NAS). We introduce a NAS-Warping Module and elaborately design a bilevel hierarchical search space to identify the optimal network-level and operation-level flow estimation architecture. Given the network-level search space, containing different numbers of warping blocks, and the operation-level search space with different convolution operations, we jointly learn a combination of repeatable warping cells and convolution operations specifically for the clothing-person alignment. Moreover, a NAS-Fusion Module is proposed to synthesize more natural final try-on results, which is realized by leveraging particular skip connections to produce better-fused features that are required for seamlessly fusing the warped clothing and the unchanged person part. We adopt an efficient and stable one-shot searching strategy to search the above two modules. Extensive experiments demonstrate that our WAS-VTON significantly outperforms the previous fixed-architecture try-on methods with more natural warping results and virtual try-on results.

Abstract:
In face detection, it is a common strategy to treat samples differently according to their difficulty in order to balance the training data distribution. However, we observe that widely used sampling strategies, such as OHEM and Focal loss, can lead to a performance imbalance between different tasks (e.g., classification and localization). Through analysis, we point out that, because they are driven by classification information, these sample-based strategies struggle to coordinate the attention of different tasks during training, thus leading to the above imbalance. Accordingly, we first confirm this by shifting the attention from the sample level to the task level. Then, we propose a fine-grained task attention method, a.k.a. FTA, including inter-task importance and intra-task importance, which adaptively adjusts the attention of each item in the task from both global and local perspectives, so as to achieve finer optimization. In addition, we introduce a transformer as a feature enhancer to assist our convolution network, and propose a context enhancement transformer, a.k.a. CET, to mine the spatial relationship in the features towards more robust feature representation. Extensive experiments on WiderFace and FDDB benchmarks demonstrate that our method significantly boosts the baseline performance by 2.7%, 2.3% and 4.9% on easy, medium and hard validation sets respectively. Furthermore, the proposed FTAFace-light achieves higher accuracy than the state-of-the-art and reduces the amount of computation by 28.9%.
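
As background for the sample-level strategies mentioned above, the following minimal sketch shows how the standard binary focal loss down-weights easy samples relative to hard ones. It illustrates only the widely used Focal loss formulation, not the FTA method of this paper; the function name and the example probabilities are illustrative assumptions.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single sample.

    p: predicted probability of the positive class (0 < p < 1)
    y: ground-truth label, 0 or 1
    gamma, alpha: focusing and balancing parameters (common defaults)
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The factor (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Illustrative values: a confidently correct sample versus a poorly scored one.
easy = focal_loss(0.95, 1)
hard = focal_loss(0.30, 1)
```

Because the modulating factor depends only on the classification score, the weighting it induces is driven by classification information, which is the property the abstract identifies as a source of task imbalance.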

Abstract:
In this companion paper, we provide the details of the reproducibility artifacts of the paper "Visual Relation of Interest Detection" presented at MM'20. Visual Relation of Interest Detection (VROID) aims to detect visual relations that are important for conveying the main content of an image. In this paper, we explain the file structure of the source code and publish the details of our ViROI dataset, which can be used to retrain the model with custom parameters. We also detail the scripts for component analysis and comparison with other methods and list the parameters that can be modified for custom training and inference.

Abstract:
With the rapid development of digital technologies such as VR, AR, XR, and more importantly the almost ubiquitous mobile broadband coverage, we are entering an Integrated Physical-Digital World (IPhD), the tight integration of the virtual world with the physical world. The IPhD is characterized by four key technologies: Virtualization of the physical world, Realization of the virtual world, Holographic internet, and Intelligent Agent. The Internet will continue its development with faster speed and broader bandwidth, and will eventually be able to communicate holographic contents including 3D shape, appearance, spatial audio, touch sensing and smell. Intelligent agents, such as digital humans and digital/physical robots, travel between the digital and physical worlds. In this talk, we will describe our work on digital humans for this IPhD world. This includes: computer vision techniques for building digital humans, multimodal text-to-speech synthesis (voice and lip shapes), speech-driven face animation, neural-network-based body motion control, human-digital-human interaction, and an emotional video game anchor.

Abstract:
A classifier trained on one dataset rarely works on other datasets obtained under different conditions because of domain shift. Such a problem is usually solved by domain adaptation methods. In this paper, we propose a novel unsupervised domain adaptation (UDA) method based on Interchangeable Batch Normalization (InterBN) to fuse different channels in deep neural networks for adversarial domain adaptation. Specifically, we first observe that the channels with a small batch normalization scaling factor have less influence on the whole domain adaptation, followed by a theoretical proof that the scaling factors for some channels will definitely come close to zero when imposing a sparsity regularization. Then, we replace the channels that have smaller scaling factors in the source domain with the mean of the channels which have larger scaling factors in the target domain or vice versa. Such a simple but effective channel fusion scheme can drastically increase the domain adaptation ability. Extensive experimental results show that our InterBN outperforms the current adversarial domain adaptation methods by a large margin on four visual benchmarks. In particular, InterBN achieves a remarkable improvement of 7.7% over the conditional adversarial adaptation networks (CDAN) on VisDA-2017 benchmark.
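
The channel replacement step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the fixed `threshold` separating "small" from "large" scaling factors, the function name, and the flat per-channel activation lists are all hypothetical choices made for readability.

```python
def interchange_channels(src, tgt, gamma_src, gamma_tgt, threshold=0.1):
    """Replace weak source channels with the mean of strong target channels.

    src, tgt: lists of channels; each channel is a list of activations
              (all channels have the same length).
    gamma_src, gamma_tgt: batch normalization scaling factors per channel.
    threshold: assumed cut-off below which a scaling factor counts as small.
    """
    # Target channels whose scaling factor is considered large.
    strong_tgt = [ch for ch, g in zip(tgt, gamma_tgt) if abs(g) >= threshold]
    if not strong_tgt:
        return [list(ch) for ch in src]
    # Element-wise mean over the strong target channels.
    mean_strong = [sum(vals) / len(strong_tgt) for vals in zip(*strong_tgt)]
    return [
        list(mean_strong) if abs(g) < threshold else list(ch)
        for ch, g in zip(src, gamma_src)
    ]


# Toy example with three channels of two activations each.
src = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
tgt = [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]]
fused = interchange_channels(src, tgt, [0.9, 0.01, 0.5], [0.8, 0.02, 0.6])
```

In this toy case only the second source channel has a small scaling factor, so it is replaced by the mean of the first and third target channels. The "vice versa" direction follows by swapping the roles of the two domains.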

Abstract:
Emotion plays a critical role in calligraphy composition; it makes a calligraphy artwork impressive and gives it a soul. However, previous research on calligraphy generation has neglected emotion as a major contributor to the artistry of calligraphy. This omission prevents such methods from generating aesthetic, stylistic, and diverse calligraphy artworks; they produce only static handwriting font libraries instead. To address this problem, we propose a novel cross-modal approach to automatically generate stylistic and diverse Chinese calligraphy artwork driven by different emotions. We first detect the emotions in the text with a classifier, then generate the emotional Chinese character images via a novel modified Generative Adversarial Network (GAN) structure, and finally predict the layout for all character images with a recurrent neural network. We also collect a large-scale stylistic Chinese calligraphy image dataset with rich emotions. Experimental results demonstrate that our model outperforms all baseline image translation models significantly for different emotional styles in terms of content accuracy and style discrepancy. Besides, our layout algorithm can also learn the patterns and habits of calligraphers, making the generated calligraphy more artistic. To the best of our knowledge, we are the first to work on emotion-driven discourse-level Chinese calligraphy artwork composition.

Abstract:
Pose-guided person image synthesis aims to synthesize person images by transforming reference images into target poses. In this paper, we observe that the commonly used spatial transformation blocks have complementary advantages. We propose a novel model by combining the attention operation with the flow-based operation. Our model not only takes advantage of the attention operation to generate accurate target structures but also uses the flow-based operation to sample realistic source textures. Both objective and subjective experiments demonstrate the superiority of our model. Meanwhile, comprehensive ablation studies verify our hypotheses and show the efficacy of the proposed modules. Besides, additional experiments on the portrait image editing task demonstrate the versatility of the proposed combination.

Abstract:
Estimating human pose is an important yet challenging task in multimedia applications. Existing pose estimation libraries target reproducing standard pose estimation algorithms. When it comes to customising these algorithms for real-world applications, none of the existing libraries can offer both the flexibility of developing custom pose estimation algorithms and high performance when executing these algorithms on commodity devices. In this paper, we introduce Hyperpose, a novel flexible and high-performance pose estimation library. Hyperpose provides expressive Python APIs that enable developers to easily customise pose estimation algorithms for their applications. It further provides a model inference engine highly optimised for real-time pose estimation. This engine can dynamically dispatch carefully designed pose estimation tasks to CPUs and GPUs, thus automatically achieving high utilisation of hardware resources irrespective of deployment environments. Extensive evaluation results show that Hyperpose can achieve up to 3.1x~7.3x higher pose estimation throughput compared to state-of-the-art pose estimation libraries without compromising estimation accuracy. By 2021, Hyperpose has received over 1000 stars on GitHub and attracted users from both industry and academia.

Abstract:
We introduce tf_geometric, an efficient and friendly library for graph deep learning, which is compatible with both TensorFlow 1.x and 2.x. It provides kernel libraries for building Graph Neural Networks (GNNs) as well as implementations of popular GNNs. The kernel libraries consist of infrastructures for building efficient GNNs, including graph data structures, graph map-reduce framework, graph mini-batch strategy, etc. These infrastructures enable tf_geometric to support single-graph computation, multi-graph computation, graph mini-batch, distributed training, etc.; therefore, tf_geometric can be used for a variety of graph deep learning tasks, such as node classification, link prediction, and graph classification. Based on the kernel libraries, tf_geometric implements a variety of popular GNN models. To facilitate the implementation of GNNs, tf_geometric also provides some other libraries for dataset management, graph sampling, etc. Different from existing popular GNN libraries, tf_geometric provides not only Object-Oriented Programming (OOP) APIs, but also Functional APIs, which enable tf_geometric to handle advanced tasks such as graph meta-learning. The APIs are friendly and suitable for both beginners and experts.

Abstract:
We introduce PyTorchVideo, an open-source deep-learning library that provides a rich set of modular, efficient, and reproducible components for a variety of video understanding tasks, including classification, detection, self-supervised learning, and low-level processing. The library covers a full stack of video understanding tools including multimodal data loading, transformations, and models that reproduce state-of-the-art performance. PyTorchVideo further supports hardware acceleration that enables real-time inference on mobile devices. The library is based on PyTorch and can be used by any training framework; for example, PyTorchLightning, PySlowFast, or Classy Vision. PyTorchVideo is available at https://pytorchvideo.org/.

Abstract:
Domain adaptive image retrieval (DAR) aims to train a model with a well-labeled source domain and target images in order to retrieve source instances given query target samples from the identical category space. However, in practical scenarios it is infeasible to manually annotate all retrieved images due to the huge labeling cost. Motivated by this realistic demand, we first define the semi-supervised domain adaptive retrieval (SDAR) problem, assuming the database includes a small proportion of annotated source images and abundant unlabeled ones. To tackle the challenging SDAR problem, this paper proposes a novel method named Discriminative Hashing learning (DHLing) which mainly includes two modules, i.e., domain-specific optimization and domain-invariant memory bank. Specifically, the first component explores the structural knowledge of samples to assign pseudo labels to the unlabeled images and achieve hash coding consistency, while the second one attempts to construct the domain-invariant memory bank to guide the feature generation and achieve cross-domain alignment. Experimental results on several popular cross-domain retrieval benchmarks illustrate the effectiveness of our proposed DHLing on both conventional DAR and new SDAR scenarios by comparing with the state-of-the-art retrieval methods.

Abstract:
There is a soaring interest in the news recommendation research scenario due to the information overload. To accurately capture users' interests, we propose to model multi-modal features, in addition to the news titles that are widely used in existing works, for news recommendation. Besides, existing research pays little attention to the click decision-making process in designing multi-modal modeling modules. In this work, inspired by the fact that users make their click decisions mostly based on the visual impression they perceive when browsing news, we propose to capture such visual impression information with visual-semantic modeling for news recommendation. In this paper, we refer to visual impression as the region of the news displayed on the user interface of a news application, which delivers both content and layout information to users. Specifically, we devise the local impression modeling module to simultaneously attend to decomposed details in the impression when understanding the semantic meaning of news title, which could explicitly get close to the process of users reading news. In addition, we inspect the impression from a global view and take structural information, such as the arrangement of different fields and spatial position of different words on the impression, into the modeling of multiple modalities. To accommodate the research of visual impression-aware news recommendation, we extend the text-dominated news recommendation dataset MIND by adding snapshot impression images and will release it to nourish the research field. Extensive comparisons with the state-of-the-art news recommenders along with the in-depth analyses demonstrate the effectiveness of the proposed method and the promising capability of modeling visual impressions for the content-based recommenders.

Abstract:
With the popularity of computer vision technology in monitoring systems, there is increasing societal concern about intrusions on people's privacy, as the captured images/videos may contain identity-related information, e.g., people's faces. Existing methods for protecting such privacy focus on removing the identity-related information from faces. However, this weakens the utility of current monitoring systems. In this paper, we develop a face anonymization framework that can obfuscate visual appearance while preserving identity discriminability. The framework is composed of two parts: an identity-aware region discovery module and an identity-aware face confusion module. The former adaptively locates the identity-independent attributes on human faces, and the latter generates the privacy-preserving faces using original faces and discovered facial attributes. To optimize the face generator, we employ a multi-task loss function, which consists of discriminator loss, identity-preserving loss, and reconstruction loss functions. Our model can achieve a balance between recognition utility and appearance anonymization by modifying different numbers of facial attributes according to practical demands, and can provide a variety of results. Extensive experiments conducted on two public benchmarks, Celeb-A and VGG-Face2, demonstrate the effectiveness of our model under distinct face recognition scenarios.

Abstract:
Heterogeneous information networks (HINs) have become a popular tool to capture complicated user-item relationships in recommendation problems in recent years. As a typical instantiation of HINs, meta-path is introduced in search of higher-level representations of user-item interactions. Though remarkable success has been achieved along this direction, existing meta-path-based recommendation methods face at least one of the following issues: 1) existing methods merely adopt simple meta-path fusion rules, which might be insufficient to exclude inconsistent information of different meta-paths that may hurt model performance; 2) the representative power is limited by shallow/stage-wise formulations. To solve these issues, we propose an end-to-end and unified embedding-based recommendation framework with graph-based learning. To address 1), we propose a flexible fusion module to integrate meta-path-based similarities into relative similarities between users and items. To address 2), we take advantage of the powerful representative ability of deep neural networks to learn more complicated and flexible latent embeddings. Finally, empirical studies on real-world datasets demonstrate the effectiveness of our proposed method.
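
To make the notion of a meta-path-based similarity concrete, the sketch below counts path instances along the user-item-user (U-I-U) and user-item-user-item (U-I-U-I) meta-paths by multiplying interaction matrices. It is a generic illustration of meta-path counting, not the fusion module proposed in this paper; the toy interaction matrix and helper names are assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


# Toy user-item interaction matrix: 2 users (rows) by 3 items (columns).
user_item = [
    [1, 1, 0],
    [0, 1, 1],
]

# U-I-U: number of items shared by each pair of users.
uiu = matmul(user_item, transpose(user_item))

# U-I-U-I: number of paths from a user to an item through a co-interacting user.
uiui = matmul(uiu, user_item)
```

Here the first user never interacted with the third item, yet the U-I-U-I count for that pair is nonzero because the second user shares an item with the first. Different meta-paths yield different such similarity matrices, which is why a rule for fusing them is needed.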

Abstract:
The 360-degree video (omnidirectional video) has become popular recently due to its capability of providing immersive experience, which is generally achieved via spherical moving pictures with freedom of viewpoint changing. Nevertheless, the support of full-view visual contents has inevitably reshaped its perceptual quality metric and dramatically increased its bitrate output after video coding. Therefore, in 360-degree video coding, the Rate Control (RC) problem, which aims to maximize the resulting perceptual quality under a bitrate constraint, has become a challenging task yet to be addressed. In this paper, we observe a latitude-based bitrate discrepancy in equirectangular-projected 360-degree video coding and further utilize this feature in bitrate allocation under panoramic vision. We introduce game theory to find optimal inter/intra-frame bit allocations that maximize the overall RC performance in terms of a utility function. Finally, an overall framework is proposed that is capable of providing both improved bitrate accuracy and enhanced perceptual quality. Experimental results demonstrate the efficiency of the proposed method, with promising RC performance for 4K and 8K 360-degree videos.
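
One way to see why latitude matters in equirectangular projection is the cosine weighting used by spherically weighted quality metrics: rows near the poles are horizontally over-sampled and therefore cover less area on the sphere. The sketch below computes such per-row weights and splits a bit budget proportionally. The proportional split is a simplifying assumption for illustration only; it is not the game-theoretic allocation proposed in the paper.

```python
import math


def latitude_weights(height):
    """Cosine-of-latitude weight for each pixel row of an equirectangular frame."""
    return [
        math.cos((row + 0.5 - height / 2) * math.pi / height)
        for row in range(height)
    ]


def allocate_bits(total_bits, height):
    """Split a bit budget across rows in proportion to their spherical weight."""
    weights = latitude_weights(height)
    total_weight = sum(weights)
    return [total_bits * w / total_weight for w in weights]


# Toy frame with four rows: the two middle rows lie closest to the equator.
weights = latitude_weights(4)
bits = allocate_bits(1000.0, 4)
```

Rows close to the equator receive the largest share, while polar rows receive the smallest.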

Abstract:
Guided depth super-resolution is a practical task where a low-resolution and noisy input depth map is restored to a high-resolution version, with the help of a high-resolution RGB guide image. Existing methods usually view this task as a generalized guided filtering problem that relies on designing explicit filters and objective functions, or a dense regression problem that directly predicts the target image via deep neural networks. These methods suffer from either model capability or interpretability. Inspired by the recent progress in implicit neural representation, we propose to formulate the guided super-resolution as a neural implicit image interpolation problem, where we take the form of a general image interpolation but use a novel Joint Implicit Image Function (JIIF) representation to learn both the interpolation weights and values. JIIF represents the target image domain with spatially distributed local latent codes extracted from the input image and the guide image, and uses a graph attention mechanism to learn the interpolation weights at the same time in one unified deep implicit function. We demonstrate the effectiveness of our JIIF representation on guided depth super-resolution task, significantly outperforming state-of-the-art methods on three public benchmarks. Code can be found at https://git.io/JC2sU

Abstract:
360° videos, a.k.a. spherical videos, are becoming popular among users. Nevertheless, the omnidirectional view of these videos demands high bandwidth and processing power at the end devices. Recently proposed viewport-aware streaming mechanisms can reduce the amount of data transmitted by streaming a limited portion of the frame covering the current user viewport (VP). However, they still suffer from sending a high amount of redundant data, as the fixed tile mechanisms cannot provide a finer granularity to the user VP. Although making the tiles smaller can provide a finer granularity for the user viewport, it significantly increases encoding-decoding overhead. To overcome this trade-off, in this paper, we present an adaptive tiling mechanism based on a computational geometric approach, named VASTile, which takes visual attention information on a 360° video frame as the input and provides a suitable non-overlapping variable size tile cover on the frame. Experimental results show that VASTile can save up to 31.1% of pixel redundancy before compression and 35.4% of bandwidth compared to recently proposed fixed tile configurations, providing tile schemes within a time frame of 0.98 (±0.11) seconds.

Abstract:
GPUs have become ubiquitous in the Cloud-rendering areas due to the outstanding rendering performance. However, many existing Cloud-rendering systems suffer from low GPU utilization caused by the CPU bottleneck. Recent proposals (e.g., API-forwarding and c-GPU) for GPU-usage optimization are promising but fail to address the system-resource redundancy issues (i.e., each instance tends to occupy all the system resources exceeding their requirements), leading to unnecessary CPU consumption and lowering GPU utilization. We conducted an experiment by testing real-world applications on the percentage of unused resources to demonstrate the severity of this issue. Nearly 50% of resources are unused.

Abstract:
Three-degrees-of-freedom (3-DoF) omnidirectional imaging has been widely used in various applications ranging from street maps to 3-DoF VR live broadcasting. Although it allows for navigating viewpoints rotationally inside a virtual world, it does not provide motion parallax, which is key for human 3D perception. Recent research mitigates this problem by introducing 3 translational degrees of freedom (6-DoF) using multi-sphere images (MSI), which are beginning to show promise in handling occlusions and reflective objects. However, the design of MSI naturally limits the range of authentic 6-DoF experiences, as existing mechanisms for MSI rendering cannot fully utilize multi-layer information when synthesizing novel views between multiple MSIs. To tackle this problem and extend the 6-DoF range, we propose an MSI interpolation pipeline that utilizes adjacent MSIs' 3D information embedded inside their layers. In this work, we describe an MSI projection scheme along with an MSI interpolation network to predict intermediate MSIs in order to meet the need for extended range. We demonstrate that our system significantly improves the range of 6-DoF experience compared with other MSI-based methods. With extensive experiments, we show our algorithm outperforms state-of-the-art methods both qualitatively and quantitatively in synthesizing novel view panoramas.

Abstract:
4D reconstruction of human-object interaction is critical for immersive VR/AR experience and human activity understanding. Recent advances still fail to recover fine geometry and texture results from sparse RGB inputs, especially under challenging human-object interactions scenarios. In this paper, we propose a neural human performance capture and rendering system to generate both high-quality geometry and photo-realistic texture of both human and objects under challenging interaction scenarios in arbitrary novel views, from only sparse RGB streams. To deal with complex occlusions raised by human-object interactions, we adopt a layer-wise scene decoupling strategy and perform volumetric reconstruction and neural rendering of the human and object. Specifically, for geometry reconstruction, we propose an interaction-aware human-object capture scheme that jointly considers the human reconstruction and object reconstruction with their correlations. Occlusion-aware human reconstruction and robust human-aware object tracking are proposed for consistent 4D human-object dynamic reconstruction. For neural texture rendering, we propose a layer-wise human-object rendering scheme, which combines direction-aware neural blending weight learning and spatial-temporal texture completion to provide high-resolution and photo-realistic texture results in the occluded scenarios. Extensive experiments demonstrate the effectiveness of our approach to achieve high-quality geometry and texture reconstruction in free viewpoints for challenging human-object interactions.

Abstract:
This paper presents the 1st place solution to the Grand Challenge of ACM MM2021 Robust Logo Detection. We build our end-to-end solution on top of Cascade RCNN (using Res2Net101 as backbone). Through careful observation during training, we find that the model performance is limited by imbalanced gradients from different classes of the long-tailed dataset. We adopt a gradient balancing approach to tackle this problem. Our approach reweighs the gradients of each class to guide the training process towards a balance between all classes. Moreover, we design a series of data augmentation policies and propose a progressive data augmentation strategy to train our model to deal with adversarial samples. We demonstrate the accuracy and robustness of our method by achieving 70.2448 mAP on leaderboard A, and 63.8793 mAP on leaderboard B, which contains adversarial images.

Abstract:
Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions, such as presentation form, scene, and style. Different from real-life videos, video advertisements contain sufficient and useful multi-modal content like caption and speech, which provides crucial video semantics and would enhance the structuring process. In this paper, we propose a multi-modal encoder to learn multi-modal representation from video advertisements by interacting between video-audio and text. Based on multi-modal representation, we then apply Boundary-Matching Network to generate temporal proposals. To make the proposals more accurate, we refine generated proposals by scene-guided alignment and re-ranking. Finally, we incorporate proposal located embeddings into the introduced multi-modal encoder to capture temporal relationships between local features of each proposal and global features of the whole video for classification. Experimental results show that our method achieves significant improvement compared with several baselines and Rank 1 on the task of Multi-modal Ads Video Understanding in ACM Multimedia 2021 Grand Challenge. Ablation study further shows that leveraging multi-modal content like caption and speech in video advertisements significantly improves the performance.

Abstract:
In this paper, we present our solution to the Multi-modal Ads Video Tagging Challenge of Tencent Advertising Algorithm Competition in ACM Multimedia 2021 Grand Challenges. We extend the baseline model by redesigning the visual feature extraction procedure and we modify the loss function to cope with sparse positive targets. Moreover, we propose Semi-supervised Learning with Negative Masking to leverage both labeled data and unlabeled data from the preliminary contest, which effectively enhances the training process. We further utilize Cross-Class Relevance Learning to boost the performance. We achieve a 0.8237 GAP score via model ensemble and rank second among all submissions in the challenge.

Abstract:
Artificial mediators are promising to support human group conversations but at present their abilities are limited by insufficient progress in group behaviour analysis. The MultiMediate challenge addresses, for the first time, two fundamental group behaviour analysis tasks in well-defined conditions: eye contact detection and next speaker prediction. For training and evaluation, MultiMediate makes use of the MPIIGroup Interaction dataset consisting of 22 three- to four-person discussions as well as of an unpublished test set of six additional discussions. This paper describes the MultiMediate challenge and presents the challenge dataset including novel fine-grained speaking annotations that were collected for the purpose of MultiMediate. Furthermore, we present baseline approaches and ablation studies for both challenge tasks.

Abstract:
Previous mainstream video analysis methods, especially 3D CNN-based models, mainly aim to transfer frameworks from the image domain to the video domain, and they follow the regime that has succeeded in image processing, i.e., large-scale benchmarks and deep networks. However, processing videos is still time-consuming due to the increased computational cost. In this paper, we propose to flatten the video and construct a Spatio-temporal Image (STI), i.e., squeezing the temporal dimension into a spatial plane. To pursue video-level modeling and an efficient architecture, we devise a Collective Convolution (CoConv) operation to replace the 2D convolution. With the holistic sampling strategy, this novel operation can extract the video-level spatio-temporal representation. Moreover, we ensure that each CoConv operation has the same number of parameters as the original 2D filter, so we can utilize a 2D network equipped with CoConv to analyze videos without additional computation. To verify the effectiveness of our method for general video analysis, we evaluate it on three typical tasks, i.e., supervised action recognition, self-supervised action recognition, and dynamic texture recognition. Extensive experimental results show that our method can achieve comparable or state-of-the-art performance on these benchmarks while using far less computation than its 3D counterpart.
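
The idea of squeezing the temporal dimension into a spatial plane can be illustrated by tiling the frames of a clip into a single grid image. The row-major grid layout below is an assumption made for illustration; the abstract does not specify how the paper arranges frames in the STI, and the function name is hypothetical.

```python
def flatten_video(frames, cols):
    """Tile T frames (each H rows by W values) into one 2D image.

    Frames are placed in row-major order on a grid with `cols` frames per
    grid row, giving an image of (T / cols * H) rows by (cols * W) columns.
    """
    if len(frames) % cols != 0:
        raise ValueError("number of frames must be divisible by cols")
    height = len(frames[0])
    image = []
    for start in range(0, len(frames), cols):
        group = frames[start:start + cols]
        for r in range(height):
            row = []
            for frame in group:
                row.extend(frame[r])  # concatenate the same pixel row of each frame
            image.append(row)
    return image


# Four 2x2 frames whose pixels all equal the frame index, tiled on a 2x2 grid.
frames = [[[t, t], [t, t]] for t in range(4)]
sti = flatten_video(frames, cols=2)
```

The result is a single 4x4 image that an ordinary 2D network could consume. How a convolution should sample across tile borders in such an image is the question the CoConv operation is designed to answer.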

Abstract:
Human motion understanding and prediction is an integral aspect of our pursuit of machine intelligence and human-machine interaction systems. Current methods typically pursue a kinematics modeling approach, relying heavily upon prior anatomical knowledge and constraints. However, such an approach is hard to generalize to different skeletal model representations, and also tends to be inadequate in accounting for the dynamic range and complexity of motion, thus hindering predictive accuracy. In this work, we propose a novel approach to modeling the motion prediction problem based on stochastic differential equations and path integrals. The motion profile of each skeletal joint is formulated as a basic stochastic variable and modeled with the Langevin equation. We develop a strategy of employing GANs to simulate path integrals that amounts to optimizing over possible future paths. We conduct experiments on two large benchmark datasets, Human 3.6M and CMU MoCap. Notably, our approach achieves a 12.48% accuracy improvement over current state-of-the-art methods on average.
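
For readers unfamiliar with the Langevin equation, the following sketch simulates one coordinate of a single joint with a simple Euler-Maruyama discretization of dv = -gamma * v * dt + sigma * dW, dx = v * dt. It only illustrates how such a stochastic model generates candidate paths; the parameter values, the absence of an external force term, and the function name are assumptions, and the GAN-based path-integral strategy of the paper is not reproduced here.

```python
import math
import random


def simulate_langevin(x0, v0, gamma, sigma, dt, steps, seed=0):
    """Sample one path of a 1D Langevin process with friction and noise.

    x0, v0: initial position and velocity
    gamma:  friction coefficient
    sigma:  noise strength
    dt:     time step; steps: number of steps to simulate
    """
    rng = random.Random(seed)
    x, v = x0, v0
    path = [x]
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        # Friction pulls the velocity toward zero; the noise term perturbs it.
        v += -gamma * v * dt + sigma * math.sqrt(dt) * noise
        x += v * dt
        path.append(x)
    return path


# Several possible future paths of the same joint from different noise seeds.
paths = [simulate_langevin(0.0, 1.0, 0.5, 0.2, 0.1, 10, seed=s) for s in range(3)]
```

Each seed yields a different plausible trajectory from the same initial state, which gives an intuition for why prediction can be framed as optimizing over possible future paths.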

Abstract:
Learning delicate feature representation of object parts plays a critical role in fine-grained visual classification tasks. However, advanced deep convolutional neural networks trained for general visual classification tasks usually tend to focus on the coarse-grained information while ignoring the fine-grained one, which is of great significance for learning discriminative representation. In this work, we explore the great merit of multi-modal data in introducing semantic knowledge and sequential analysis techniques in learning hierarchical feature representation for generating discriminative fine-grained features. To this end, we propose a novel approach, termed Channel Cusum Attention ResNet (CCA-ResNet), for multi-modal joint learning of fine-grained representation. Specifically, we use feature-level multi-modal alignment to connect image and text classification models for joint multi-modal training. Through joint training, image classification models trained with semantic level labels tend to focus on the most discriminative parts, which enhances the cognitive ability of the model. Then, we propose a Channel Cusum Attention (CCA) mechanism to equip feature maps with hierarchical properties through unsupervised reconstruction of local and global features. The benefits brought by the CCA are twofold: a) allowing fine-grained features from early layers to be preserved in the forward propagation of deep networks; b) leveraging the hierarchical properties to facilitate multi-modal feature alignment. We conduct extensive experiments to verify that our proposed model can achieve state-of-the-art performance on a series of fine-grained visual classification benchmarks.

Abstract:
Sign language translation aims at directly translating a sign language video into a natural sentence. The majority of existing methods take the video-sentence pairs labeled by multiple specific signers as training and testing samples. However, such a setting does not fit real-world applications. A practicable sign language translation system is supposed to provide accurate translation results for unseen signers. In this paper, we mainly tackle the signer-independent setting and focus on augmenting the generalization ability of the translation model. To adapt to the challenging setting, we propose a novel framework called contrastive disentangled meta-learning (CDM), which develops several improvements in both deep architecture and training mode. Specifically, based on the minimax entropy objective, a disentangled module with adaptive gated units is developed to decouple the signer-specific and task-specific representation in the encoder. Besides, we facilitate the frame-word alignments by leveraging contrastive constraints between the obtained task-specific representation and the decoding output. The disentangled and contrastive modules could provide complementary information for each other. As for the training mode, we encourage the model to perform well in the simulated signer-independent scenarios by finding the generalized learning directions in the meta-learning process. Considering that vanilla meta-learning methods utilize the multiple specific signers insufficiently, we adopt a fine-grained learning strategy that simultaneously conducts meta-learning in a variety of domain shift scenarios in each iteration. Extensive experiments on the benchmark dataset RWTH-PHOENIX-Weather-2014T (PHOENIX14T) show that CDM could achieve competitive results compared with the state-of-the-art methods.

Abstract:
Glasses removal is a challenging task due to the diversity of glasses types and the difficulty of obtaining paired datasets. Most existing methods need to build different models for different glasses or require expensive paired datasets for supervised training, and thus lack universality. In this paper, we propose a multimodal asymmetric dual learning method for unsupervised glasses removal. This method uses large-scale face images with and without glasses for dual feature learning, which does not require intensive manual marking of the glasses. Given a face image with glasses, we aim to generate a glasses-free image preserving the person identity. Thus, in order to make up for the lack of semantic features in the glasses region, we introduce the text description of the target image into the task, and propose a text-guided multimodal feature fusion method. We adaptively select the glasses-free image closest to the target one for better dual feature learning. We also propose an exchange residual loss to generate a more precise mask of the glasses. Extensive experiments show that our method can generate realistic glasses-free images and better retain the person identity, which can be useful for face recognition.

Abstract:
Video Question Answering (VideoQA) is a challenging problem, as it requires a joint understanding of video and natural language question. Existing methods that perform correlation learning between video and question have achieved great success. However, previous methods merely model relations between individual video frames (or clips) and words, which are not enough to correctly answer the question. From a human's perspective, answering a video question should first summarize both visual and language information, and then explore their correlations for answer reasoning. In this paper, we propose a new method called Pairwise VLAD Interaction Network (PVI-Net) to address this problem. Specifically, we develop a learnable clustering-based VLAD encoder to respectively summarize video and question modalities into a small number of compact VLAD descriptors. For correlation learning, a pairwise VLAD interaction mechanism is proposed to better exploit complementary information for each pair of modality descriptors, avoiding modeling uninformative individual relations (e.g., frame-word and clip-word relations), and exploring both inter- and intra-modality relations simultaneously. Experimental results show that our approach achieves state-of-the-art performance on three VideoQA datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA.
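
To make the idea of summarizing many frame or word features into a few VLAD descriptors concrete, the following is a minimal sketch of soft-assignment VLAD aggregation in plain Python. It is not the PVI-Net encoder: the cluster centers here are fixed inputs rather than learned, and the names `soft_assign`, `vlad`, and the sharpness parameter `alpha` are assumptions introduced for illustration.

```python
import math

def soft_assign(x, centers, alpha=1.0):
    """Softmax over negative squared distances from feature x to each center."""
    logits = [-alpha * sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vlad(features, centers, alpha=1.0):
    """Aggregate N features into K descriptors of weighted residuals, each L2-normalized."""
    dim = len(centers[0])
    desc = [[0.0] * dim for _ in centers]
    for x in features:
        weights = soft_assign(x, centers, alpha)
        for k, c in enumerate(centers):
            for d in range(dim):
                desc[k][d] += weights[k] * (x[d] - c[d])
    normalized = []
    for v in desc:
        norm = math.sqrt(sum(t * t for t in v))
        normalized.append([t / norm if norm > 0 else 0.0 for t in v])
    return normalized

# Four hypothetical 2-D frame features summarized into two descriptors.
frames = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.1]]
centers = [[0.0, 0.0], [1.0, 1.0]]
descriptors = vlad(frames, centers)
```

The number of descriptors equals the number of centers regardless of how many frames or words are fed in, which is what makes pairwise interaction between a small set of video and question descriptors tractable.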

Abstract:
In this paper, we propose a one-stage approach to improve referring expression comprehension (REC), which aims at grounding the referent according to a natural language expression. We observe that humans understand referring expressions in a fine-to-coarse, bottom-up way, and bidirectionally obtain vision-language information between image and text. Inspired by this, we define the language granularity and the vision granularity. However, existing methods do not follow this way of human understanding of referring expressions. Motivated by our observation and to address the limitations of existing methods, we propose a bottom-up and bidirectional alignment (BBA) framework. Our method constructs the cross-modal alignment starting from fine-grained representation to coarse-grained representation and bidirectionally obtains vision-language information between image and text. Based on the structure of BBA, we further propose a progressive visual attribute decomposing approach to decompose visual proposals into several independent spaces to enhance the bottom-up alignment framework. Experiments on five benchmark datasets, RefCOCO, RefCOCO+, ReferItGame, RefCOCOg and Flickr30K, show that our approach obtains +2.16%, +4.47%, +2.85%, +3.44%, and +2.91% improvements over the one-stage SOTA approaches, which validates the effectiveness of our approach.

Abstract:
The challenge in fine-grained visual classification (FGVC) is that the similarity within intra-class may be larger than inter-class, where the discriminative details require more attention than traditional classification tasks. To generate channel-wise complementary and discriminative features in beneficial details of FGVC, we propose a multi-branch channel-wise enhancement network (MCEN), which includes a multi-pattern spatial disruption mechanism, an inter-channel complementarity module (ICM), and a novel soft target loss. The raw images are scrambled in multiple patterns, and the sub-images with different degrees of confusion are then combined into three pairs as inputs, where the scrambling operation forces the channels to look for the discriminative details. The ICM measures the complementarity between key features and overall features to restrain the redundancy of features. The soft target loss is designed for classification, and the semantic relationship between the blocks is learned to judge the degree of chaos of the image. Our multi-branch structure utilizes the shallow visual and deep semantic features to judge the outcome jointly, where the image pairs obtained by segmentation and rearrangement are input into the different branches to extract more complementary features from different patterns of the raw image. Our method is trained end-to-end with only class labels. Experimental results show that our model outperforms the state-of-the-art methods on three fine-grained benchmarks.

Abstract:
For few-shot learning, minimizing the empirical risk cannot reach the optimal hypothesis from image to its label due to the effect of overfitting. Therefore, most of the existing work leverages a set of base classes with sufficient labeled samples to pre-train a general encoder for feature representation, which is then applied for all few-shot classification tasks without considering the uniqueness of the target task. We suppose that different base classes help solve a target task in varying degrees, and some classes even introduce a negative effect. To this end, we propose a Target-guided Base Class Reweighting (TBR) approach, which uses a reweighting-in-the-loop optimization algorithm to assign a set of weights for base classes adaptively given a target task. Specifically, TBR learns the parameters of the encoder by minimizing the weighted empirical risk on base class data, then optimizes the weights according to the encoder's performance on the support set of the target task. Such an alternating optimization procedure brings reweighting into the loop, which makes the encoder more sensitive to the novel classes of the target task. Extensive experiments demonstrate that the proposed method can improve the performance of model-based approaches on two few-shot classification benchmarks.

Abstract:
Recently, deep graph matching (GM) methods have gained increasing attention. These methods integrate graph node embedding, node/edge affinity learning and the final correspondence solver together in an end-to-end manner. For the deep graph matching problem, one main issue is how to generate consensus node embeddings for both source and target graphs that best serve graph matching tasks. In addition, it is also challenging to incorporate the discrete one-to-one matching constraints into the differentiable correspondence solver in a deep matching network. To address these issues, we propose a novel Graph Adversarial Matching Network (GAMnet) for the graph matching problem. GAMnet integrates graph adversarial embedding and graph matching simultaneously in a unified end-to-end network which aims to adaptively learn distribution consistent and domain invariant embeddings for GM tasks. Also, GAMnet exploits sparse GM optimization as the correspondence solver, which is differentiable and can also naturally incorporate the discrete one-to-one matching constraints approximately in the final matching prediction. Experimental results on three public benchmarks demonstrate the effectiveness and benefits of the proposed GAMnet.

Abstract:
Video super-resolution (VSR) and video frame interpolation (VFI) are inter-dependent for enhancing videos of low resolution and low frame rate. However, most studies treat VSR and temporal VFI as independent tasks. In this work, we design a spatial-temporal super-resolution network based on exploring the interaction between VSR and VFI. The main idea is to improve the middle frame of VFI by the super-resolution (SR) frames and feature maps from VSR. In the meantime, VFI also provides extra information for VSR and thus, through interacting, the SR of consecutive frames of the original video can also be improved by the feedback from the generated middle frame. Drawing on this, our approach leverages a simple interaction of VSR and VFI and achieves state-of-the-art performance on various datasets. Due to such a simple strategy, our approach is universally applicable to any existing VSR or VFI networks for effectively improving their video enhancement performance.

Abstract:
Super-resolution (SR) is a well-studied technique for reconstructing high-resolution (HR) images from low-resolution (LR) ones. SR holds great promise for video streaming since an LR video segment can be transmitted from the video server to the client that then reconstructs the HR version using SR, resulting in a significant reduction in network bandwidth. However, SR is seldom used in practice for real-time video streaming, because the computational overhead of frame reconstruction results in large latency and low frame rate.

Abstract:
Objective evaluation (OE) is essential to artificial music, but it is often very hard to determine the quality of OEs. Hitherto, subjective evaluation (SE) remains reliable and prevailing but suffers inevitable disadvantages that OEs may overcome. Therefore, a meta-evaluation system is necessary for designers to test the effectiveness of OEs. In this paper, we present Armor, a complex and cross-domain benchmark dataset that serves this purpose. Since OEs should correlate with human judgment, we provide music as test cases for OEs and human judgment scores as touchstones. We also provide two meta-evaluation scenarios and their corresponding testing methods to assess the effectiveness of OEs. To the best of our knowledge, Armor is the first comprehensive and rigorous framework that future works could follow, take as an example, and improve upon for the task of evaluating computer-generated music and the field of computational music as a whole. By analyzing different OE methods on our dataset, we observe that there is still a huge gap between SE and OE, meaning that hard-coded algorithms are far from capturing human judgment of music.

Abstract:
The Second International Workshop on Human-centric Multimedia Analysis is focused on human-centric analysis using multimedia information. The human-centric multimedia analysis is one of the fundamental and challenging problems of multimedia understanding. It involves various human-centric analysis tasks like face recognition, human pose estimation, person re-identification, human action recognition, person tracking, human-computer interaction, etc. Nowadays, various multimedia sensing devices and large-scale computing infrastructures are generating a wide variety of multi-modality data at a rapid velocity, which supplies rich knowledge to tackle these challenges for human-centric analysis. Researchers and engineers have strived to push the limits of human-centric multimedia analysis in a wide variety of applications, such as smart city, retailing, intelligent manufacturing, and public services. To this end, our workshop aims to provide a platform to promote exchanges and integration for the fields of human analysis and multimedia.

Abstract:
Domain generalization aims to enhance the model robustness against domain shift without accessing the target domain. Since the available source domains for training are limited, recent approaches focus on generating samples of novel domains. Nevertheless, they either struggle with the optimization problem when synthesizing abundant domains or cause the distortion of class semantics. To this end, we propose a novel domain generalization framework where feature statistics are utilized for stylizing original features to ones with novel domain properties. To preserve class information during stylization, we first decompose features into high and low frequency components. Afterward, we stylize the low frequency components with the novel domain styles sampled from the manipulated statistics, while preserving the shape cues in high frequency ones. As the final step, we re-merge both the components to synthesize novel domain features. To enhance domain robustness, we utilize the stylized features to maintain the model consistency in terms of features as well as outputs. We achieve the feature consistency with the proposed domain-aware supervised contrastive loss, which ensures domain invariance while increasing class discriminability. Experimental results demonstrate the effectiveness of the proposed feature stylization and the domain-aware contrastive loss. Through quantitative comparisons, we verify the lead of our method over existing state-of-the-art methods on two benchmarks, PACS and Office-Home.
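
The decompose, stylize, and re-merge steps described above can be illustrated on a one-dimensional feature vector. This is a sketch under stated assumptions, not the authors' implementation: a moving-average filter stands in for the frequency decomposition, the target statistics `new_mean` and `new_std` are passed in directly rather than sampled from manipulated statistics, and the function names are hypothetical.

```python
import math

def box_blur(x, k=3):
    """Moving-average low-pass filter with the window clipped at the borders."""
    half = k // 2
    out = []
    for i in range(len(x)):
        window = x[max(0, i - half):min(len(x), i + half + 1)]
        out.append(sum(window) / len(window))
    return out

def mean_std(x):
    """Population mean and standard deviation of a list of floats."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return mu, math.sqrt(var)

def stylize(x, new_mean, new_std, k=3, eps=1e-6):
    """Re-style only the low-frequency part of x, then add the high-frequency part back."""
    low = box_blur(x, k)                      # low-frequency component
    high = [a - b for a, b in zip(x, low)]    # high-frequency residual, left untouched
    mu, sigma = mean_std(low)
    styled_low = [new_mean + new_std * (v - mu) / (sigma + eps) for v in low]
    return [l + h for l, h in zip(styled_low, high)]

feature = [1.0, 2.0, 4.0, 3.0, 5.0, 7.0]
novel = stylize(feature, new_mean=10.0, new_std=2.0)
```

Because the high-frequency residual is added back unchanged, `novel` differs from `feature` only through the re-scaled low-frequency part, which mirrors the stated goal of changing style while keeping shape cues.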

Abstract:
Significant progress has been made in high-resolution and photo-realistic image generation by Generative Adversarial Networks (GANs). However, the generation process still lacks control, which is crucial for semantic face editing. Furthermore, it remains challenging to edit target attributes and preserve the identity at the same time. In this paper, we propose SSFlow to achieve identity-preserved semantic face manipulation in StyleGAN latent space based on conditional Neural Spline Flows. To further improve the performance of Neural Spline Flows on this task, we also propose a Constractive Squash component and a Blockwise 1 x 1 Convolution layer. Moreover, unlike other conditional flow-based approaches that require facial attribute labels during inference, our method can achieve label-free manipulation in a more flexible way. As a result, our method is able to perform well-disentangled edits along various attributes, and generalizes well to both real and artistic face image manipulation. Qualitative and quantitative evaluations show the advantages of our method for semantic face manipulation over state-of-the-art approaches.

Abstract:
The goal of few-shot fine-grained image classification is to recognize rarely seen fine-grained objects in the query set, given only a few samples of this class in the support set. Previous works focus on learning discriminative image features from a limited number of training samples for distinguishing various fine-grained classes, but ignore one important fact: spatial alignment of the discriminative semantic features between the query image with arbitrary changes and the support image is also critical for computing the semantic similarity between each support-query pair. In this work, we propose an object-aware long-short-range spatial alignment approach, which is composed of a foreground object feature enhancement (FOE) module, a long-range semantic correspondence (LSC) module and a short-range spatial manipulation (SSM) module. The FOE is developed to weaken background disturbance and encourage higher foreground object response. To address the problem of long-range object feature misalignment between support-query image pairs, the LSC is proposed to learn the transferable long-range semantic correspondence by a designed feature similarity metric. Further, the SSM module is developed to refine the transformed support feature after the long-range step to align short-range misaligned features (or local details) with the query features. Extensive experiments have been conducted on four benchmark datasets, and the results show superior performance over most state-of-the-art methods under both 1-shot and 5-shot classification scenarios.

Abstract:
Many previous methods on text-based person retrieval tasks are devoted to learning a latent common space mapping, with the purpose of extracting modality-invariant features from both visual and textual modality. Nevertheless, due to the complexity of high-dimensional data, the unconstrained mapping paradigms are not able to properly catch discriminative clues about the corresponding person while dropping the misaligned information. Intuitively, the information contained in visual data can be divided into person information (PI) and surroundings information (SI), which are mutually exclusive. To this end, we propose a novel Deep Surroundings-person Separation Learning (DSSL) model in this paper to effectively extract and match person information, and hence achieve a superior retrieval accuracy. A surroundings-person separation and fusion mechanism plays the key role in realizing an accurate and effective surroundings-person separation under a mutual exclusion constraint. In order to adequately utilize multi-modal and multi-granular information for a higher retrieval accuracy, five diverse alignment paradigms are adopted. Extensive experiments are carried out to evaluate the proposed DSSL on CUHK-PEDES, which is currently the only accessible dataset for the text-based person retrieval task. DSSL achieves the state-of-the-art performance on CUHK-PEDES. To properly evaluate our proposed DSSL in real scenarios, a Real Scenarios Text-based Person Reidentification (RSTPReid) dataset is constructed to benefit future research on text-based person retrieval, which will be publicly available.

Abstract:
In this paper, we consider the task of action anticipation on egocentric videos. Previous methods ignore explicit modeling of the global context relation among past and future actions, which is not an easy task due to the vacancy of unobserved videos. To solve this problem, we propose a Multimodal Global Relation Knowledge Distillation (MGRKD) framework to distill the knowledge learned from full videos to improve the action anticipation task on partially observed videos. The proposed MGRKD has a teacher-student learning strategy, where both the teacher and the student model have three branches of global relation graph networks (GRGN) to explore the pairwise relations between past and future actions based on three kinds of features (i.e., RGB, motion, and object). The teacher model has a similar architecture to the student model, except that the teacher model uses the true feature of the future video snippet to build the graph in GRGN while the student model uses a progressive GRU to predict an initialized node feature of the future snippet in GRGN. Through the teacher-student learning strategy, the discriminative features and relation knowledge of the past and future actions learned in the teacher model can be distilled to the student model. The experiments on two egocentric video datasets, EPIC-Kitchens and EGTEA Gaze+, show that the proposed framework achieves state-of-the-art performance.

Abstract:
The current research on adversarial attacks aims at a single model while the research on attacking multiple models simultaneously is still challenging. In this paper, we propose a novel black-box attack method, referred to as MBbA, which can attack multiple black-boxes at the same time. By encoding the input image and its target category into an associated space, each decoder seeks the appropriate attack areas from the image through the designed loss functions, and then generates effective adversarial examples. This process realizes end-to-end adversarial example generation without involving substitute models for the black-box scenario. On the other hand, by adopting the adversarial examples generated by MBbA for adversarial training, the robustness of the attacked models is greatly improved. More importantly, those adversarial examples can achieve satisfactory attack performance, even if these black-box models are trained with the adversarial examples generated by other black-box attack methods, which shows good transferability. Finally, extensive experiments show that compared with other state-of-the-art methods: (1) MBbA takes the least time to obtain the most effective attack effects in the multi-black-box attack scenario. Furthermore, MBbA achieves the highest attack success rates in a single black-box attack scenario; (2) the adversarial examples generated by MBbA can effectively improve the robustness of the attacked models and exhibit good transferability.

Abstract:
Is an outfit compatible? Using machine learning methods to assess an outfit's compatibility, namely, fashion compatibility modeling (FCM), has recently become a popular yet challenging topic. However, current FCM studies still perform far from satisfactorily, because they only consider the collocation compatibility modeling, while neglecting the natural human habit that people generally evaluate outfit compatibility from both the collocation (discrete assessment) and the try-on (unified assessment) perspectives. In light of the above analysis, we propose a Collocation and Try-On Network (CTO-Net) for FCM, combining both the collocation and try-on compatibilities. In particular, for the collocation perspective, we devise a disentangled graph learning scheme, where the collocation compatibility is disentangled into multiple fine-grained compatibilities between items; regarding the try-on perspective, we propose an integrated distillation learning scheme to unify all item information in the whole outfit to evaluate the compatibility based on the latent try-on representation. To further enhance the collocation and try-on compatibilities, we exploit the mutual learning strategy to obtain a more comprehensive judgment. Extensive experiments on the real-world dataset demonstrate that our CTO-Net significantly outperforms the state-of-the-art methods. In particular, compared with the competitive counterparts, our proposed CTO-Net significantly improves AUC accuracy from 83.2% to 87.8% and MRR from 15.4% to 21.8%. We have released our source codes and trained models to benefit other researchers.
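
For readers unfamiliar with the two reported metrics, the sketch below shows how AUC and MRR are commonly computed. It uses the standard textbook definitions with made-up scores and ranks; the exact evaluation protocol of CTO-Net (for example, how negative outfits or candidate items are sampled) is not stated in the abstract and is not reproduced here.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

def mrr(ranks):
    """Mean reciprocal rank, where each rank is the 1-based position of the correct item."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical compatibility scores for compatible and incompatible outfits.
example_auc = auc([0.9, 0.8], [0.7, 0.85])
# Hypothetical ranks of the ground-truth item in three retrieval queries.
example_mrr = mrr([1, 2, 4])
```

In this toy example three of the four positive-negative pairs are ordered correctly, so `example_auc` is 0.75, and `example_mrr` is the average of 1, 1/2, and 1/4.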

Abstract:
Graph convolutional networks have significantly improved 3D human pose estimation by representing the human skeleton as an undirected graph. However, this representation fails to reflect the articulated characteristic of human skeletons as the hierarchical orders among the joints are not explicitly presented. In this paper, we propose to represent the human skeleton as a directed graph with the joints as nodes and bones as edges that are directed from parent joints to child joints. By so doing, the directions of edges can explicitly reflect the hierarchical relationships among the nodes. Based on this representation, we further propose a spatial-temporal conditional directed graph convolution to leverage varying non-local dependence for different poses by conditioning the graph topology on input poses. Altogether, we form a U-shaped network, named U-shaped Conditional Directed Graph Convolutional Network, for 3D human pose estimation from monocular videos. To evaluate the effectiveness of our method, we conducted extensive experiments on two challenging large-scale benchmarks: Human3.6M and MPI-INF-3DHP. Both quantitative and qualitative results show that our method achieves top performance. Also, ablation studies show that directed graphs can better exploit the hierarchy of articulated human skeletons than undirected graphs, and the conditional connections can yield adaptive graph topologies for different poses.
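
The difference between the undirected and the directed skeleton representation can be seen in a small adjacency-matrix sketch. The five-joint skeleton, the joint names, and the helper functions below are hypothetical and chosen only for illustration; they are not the joint layout used in Human3.6M or MPI-INF-3DHP, and the sketch does not include the conditional graph convolution itself.

```python
# Hypothetical toy skeleton: each joint maps to its parent (None marks the root).
PARENT = {
    "pelvis": None,
    "spine": "pelvis",
    "head": "spine",
    "l_hip": "pelvis",
    "l_knee": "l_hip",
}

def directed_edges(parent):
    """Bones as (parent, child) pairs, directed from parent joint to child joint."""
    return [(p, c) for c, p in parent.items() if p is not None]

def adjacency(parent, directed=True):
    """Adjacency matrix over the joints; symmetric when directed is False."""
    joints = list(parent)
    index = {j: i for i, j in enumerate(joints)}
    adj = [[0] * len(joints) for _ in joints]
    for p, c in directed_edges(parent):
        adj[index[p]][index[c]] = 1
        if not directed:
            adj[index[c]][index[p]] = 1
    return joints, adj

def depth(parent, joint):
    """Number of bones between a joint and the root, read off the edge directions."""
    d = 0
    while parent[joint] is not None:
        joint = parent[joint]
        d += 1
    return d

joints, directed_adj = adjacency(PARENT, directed=True)
_, undirected_adj = adjacency(PARENT, directed=False)
```

The directed matrix is asymmetric, so the parent-child order of each bone is recoverable from it, whereas the symmetric undirected matrix loses that hierarchy.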

Abstract:
This paper investigates a valuable setting called few-shot unsupervised domain adaptation (FS-UDA), which has not been sufficiently studied in the literature. In this setting, the source domain data are labelled, but with only a few shots per category, while the target domain data are unlabelled. To address the FS-UDA setting, we develop a general UDA model to solve the following two key issues: the few-shot labeled data per category and the domain adaptation between support and query sets. Our model is general in that once trained it can be applied to various FS-UDA tasks from the same source and target domains. Inspired by the recent local descriptor based few-shot learning (FSL), our general UDA model is fully built upon local descriptors (LDs) for image classification and domain adaptation. By proposing a novel concept called similarity patterns (SPs), our model not only effectively considers the spatial relationship of LDs that was ignored in previous FSL methods, but also makes the learned image similarity better serve the required domain alignment. Specifically, we propose a novel IMage-to-class sparse Similarity Encoding (IMSE) method. It learns SPs to extract the local discriminative information for classification and meanwhile aligns the covariance matrix of the SPs for domain adaptation. Also, domain adversarial training and multi-scale local feature matching are performed upon LDs. Extensive experiments conducted on a multi-domain benchmark dataset, DomainNet, demonstrate the state-of-the-art performance of our IMSE for the novel setting of FS-UDA. In addition, for FSL, our IMSE can also show better performance than most recent FSL methods on miniImageNet.

Abstract:
One of the main challenges in facial expression recognition (FER) is to address the disturbance caused by various disturbing factors, including common ones (such as identity, pose, and illumination) and potential ones (such as hairstyle, accessory, and occlusion). Recently, a number of FER methods have been developed to explicitly or implicitly alleviate the disturbance involved in facial images. However, these methods either consider only a few common disturbing factors or neglect the prior information of these disturbing factors, thus resulting in inferior recognition performance. In this paper, we propose a novel Dual-branch Disturbance Disentangling Network (D3Net), mainly consisting of an expression branch and a disturbance branch, to perform effective FER. In the disturbance branch, a label-aware sub-branch (LAS) and a label-free sub-branch (LFS) are elaborately designed to cope with different types of disturbing factors. On the one hand, LAS explicitly captures the disturbance due to some common disturbing factors by transfer learning on a pretrained model. On the other hand, LFS implicitly encodes the information of potential disturbing factors in an unsupervised manner. In particular, we introduce an Indian buffet process (IBP) prior to model the distribution of potential disturbing factors in LFS. Moreover, we leverage adversarial training to increase the differences between disturbance features and expression features, thereby enhancing the disentanglement of disturbing factors. By disentangling the disturbance from facial images, we are able to extract discriminative expression features. Extensive experiments demonstrate that our proposed method performs favorably against several state-of-the-art FER methods on both in-the-lab and in-the-wild databases.

Abstract:
With the ubiquity of sensor-equipped smartphones, it is common to have multimedia documents uploaded to the Internet that have GPS coordinates associated with them. Utilizing such geotags as an additional feature is intuitively appealing for improving the performance of location-aware applications. However, raw GPS coordinates are fine-grained location indicators without any semantic information. Existing methods on geotag semantic encoding mostly extract hand-crafted, application-specific location representations that heavily depend on large-scale supplementary data and thus cannot perform efficiently on mobile devices. In this paper, we present a machine learning based approach, termed GPS2Vec+, which learns rich location representations by capitalizing on the world-wide geotagged images. Once trained, the model has no dependence on the auxiliary data anymore so it encodes geotags highly efficiently by inference. We extract visual and semantic knowledge from image content and user-generated tags, and transfer the information into locations by using geotagged images as a bridge. To adapt to different application domains, we further present an attention-based fusion framework that estimates the importance of the learnt location representations under different contexts for effective feature fusion. Our location representations yield significant performance improvements over the state-of-the-art geotag encoding methods on image classification and venue annotation.

Abstract:
Image manipulation with StyleGAN has attracted increasing attention in recent years. Recent works have achieved tremendous success in analyzing several semantic latent spaces to edit the attributes of the generated images. However, due to the limited semantic and spatial manipulation precision in these latent spaces, existing methods fall short in fine-grained StyleGAN image manipulation, i.e., local attribute translation. To address this issue, we discover attribute-specific control units, which consist of multiple channels of feature maps and modulation styles. Specifically, we collaboratively manipulate the modulation style channels and feature maps in control units rather than individual ones to obtain the semantic and spatial disentangled controls. Furthermore, we propose a simple yet effective method to detect the attribute-specific control units. We move the modulation style along a specific sparse direction vector and replace the filter-wise styles used to compute the feature maps to manipulate these control units. We evaluate our proposed method in various face attribute manipulation tasks. Extensive qualitative and quantitative results demonstrate that our proposed method performs favorably against the state-of-the-art methods. The manipulation results of real images further show the effectiveness of our method.

Abstract:
Multimodal emotion recognition has long been a popular topic in affective computing since it significantly enhances the performance compared with that of a single modality. Among all, the combination of electroencephalography (EEG) and eye movement signals is one of the most attractive practices due to their complementarity and objectivity. However, the high cost and inconvenience of EEG signal acquisition severely hamper the popularization of multimodal emotion recognition in practical scenarios, while eye movement signals are much easier to acquire. To increase the feasibility and the generalization ability of emotion decoding without compromising the performance, we propose a generative adversarial network-based framework. In our model, a single modality of eye movements is used as input and it is capable of mapping the information onto multimodal features. Experimental results on SEED series datasets with different emotion categories demonstrate that our model with multimodal features generated by the single eye movement modality maintains competitive accuracies compared to those with multimodality input and drastically outperforms those single-modal emotion classifiers. This illustrates that the model has the potential to reduce the dependence on multimodalities without sacrificing performance which makes emotion recognition more applicable and practicable.

Abstract:
Recognizing human emotions from videos has attracted significant attention in numerous computer vision and multimedia applications, such as human-computer interaction and health care. It aims to understand the emotional response of humans, where candidate emotion categories are generally defined by specific psychological theories. However, with the development of psychological theories, emotion categories become increasingly diverse and fine-grained, and samples are also increasingly difficult to collect. In this paper, we investigate a new task of zero-shot video emotion recognition, which aims to recognize rare unseen emotions. Specifically, we propose a novel multimodal protagonist-aware transformer network, which is composed of two branches: one is equipped with a novel dynamic emotional attention mechanism and a visual transformer to learn better visual representations; the other is an acoustic transformer for learning discriminative acoustic representations. We align the visual and acoustic representations with semantic embeddings of fine-grained emotion labels by jointly mapping them into a common space under a noise contrastive estimation objective. Extensive experimental results on three datasets demonstrate the effectiveness of the proposed method.

Abstract:
Virtual try-on technology enables users to try various fashion items using augmented reality and provides a convenient online shopping experience. However, most previous works focus on virtual try-on for clothes while neglecting that for shoes, which is also a promising task. To address this gap, this work proposes a real-time augmented reality virtual shoe try-on system for smartphones, namely ARShoe. Specifically, ARShoe adopts a novel multi-branch network to realize pose estimation and segmentation simultaneously. A solution to generate realistic 3D shoe model occlusion during the try-on process is presented. To achieve a smooth and stable try-on effect, this work further develops a novel stabilization method. Moreover, for training and evaluation, we construct the very first large-scale foot benchmark annotated with multiple labels related to the virtual shoe try-on task. Exhaustive experiments on our newly constructed benchmark demonstrate the satisfying performance of ARShoe. Practical tests on common smartphones validate the real-time performance and stabilization of the proposed approach.

Abstract:
We study the joint learning of image-to-text and text-to-image generation, which are naturally bi-directional tasks. Typical existing works design two separate task-specific models, one for each task, which imposes expensive design efforts. In this work, we propose a unified image-and-text generative framework based on a single multimodal model to jointly study the bi-directional tasks. We adopt Transformer as our unified architecture for its strong performance and task-agnostic design. Specifically, we formulate both tasks as sequence generation tasks, where we represent images and text as unified sequences of tokens, and the Transformer learns multimodal interactions to generate sequences. We further propose two-level granularity feature representations and sequence-level training to improve the Transformer-based unified framework. Experiments show that our approach significantly improves the FID of the previous Transformer-based model X-LXMERT from 37.0 to 29.9 (lower is better) for text-to-image generation, and improves the CIDEr-D score from 100.9% to 122.6% for fine-tuned image-to-text generation on the MS-COCO dataset. Our code is available online.

Abstract:
Media is evolving from traditional linear narratives to personalised experiences, where control over information (or how it is presented) is given to individual audience members. Measuring and understanding audience engagement with this media is important in at least two ways: (1) a post-hoc understanding of how engaged audiences are with the content will help production teams learn from experience and improve future productions; (2) this type of media has potential for real-time measures of engagement to be used to enhance the user experience by adapting content on-the-fly. Engagement is typically measured by asking samples of users to self-report, which is time-consuming and expensive. In some domains, however, interaction data have been used to infer engagement. Fortuitously, the nature of interactive media facilitates a much richer set of interaction data than traditional media; our research aims to understand if these data can be used to infer audience engagement. In this paper, we report a study using data captured from audience interactions with an interactive TV show to model and predict engagement. We find that temporal metrics, including overall time spent on the experience and the interval between events, are predictive of engagement. The results demonstrate that interaction data can be used to infer users' engagement during and after an experience, and the proposed techniques are relevant to better understanding audience preferences and responses.

Abstract:
Referring Image Segmentation (RIS) aims at segmenting the target object in an image referred to by a given natural language expression. The diverse and flexible expressions and the complex visual contents of the images place higher demands on RIS models to investigate fine-grained matching behaviors between words in expressions and objects presented in images. However, such matching behaviors are hard to learn and capture when the visual cues of referents (i.e., referred objects) are insufficient, as referents with weak visual cues tend to be easily confused with cluttered background at their boundaries or even overwhelmed by salient objects in the image. Moreover, the issue of insufficient visual cues cannot be handled by the cross-modal fusion mechanisms used in previous work. In this paper, we tackle this problem from a novel perspective of enhancing the visual information for the referents by devising a Two-stage Visual cues enhancement Network (TV-Net), where a novel Retrieval and Enrichment Scheme (RES) and an Adaptive Multi-resolution feature Fusion (AMF) module are proposed. Specifically, RES retrieves the most relevant image from an external data pool with regard to both the visual and textual similarities, and then enriches the visual information of the referent with the retrieved image for better multimodal feature learning. AMF further enhances the detailed visual information by incorporating the high-resolution feature maps from lower convolution layers of the image. Through the two-stage enhancement, our proposed TV-Net performs better in learning fine-grained matching behaviors between the natural language expression and the image, especially when the visual information of the referent is inadequate, and thus produces better segmentation results. Extensive experiments are conducted to validate the effectiveness of the proposed method on the RIS task, with our proposed TV-Net surpassing the state-of-the-art approaches on four benchmark datasets.

Abstract:
Lip reading, aiming to recognize spoken sentences according to the given video of lip movements without relying on the audio stream, has attracted great interest due to its application in many scenarios. Although prior works that explore lip reading have obtained salient achievements, they are all trained in a non-simultaneous manner where generating predictions requires access to the full video. To break through this constraint, we study the task of simultaneous lip reading and devise SimulLR, a simultaneous lip Reading transducer with attention-guided adaptive memory, from three aspects: (1) To address the challenge of monotonic alignments while considering the syntactic structure of the generated sentences under the simultaneous setting, we build a transducer-based model and design several effective training strategies including CTC pre-training, model warm-up and curriculum learning to promote the training of the lip reading transducer. (2) To learn better spatio-temporal representations for the simultaneous encoder, we construct a truncated 3D convolution and time-restricted self-attention layer to perform the frame-to-frame interaction within a video segment containing a fixed number of frames. (3) The history information is always limited due to the storage in real-time scenarios, especially for massive video data. Therefore, we devise a novel attention-guided adaptive memory to organize semantic information of history segments and enhance the visual representations with acceptable computation-aware latency. The experiments show that SimulLR achieves a translation speedup of 9.10x compared with the state-of-the-art non-simultaneous methods, and also obtains competitive results, which indicates the effectiveness of our proposed methods.

Abstract:
Video moment retrieval aims to localize the most relevant video moment given the text query. Weakly supervised approaches leverage video-text pairs only for training, without temporal annotations. Most current methods align the proposed video moment and the text in a joint embedding space. However, without temporal annotations, the semantic gap between these two modalities leads most methods to focus predominantly on learning the joint feature representation, with less emphasis on learning the visual feature representation. This paper aims to improve the visual feature representation with supervision in the visual domain, obtaining discriminative visual features for cross-modal learning. We observe that relevant video moments (i.e., those sharing similar activities) from different videos are commonly described by similar sentences; hence, the visual features of these relevant video moments should also be similar even though they come from different videos. Therefore, to obtain more discriminative and robust visual features for video moment retrieval, we propose to align the visual features of relevant video moments from different videos that co-occur in the same training batch. Besides, a contrastive learning approach is introduced for learning the moment-level alignment of these videos. Through extensive experiments, we demonstrate that the proposed visual co-occurrence alignment learning method outperforms the cross-modal alignment learning counterpart and achieves promising results for video moment retrieval.

Abstract:
As a challenging task, unsupervised person ReID aims to match images of the same identity to query images without requiring any labeled information. In general, most existing approaches focus on the visual cues only, leaving potentially valuable auxiliary metadata information (e.g., spatio-temporal context) unexplored. In the real world, such metadata is normally available alongside captured images, and thus plays an important role in separating several hard ReID matches. With this motivation in mind, we propose MGH, a novel unsupervised person ReID approach that uses meta information to construct a hypergraph for feature learning and label refinement. In principle, the hypergraph is composed of camera-topology-aware hyperedges, which can model the heterogeneous data correlations across cameras. Taking advantage of label propagation on the hypergraph, the proposed approach is able to effectively refine the ReID results, such as correcting the wrong labels or smoothing the noisy labels. Given the refined results, we further present a memory-based listwise loss to directly optimize the average precision in an approximate manner. Extensive experiments on three benchmarks demonstrate the effectiveness of the proposed approach against the state-of-the-art.

Abstract:
Given input images, scene graph generation (SGG) aims to produce comprehensive, graphical representations describing visual relationships among salient objects. Recently, more effort has been devoted to the long tail problem in SGG; however, the imbalance in the fraction of missing labels of different classes, or reporting bias, which exacerbates the long tail, is rarely considered and cannot be solved by the existing debiasing methods. In this paper we show that, due to the missing labels, SGG can be viewed as a "Learning from Positive and Unlabeled data" (PU learning) problem, where the reporting bias can be removed by recovering the unbiased probabilities from the biased ones by utilizing label frequencies, i.e., the per-class fraction of labeled, positive examples in all the positive examples. To obtain accurate label frequency estimates, we propose Dynamic Label Frequency Estimation (DLFE), which takes advantage of training-time data augmentation and averages over multiple training iterations to introduce more valid examples. Extensive experiments show that DLFE is more effective in estimating label frequencies than a naive variant of the traditional estimate, and DLFE significantly alleviates the long tail and achieves state-of-the-art debiasing performance on the VG dataset. We also show qualitatively that SGG models with DLFE produce markedly more balanced and unbiased scene graphs. The source code is publicly available.
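The label-frequency correction described in this abstract can be illustrated with a minimal sketch. It is not the authors' DLFE implementation: the function names, the mean-based frequency estimate (in the spirit of the classical PU-learning estimator), and the final renormalization step are all assumptions made for illustration.

```python
def estimate_label_frequency(biased_probs_on_labeled):
    """Hypothetical estimator: average the biased probability that a model
    assigns to a class over examples that are labeled with that class."""
    if not biased_probs_on_labeled:
        raise ValueError("need at least one labeled example")
    return sum(biased_probs_on_labeled) / len(biased_probs_on_labeled)


def debias_probabilities(biased_probs, label_freqs):
    """Divide each class's biased probability by its label frequency.

    Renormalizing the result so it sums to 1 is an assumption of this
    sketch, not something stated in the abstract.
    """
    if len(biased_probs) != len(label_freqs):
        raise ValueError("one label frequency is required per class")
    scaled = [p / c for p, c in zip(biased_probs, label_freqs)]
    total = sum(scaled)
    return [s / total for s in scaled]
```

Under these assumptions, a class that is labeled only rarely (low label frequency) has its probability scaled up more strongly than a frequently labeled class, which is the intended effect of removing reporting bias.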

Abstract:
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video with video-level categorical supervision. Previous works directly use the appearance and motion features extracted from a pre-trained feature encoder, e.g., via feature concatenation or score-level fusion. In this work, we argue that pre-trained extractors, e.g., I3D, are trained for trimmed video action classification rather than specifically for the WS-TAL task, so the features they extract carry inevitable redundancy and lead to sub-optimization. Therefore, feature re-calibration is needed to reduce the task-irrelevant information redundancy. Here, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we mainly introduce two identical cross-modal consensus modules (CCM) that use a cross-modal attention mechanism to filter out the task-irrelevant information redundancy using the global information from the main modality and the cross-modal local information from the auxiliary modality. Moreover, we further explore inter-modality consistency, where we treat the attention weights derived from each CCM as the pseudo targets of the attention weights derived from the other CCM to maintain the consistency between the predictions derived from the two CCMs, forming a mutual learning scheme. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, to verify our method, on which we achieve state-of-the-art results. The experimental results show that our proposed cross-modal consensus module can produce more representative features for temporal action localization.

Abstract:
In this paper, we propose a novel data augmentation strategy named Cut-Thumbnail that aims to improve the shape bias of the network. We reduce an image to a certain size and replace a random region of the original image with the reduced image. The generated image not only retains most of the original image information but also carries global information in the reduced image. We call the reduced image a thumbnail. Furthermore, we find that the idea of the thumbnail integrates well with Mixed Sample Data Augmentation, so we put one image's thumbnail on another image while also mixing the ground truth labels, which yields strong results on various computer vision tasks. Extensive experiments show that Cut-Thumbnail works better than state-of-the-art augmentation strategies across classification, fine-grained image classification, and object detection. On ImageNet classification, the ResNet-50 architecture with our method achieves 79.21% accuracy, an improvement of more than 2.8% over the baseline.
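The basic operation described above (shrink the image, then paste the shrunken copy over a random region of the original) can be sketched as follows. This is a toy version under stated assumptions: images are plain 2D lists, the thumbnail is square, and downsampling uses nearest-neighbour sampling; the function names are hypothetical and the abstract does not specify the resizing method.

```python
import random


def make_thumbnail(image, size):
    """Nearest-neighbour downsample of a 2D grid (list of rows) to size x size."""
    h, w = len(image), len(image[0])
    return [[image[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def cut_thumbnail(image, size, rng=random):
    """Paste a size x size thumbnail of `image` over a random region of a copy.

    Returns the augmented copy and the (top, left) corner of the pasted region.
    """
    h, w = len(image), len(image[0])
    thumb = make_thumbnail(image, size)
    top = rng.randint(0, h - size)    # randint is inclusive on both ends
    left = rng.randint(0, w - size)
    out = [row[:] for row in image]   # copy so the input stays untouched
    for r in range(size):
        out[top + r][left:left + size] = thumb[r]
    return out, (top, left)
```

For the mixed-sample variant mentioned in the abstract, the thumbnail would be taken from a second image and the labels mixed as well; the mixing weights are not given here, so they are left out of the sketch.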

Abstract:
This paper proposes a neural network that performs audio transformations to user-specified sources (e.g., vocals) of a given audio track according to a given description while preserving other sources not mentioned in the description. Audio Manipulation on a Specific Source (AMSS) is challenging because a sound object (i.e., a waveform sample or frequency bin) is 'transparent'; it usually carries information from multiple sources, in contrast to a pixel in an image. To address this challenging problem, we propose AMSS-Net, which extracts latent sources and selectively manipulates them while preserving irrelevant sources. We also propose an evaluation benchmark for several AMSS tasks, and we show that AMSS-Net outperforms baselines on several AMSS tasks via objective metrics and empirical verification.

Abstract:
Recent advances in deep learning bring impressive performance for multimedia applications. Hence, compressing and deploying these applications on resource-limited edge devices via model compression becomes attractive. Knowledge distillation (KD) is one of the most popular model compression techniques. However, most well-behaved KD approaches require the original dataset, which is usually unavailable due to privacy issues, while existing data-free KD methods perform much worse than data-required counterparts. In this paper, we analyze previous data-free KD methods from the data perspective and point out that using a single pre-trained model limits the performance of these approaches. We then propose a Data-Free Ensemble knowledge Distillation (DFED) framework, which contains a student network, a generator network, and multiple pre-trained teacher networks. During training, the student mimics behaviors of the ensemble of teachers using samples synthesized by a generator, which aims to enlarge the prediction discrepancy between the student and teachers. A moment matching loss term assists the generator training by minimizing the distance between activations of synthesized samples and real samples. We evaluate DFED on three popular image classification datasets. Results demonstrate that our method achieves significant performance improvements compared with previous works. We also design an ablation study to verify the effectiveness of each component of the proposed framework.
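The student/teacher-ensemble objective outlined above can be made concrete with a small sketch. The equal-weight averaging of teacher predictions and the L1 distance are assumptions chosen for illustration, as are all function names; the abstract does not state how the ensemble is aggregated or which discrepancy measure is used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_target(teacher_logits):
    """Average the softmax outputs of several teachers (assumed equal weights)."""
    probs = [softmax(logits) for logits in teacher_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]


def discrepancy(student_logits, teacher_logits):
    """L1 distance between the student prediction and the ensemble target.

    In the adversarial setup sketched in the abstract, the student would
    minimize this quantity while the generator tries to enlarge it.
    """
    target = ensemble_target(teacher_logits)
    student = softmax(student_logits)
    return sum(abs(s - t) for s, t in zip(student, target))
```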

Abstract:
Including olfactory cues in virtual reality (VR) would enhance user immersion in the virtual environment, and precise control of smell would facilitate a more realistic experience for users. In this paper, we present aBio, an active bi-olfactory display system that delivers scents precisely to specific locations rather than diffusing scented air into the atmosphere. aBio provides users with a natural olfactory experience in free air by colliding two vortex rings launched from dual speaker-based vortex generators, which also has the effect of cushioning the force of air impact. According to the various requests of different applications, the collision point of the vortex rings can be positioned anywhere in front of the user's nose. To verify the effectiveness of our device and understand user sensations when using different parameters in our system, we conduct a series of experiments and user studies. The results show that the proposed system is effective in the sense that users perceive smell without sensible haptic disturbance while the system consumes only a very small amount of fragrant essential oil. We believe that aBio has great potential for increasing the level of presence in VR by delivering smells with high efficiency.

Abstract:
No-reference video quality assessment (NR-VQA) has not widely benefited from deep learning, mainly due to the complexity, diversity and particularity of modelling spatial and temporal characteristics in quality assessment scenarios. Image quality assessment (IQA) performed on video frames plays a key role in NR-VQA. A perceptual hierarchical network (PHIQNet) with an integrated attention module is first proposed that can appropriately simulate the visual mechanisms of contrast sensitivity and selective attention in IQA. Subsequently, perceptual quality features of video frames derived from PHIQNet are fed into a long short-term convolutional Transformer (LSCT) architecture to predict the perceived video quality. LSCT consists of a CNN that formulates quality features in video frames within short-term units, which are then fed into a Transformer to capture the long-range dependence and attention allocation over temporal units. Such an architecture is in line with the intrinsic properties of VQA. Experimental results on publicly available video quality databases demonstrate that the LSCT architecture based on PHIQNet significantly outperforms state-of-the-art video quality models.

Abstract:
In vision-language retrieval systems, users provide natural language feedback to find target images. Vision-language explanations in the systems can better guide users to provide feedback and thus improve the retrieval. However, developing explainable vision-language retrieval systems can be challenging, due to limited labeled multimodal data. In the retrieval of complex scenes, the issue of limited labeled data can be more severe. With multiple objects in the complex scenes, each user query may not exhaustively describe all objects in the desired image and thus more labeled queries are needed. The issue of limited labeled data can cause data selection biases, and result in spurious correlations learned by the models. When learning spurious correlations, existing explainable models may not be able to accurately extract regions from images and keywords from user queries.

Abstract:
Modern deep neural network models tend to be large and computationally intensive. One typical solution to this issue is model pruning. However, most current model pruning algorithms depend on hand-crafted rules or require the pruning ratio to be specified beforehand. To overcome this problem, we propose a learning-based automatic channel pruning algorithm for deep neural networks, which is inspired by recent automatic machine learning (AutoML). A two-objective pruning problem that targets the weights and the remaining channels for each layer is first formulated. An alternative optimization approach is then proposed to derive the channel numbers and weights simultaneously. In the process of pruning, we utilize a searchable hyper-parameter, the remaining ratio, to denote the number of channels in each convolution layer, and then a dynamic masking process is proposed to describe the corresponding channel evolution. To adjust the trade-off between the accuracy of a model and the pruning ratio of floating point operations, a new loss function is further introduced. Extensive experimental results on benchmark datasets demonstrate that our scheme achieves competitive results for neural network pruning.
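To show how a per-layer remaining ratio can translate into a channel mask, here is a minimal sketch. It is an assumption-laden illustration rather than the paper's dynamic masking process: channels are ranked by the L1 norm of their weights (a common pruning heuristic that the abstract does not name), and `channel_mask` is a hypothetical helper.

```python
def channel_mask(channel_weights, remaining_ratio):
    """Build a 0/1 mask that keeps the strongest channels of one layer.

    channel_weights: one list of weights per output channel.
    remaining_ratio: fraction of channels to keep, in (0, 1].
    Ranking by L1 norm is an assumption of this sketch.
    """
    n = len(channel_weights)
    keep = max(1, round(remaining_ratio * n))   # always keep at least one channel
    scores = [sum(abs(w) for w in ch) for ch in channel_weights]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [1 if i in kept else 0 for i in range(n)]
```

Multiplying each channel's output by its mask entry emulates pruning during training without changing tensor shapes; the masked channels can be physically removed afterwards.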

Abstract:
Existing Few Shot Segmentation (FS-Seg) methods mostly study a restricted setting where only foreground and background need to be discriminated, and they fall short in discriminating multiple classes. In this paper, we focus on a challenging but more practical variant: Generalized Few Shot Segmentation (GFS-Seg), where all SEEN and UNSEEN classes are segmented simultaneously. Previous methods treat the background as a regular class, leading to difficulty in differentiating UNSEEN classes from it at the test stage. To address this issue, we propose Adaptive Background Modeling and Prototype Query Network (ABPNet), in which the background is formulated as the complement of the set of classes of interest. With the help of the attention mechanism and a novel meta-training strategy, it learns an effective set difference function that predicts the task-specific background adaptively. Furthermore, we design a Prototype Querying (PQ) module that effectively transfers the learned knowledge to UNSEEN classes with a neural dictionary. Experimental results demonstrate that ABPNet significantly outperforms the state-of-the-art method CAPL on PASCAL-5i and COCO-20i, especially on UNSEEN classes. Also, without retraining, ABPNet can generalize well to FS-Seg.

Abstract:
Works on scene text visual question answering (TextVQA) always emphasize the importance of reasoning over questions and image contents. However, we find that current TextVQA models lack reasoning ability and tend to answer questions by exploiting dataset bias and language priors. Moreover, our observations indicate that recent accuracy improvements in TextVQA are mainly due to stronger OCR engines, better pre-training strategies and more Transformer layers, rather than newly proposed networks. In this work, to investigate reasoning ability, we 1) conduct a module-wise contribution analysis to quantitatively investigate how existing works improve accuracies in TextVQA; 2) design a gradient-based explainability method to explore why TextVQA models answer what they answer and find evidence for their predictions; 3) perform qualitative experiments to visually analyze models' reasoning ability and explore potential reasons behind such poor ability.

Abstract:
The recently proposed fine-grained 3D visual grounding is an essential and challenging task, whose goal is to identify the 3D object referred to by a natural language sentence among other distracting objects of the same category. Existing works usually adopt dynamic graph networks to indirectly model the intra/inter-modal interactions, making it difficult for the model to distinguish the referred object from distractors due to the monolithic representations of visual and linguistic contents. In this work, we exploit Transformer for its natural suitability for permutation-invariant 3D point cloud data and propose a TransRefer3D network to extract entity-and-relation aware multimodal context among objects for more discriminative feature learning. Concretely, we devise an Entity-aware Attention (EA) module and a Relation-aware Attention (RA) module to conduct fine-grained cross-modal feature matching. Facilitated by the co-attention operation, our EA module matches visual entity features with linguistic entity features, while the RA module matches pair-wise visual relation features with linguistic relation features. We further integrate EA and RA modules into an Entity-and-Relation aware Contextual Block (ERCB) and stack several ERCBs to form our TransRefer3D for hierarchical multimodal context modeling. Extensive experiments on both Nr3D and Sr3D datasets demonstrate that our proposed model significantly outperforms existing approaches by up to 10.6% and achieves new state-of-the-art performance. To the best of our knowledge, this is the first work investigating the Transformer architecture for the fine-grained 3D visual grounding task.

Abstract:
Reconstructing a 3D object from a single image (RGB or depth) is a fundamental problem in visual scene understanding and yet remains challenging due to its ill-posed nature and the complexity of real-world scenes. To address those challenges, we adopt a primitive-based representation for 3D objects, and propose a two-stage graph network for primitive-based 3D object estimation, which consists of a sequential proposal module and a graph reasoning module. Given a 2D image, our proposal module first generates a sequence of 3D primitives from the input image with local feature attention. Then the graph reasoning module performs joint reasoning on a primitive graph to capture the global shape context for each primitive. Such a framework is capable of taking into account rich geometry and semantic constraints during 3D structure recovery, producing 3D objects with more coherent structure even under challenging viewing conditions. We train the entire graph neural network with a stage-wise strategy and evaluate it on three benchmarks: Pix3D, ModelNet and NYU Depth V2. Extensive experiments show that our approach outperforms the previous state of the art by a considerable margin.

Abstract:
Emotion recognition based on EEG (electroencephalography) has been widely used in human-computer interaction, distance education and health care. However, the conventional methods ignore the adjacent and symmetrical characteristics of EEG signals, which also contain salient information related to emotion. In this paper, a spatial folding ensemble network (SFE-Net) is presented for EEG feature extraction and emotion recognition. Firstly, for the undetected area between EEG electrodes, an improved Bicubic-EEG interpolation algorithm is developed for EEG channels information completion, which allows us to extract a wider range of adjacent space features. Then, motivated by the spatial symmetric mechanism of human brain, we fold the input EEG channels data with five different symmetrical strategies, which enable the proposed network to extract the information of space features of EEG signals more effectively. Finally, a 3DCNN-based spatial, temporal extraction, and a multi-voting strategy of ensemble learning are integrated to model a new neural network. With this network, the spatial features of different symmetric folding signals can be extracted simultaneously, which greatly improves the robustness and accuracy of emotion recognition. The experimental results on DEAP and SEED datasets show that the proposed algorithm has comparable performance in terms of recognition accuracy.

Abstract:
Deep learning based inpainting methods have obtained promising performance for image restoration; however, current image inpainting methods still tend to produce unreasonable structures and blurry textures when processing heavily corrupted images. In this paper, we propose a new image inpainting method termed Global Context Modeling Network (GCM-Net). By capturing the global contextual information, GCM-Net can potentially improve the performance of recovering the missing region in the damaged images with irregular masks. To be specific, we first use four convolution layers to extract the shadow features. Then, we design a progressive multi-scale fusion block termed PMSFB to extract and fuse the multi-scale features for obtaining local features. Besides, a dense context extraction (DCE) module is also designed to aggregate the local features extracted by PMSFBs. To improve the information flow, a channel attention guided residual learning module is deployed in both the DCE and PMSFB, which can reweight the learned residual features and refine the extracted information. To capture more global contextual information and enhance the representation ability, a coordinate context attention (CCA) based module is also presented. Finally, the extracted features with rich information are decoded as the image inpainting result. Extensive results on the Paris Street View, Places2 and CelebA-HQ datasets demonstrate that our method can better recover the structures and textures, and deliver significant improvements, compared with some related inpainting methods.

Abstract:
Pre-trained neural models have recently achieved impressive performance in understanding multimodal content. However, it is still very challenging to pre-train neural models for video and language understanding, especially for Chinese video-language data, for the following reasons. Firstly, existing video-language pre-training algorithms mainly focus on the co-occurrence of words and video frames, but ignore other valuable semantic and structural information of video-language content, e.g., sequential order and spatiotemporal relationships. Secondly, there exist conflicts between video sentence alignment and other proxy tasks. Thirdly, there is a lack of large-scale and high-quality Chinese video-language datasets (e.g., including 10 million unique videos), which are a fundamental condition for the success of pre-training techniques. In this work, we propose a novel video-language understanding framework named Victor, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training. Besides general proxy tasks such as masked language modeling, Victor constructs several novel proxy tasks under the contrastive learning paradigm, making the model more robust and able to capture more complex multimodal semantic and structural relationships from different perspectives. Victor is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions. We apply the pre-trained Victor model to a series of downstream applications and demonstrate its superior performance compared with state-of-the-art pre-training methods such as VideoBERT and UniVL.

Abstract:
Single image dehazing is a crucial and preliminary task for many computer vision applications, making progress with deep learning. The dehazing task is an ill-posed problem since the haze in the image leads to the loss of information. Thus, there are multiple feasible solutions for image restoration of a hazy image. Most existing methods learn a deterministic one-to-one mapping between a hazy image and its ground-truth, which ignores the ill-posedness of the dehazing task. To solve this problem, we propose DehazeFlow, a novel single image dehazing framework based on conditional normalizing flow. Our method learns the conditional distribution of haze-free images given a hazy image, enabling the model to sample multiple dehazed results. Furthermore, we propose an attention-based coupling layer to enhance the expression ability of a single flow step, which converts natural images into latent space and fuses features of paired data. These designs enable our model to achieve state-of-the-art performance while considering the ill-posedness of the task. We carry out sufficient experiments on both synthetic datasets and real-world hazy images to illustrate the effectiveness of our method. The extensive experiments indicate that DehazeFlow surpasses the state-of-the-art methods in terms of PSNR, SSIM, LPIPS, and subjective visual effects.
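As background for the flow-based formulation above, the sketch below shows a standard conditional affine coupling step and its exact inverse, which is the property that lets a normalizing flow both evaluate likelihoods and sample multiple outputs. It is not DehazeFlow's attention-based coupling layer; `scale_fn` and `shift_fn` are hypothetical stand-ins for learned networks that see the untouched half of the input and the conditioning signal (for example, hazy-image features).

```python
import math


def coupling_forward(x, cond, scale_fn, shift_fn):
    """Transform the second half of x, conditioned on the first half and cond.

    Returns the transformed vector and the log-determinant of the Jacobian.
    """
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s = scale_fn(x1, cond)            # one log-scale per element of x2
    t = shift_fn(x1, cond)            # one shift per element of x2
    y2 = [v * math.exp(si) + ti for v, si, ti in zip(x2, s, t)]
    return x1 + y2, sum(s)


def coupling_inverse(y, cond, scale_fn, shift_fn):
    """Exactly undo coupling_forward using the same scale and shift functions."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s = scale_fn(y1, cond)            # y1 equals x1, so s and t are reproduced
    t = shift_fn(y1, cond)
    x2 = [(v - ti) * math.exp(-si) for v, si, ti in zip(y2, s, t)]
    return y1 + x2
```

Because the first half passes through unchanged, the inverse can recompute the same scale and shift, so inversion needs no iterative solve regardless of how complex `scale_fn` and `shift_fn` are.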

Abstract:
How can we generalize to a new prediction task at test time when it also uses a new modality as input? More importantly, how can we do this with as little annotated data as possible? This problem of cross-modal generalization is a new research milestone with concrete impact on real-world applications. For example, can an AI system start understanding spoken language from mostly written text? Or can it learn the visual steps of a new recipe from only text descriptions? In this work, we formalize cross-modal generalization as a learning paradigm to train a model that can (1) quickly perform new tasks (from new domains) while (2) being originally trained on a different input modality. Such a learning paradigm is crucial for generalization to low-resource modalities such as spoken speech in rare languages while utilizing a different high-resource modality such as text. One key technical challenge that makes it different from other learning paradigms such as meta-learning and domain adaptation is the presence of different source and target modalities which will require different encoders. We propose an effective solution based on meta-alignment, a novel method to align representation spaces using strongly and weakly paired cross-modal data while ensuring quick generalization to new tasks across different modalities. This approach uses key ideas from cross-modal learning and meta-learning, and presents strong results on the cross-modal generalization problem. We benchmark several approaches on 3 real-world classification tasks: few-shot recipe classification from text to images of recipes, object classification from images to audio of objects, and language classification from text to spoken speech across 100 languages spanning many rare languages. Our results demonstrate strong performance even when the new target modality has only a few (1-10) labeled samples and in the presence of noisy labels, a scenario particularly prevalent in low-resource modalities.

Abstract:
Creative image animations are attractive in e-commerce applications, where motion transfer is one of the important ways to generate animations from static images. However, existing methods rarely transfer motion to objects other than the human body or face, and even fewer apply motion transfer in practical scenarios. In this work, we apply motion transfer to Taobao product images in a real e-commerce scenario to generate creative animations, which are more attractive than static images and can bring greater benefits. For demonstration, we animate Taobao products including dolls, copper running horses, and toy dinosaurs using a motion transfer method.

Abstract:
The Montreal Cognitive Assessment (MoCA) test is an auxiliary medical screening method for Alzheimer's disease (AD). In the traditional process, a testee is required to complete several test items on a paper questionnaire under the guidance of medical staff. This is inefficient and depends largely on the doctor's subjective judgment and level of experience. Therefore, we propose an Interactive and Intelligent AD Auxiliary Screening (IAS) system consisting of a speech-based Interactive Unit Testing Module (IUTM) and a truth-based Intelligent Analysis Module (IAM), both of which are built with deep learning techniques. Following the guidance of voice commands, the testee can complete the MoCA test independently in IUTM using only a mobile device, and the testing data is then analyzed accurately and objectively by IAM. Moreover, compared with the traditional method, the electronic system makes it easier to collect and analyze clinical data for further research. The system was deployed in the Department of Neurology, Sichuan Provincial People's Hospital in June 2021 and has been used in the clinical screening of Alzheimer's disease.

Abstract:
The distribution of royalty fees to music right holders is slow and inefficient due to the lack of automation in music recognition and music licensing processes. The challenge for an improved system is to recognise different versions of a musical work, such as remix or cover versions, leading to clear assessment and unique identification of each music work. Through our music data matching system called MDMS, we query many indexed and stored music pieces with a small part of a music piece. The system retrieves the closest stored variant of the input query by using music fingerprints of the underlying melody together with signal processing techniques. Tailored indices based on fingerprint hashes accelerate processing across a large corpus of stored music. Results are found even if the stored versions vary from the query song in terms of one or more music features (tempo, key/mode, presence of instruments/vocals, and singer), and the differences are highlighted in the output.

Abstract:
We demonstrate an end-to-end intelligent system of short-video generation for live-streaming, namely "VideoDiscovery", which aims to automatically produce batches of high-value short videos by discovering and organizing highlight content for commodity delivery. Traditionally, producing high-value short videos for live-streaming is expensive and time-consuming, and also demands experienced editing skills. To this end, we construct this system with three modules: 1) Semantic segment structuring first decodes the live stream into a series of semantic candidates including commodity, Q&A, action, multi-modal, etc. 2) A hierarchical search engine automatically searches for candidate shots that semantically match the scripts. 3) Script-aware shot assembly is formulated as a combination problem over a graph of shots, considering temporal constraints and candidate idioms. Specifically, given an input live stream, the recommended video results illustrate diverse visual-semantic content and follow script guidelines. Currently, our system has been launched online for Taobao stores, where it enables appealing videos to be generated in minutes for advertising and recommendation. The entry of our system is available at https://discovery.aliyun.com/index.

Abstract:
Meetings are a necessary part of the operations of any institution, whether they are held online or in-person. However, meeting transcription and summarization are always painful requirements since they involve tedious human effort. This drives the need for automatic meeting transcription and summarization (AMTS) systems. A successful AMTS system relies on systematic integration of multiple natural language processing (NLP) techniques, such as automatic speech recognition, speaker identification, and meeting summarization, which are traditionally developed separately and validated offline with standard datasets. In this demonstration, we provide a novel productive meeting tool named SmartMeeting, which enables users to automatically record, transcribe, summarize, and manage the information in an in-person meeting. SmartMeeting transcribes every word on the fly, enriches the transcript with speaker identification and voice separation, and extracts essential decisions and crucial insights automatically. In our demonstration, the audience can experience the great potential of the state-of-the-art NLP techniques in this real-life application.

Abstract:
Text-driven 3D avatar animation has been an essential part of virtual human techniques, with a wide range of applications in movies, digital games, and video streaming. In this work, we introduce a practical system that drives both the facial and body movements of a 3D avatar from text input. Our proposed system first converts the text input to a speech signal and simultaneously conducts text analysis to extract semantic tags. We then generate lip movements from the synthetic speech, while facial expressions and body movements are generated by jointly modeling speech and textual information, which drives our virtual 3D avatar to talk and act like a real human.

Abstract:
Film sound reproduction is the process of converting the image-form film soundtrack to wave-form movie sound. In this paper, a novel optical-imaging-based reproduction framework is proposed, with the basic idea of restoring film audio damage in the image domain. In the traditional reproduction method, the scanning light emitted by the film projector causes irreversible physical damage to the flammable film soundtrack (made of nitrate compounds). By using an optical imaging method to capture the film soundtrack, our framework avoids both this damage and the self-ignition problem. Experimental results show that our framework can double the reproduction speed while maintaining equal sound quality. In addition, the sound sampling rate can be enhanced to 162.08%.

Abstract:
Side information of items, e.g., images and text descriptions, has been shown to be effective in contributing to accurate recommendations. Inspired by the recent success of pre-training models on natural language and images, we propose a pre-training strategy to learn item representations by considering both item side information and their relationships. We relate items by common user activities, e.g., co-purchase, and construct a homogeneous item graph. This graph provides a unified view of item relations and their associated multimodal side information. We develop a novel sampling algorithm named MCNSampling to select contextual neighbors for each item. The proposed Pre-trained Multimodal Graph Transformer (PMGT) learns item representations with two objectives: 1) graph structure reconstruction, and 2) masked node feature reconstruction. Experimental results on real datasets demonstrate that the proposed PMGT model effectively exploits the multimodal side information to achieve better accuracy in downstream tasks including item recommendation and click-through ratio prediction. In addition, we report a case study of testing PMGT in an online setting with 600 thousand users.

Abstract:
Optical degradation blurs text shapes and edges, so existing scene text recognition methods have difficulty achieving desirable results on low-resolution (LR) scene text images acquired in real-world environments. This problem can be addressed by efficiently extracting sequential information to reconstruct super-resolution (SR) text images, which remains a challenging task. In this paper, we propose a Parallelly Contextual Attention Network (PCAN), which effectively learns sequence-dependent features and focuses more on the high-frequency information of the reconstruction in text images. Firstly, we explore the importance of sequence-dependent features in the horizontal and vertical directions in parallel for text SR, and then design a parallelly contextual attention block to adaptively select the key information in the text sequence that contributes to image super-resolution. Secondly, we propose a hierarchically orthogonal texture-aware attention module and an edge guidance loss function, which help to reconstruct high-frequency information in text images. Finally, we conduct extensive experiments on the TextZoom dataset, and the results can be easily incorporated into mainstream text recognition algorithms to further improve their performance in LR image recognition. Besides, our approach exhibits great robustness in defending against adversarial attacks on seven mainstream scene text recognition datasets, which means it can also improve the security of the text recognition pipeline. Compared with directly recognizing LR images, our method improves the recognition accuracy of ASTER, MORAN, and CRNN by 14.9%, 14.0%, and 20.1%, respectively. Our method outperforms eleven state-of-the-art (SOTA) SR methods in terms of boosting text recognition performance. Most importantly, it outperforms the current best text-oriented SR method, TSRN, by 3.2%, 3.7%, and 6.0% on the recognition accuracy of ASTER, MORAN, and CRNN, respectively.

Abstract:
Exponential growth in multimedia streaming traffic over the Internet motivates research and further investigation into the user's perceived quality of such services. Enhancing the quality experienced by users becomes more important when service providers compete to establish superiority by gaining more subscribers or customers. Quality of Experience (QoE) enhancement would not be possible without an authentic and accurate assessment of the streaming sessions. HTTP Adaptive Streaming (HAS) is today's prevailing technique for delivering the highest possible audio and video content quality to users. An end-to-end evaluation of QoE in HAS covers the precise measurement of the metrics that affect the perceived quality, e.g., startup delay, stall events, and delivered media quality. Improving these metrics could limit the service's scalability, which is an important factor in real-world scenarios. In this study, we will investigate the stated metrics, best practices and evaluation methods, and available techniques with the aim to (i) design and develop practical and scalable measurement tools and prototypes, (ii) provide a better understanding of current technologies and techniques (e.g., Adaptive Bitrate algorithms), (iii) conduct in-depth research on the significant metrics so that QoE improvements with scalability in mind become feasible, and finally (iv) provide a comprehensive QoE model which outperforms state-of-the-art models.

Abstract:
Anomaly detection has been a very challenging and active area of research for decades, particularly for video surveillance. However, most of the works detect predefined anomaly classes using static models. These frameworks have limited applicability for real-life surveillance where the data have concept drift. Under concept drift, the distribution of both normal and anomaly classes changes over time. An event may change its class from anomaly to normal or vice-versa. The non-adaptive frameworks do not handle this drift. Additionally, the focus has been on detecting local anomalies, such as a region of an image. In contrast, in CCTV-based monitoring, flagging unseen anomalous situations can be of greater interest. Utilizing multiple sensory information for anomaly detection has also received less attention. This extended abstract discusses these gaps and possible solutions.

Abstract:
Few-shot learning aims at rapidly adapting to novel categories with only a handful of samples at test time, which has been predominantly tackled with the idea of meta-learning. However, meta-learning approaches essentially learn across a variety of few-shot tasks and thus still require large-scale training data with fine-grained supervision to derive a generalized model, thereby involving prohibitive annotation cost. In this paper, we advance the few-shot classification paradigm towards a more challenging scenario, i.e., cross-granularity few-shot classification, where the model observes only coarse labels during training while it is expected to perform fine-grained classification during testing. This task largely relieves the annotation cost since fine-grained labeling usually requires strong domain-specific expertise. To bridge the cross-granularity gap, we approximate the fine-grained data distribution by greedily clustering each coarse class into pseudo-fine classes according to the similarity of image embeddings. We then propose a meta-embedder that jointly optimizes visual and semantic discrimination, at both the instance level and the coarse-class level, to obtain a good feature space for this coarse-to-fine pseudo-labeling process. Extensive experiments and ablation studies are conducted to demonstrate the effectiveness and robustness of our approach on three representative datasets.
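
The coarse-to-fine pseudo-labeling step can be illustrated with a minimal sketch. The abstract does not specify the clustering rule, so everything here is an assumption made for illustration: the function names, the cosine-similarity criterion, the threshold value, and the running-sum centroids are hypothetical and not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_pseudo_fine_labels(embeddings, threshold=0.9):
    """Greedily split the embeddings of ONE coarse class into pseudo-fine classes.

    Each embedding joins the most similar existing cluster if the similarity
    reaches `threshold` (an assumed hyperparameter); otherwise it starts a
    new cluster. Returns one pseudo-fine label per embedding.
    """
    centroid_sums = []  # running sum per cluster; cosine ignores the scale
    labels = []
    for emb in embeddings:
        best_id, best_sim = -1, threshold
        for cluster_id, centroid in enumerate(centroid_sums):
            sim = cosine_similarity(emb, centroid)
            if sim >= best_sim:
                best_id, best_sim = cluster_id, sim
        if best_id < 0:
            centroid_sums.append(list(emb))
            labels.append(len(centroid_sums) - 1)
        else:
            centroid_sums[best_id] = [c + e for c, e in zip(centroid_sums[best_id], emb)]
            labels.append(best_id)
    return labels
```

In this sketch the number of pseudo-fine classes is not fixed in advance; it emerges from the threshold, which is one possible design choice among several.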

Abstract:
Deep learning has made a tremendous impact on various applications in multimedia, such as media interpretation and multimodal retrieval. However, deep learning models usually require a large amount of labeled data to achieve satisfactory performance. In multimedia analysis, domain adaptation studies the problem of cross-domain knowledge transfer from a label-rich source domain to a label-scarce target domain, thus potentially alleviating the annotation requirement for deep learning models. However, we find that contemporary domain adaptation methods for cross-domain image understanding perform poorly when the source domain is noisy. Weakly Supervised Domain Adaptation (WSDA) studies the domain adaptation problem under the scenario where source data can be noisy. Prior methods on WSDA remove noisy source data and align the marginal distribution across domains without considering the fine-grained semantic structure in the embedding space, which leads to class misalignment, e.g., features of cats in the target domain might be mapped near features of dogs in the source domain. In this paper, we propose a novel method, termed Noise Tolerant Domain Adaptation (NTDA), for WSDA. Specifically, we adopt the cluster assumption and learn clusters discriminatively with class prototypes (centroids) in the embedding space. We propose to leverage the location information of the data points in the embedding space and model the location information with a Gaussian mixture model to identify noisy source data. We then design a network which incorporates the Gaussian mixture noise model as a sub-module for unsupervised noise removal, and propose a novel cluster-level adversarial adaptation method based on the Generative Adversarial Network (GAN) framework which aligns unlabeled target data with the less noisy class prototypes for mapping the semantic structure across domains. Finally, we devise a simple and effective algorithm to train the network from end to end.
We conduct extensive experiments to evaluate the effectiveness of our method on both general images and medical images from COVID-19 and e-commerce datasets. The results show that our method significantly outperforms state-of-the-art WSDA methods.

Abstract:
Image manipulation chain detection aims to identify the existence of the involved operations and also their order, playing an important role in multimedia forensics and image analysis. However, all the existing algorithms model manipulation chain detection as a classification problem, and can only detect chains containing up to two operations. Due to the exponentially increased solution space and the complex interactions among operations, how to reveal a long chain from a processed image remains a long-standing problem in the multimedia forensic community. To address this challenge, in this paper, we propose a new direction for manipulation chain detection. Different from previous works, we treat manipulation chain detection as a machine translation problem rather than a classification one, where we model the chains as the sentences of a target language, and each word serves as one possible image operation. Specifically, we first transform the manipulated image into a deep feature space, and further model the traces left by the manipulation chain as a sentence of a latent source language. Then, we propose to detect the manipulation chain by learning the mapping from the source language to the target one under a machine translation framework. Our method can detect manipulation chains consisting of up to five operations, and we obtain promising results on both short-chain detection and long-chain detection.

Abstract:
In recent years, DeepFake has become a common threat to our society, due to the remarkable progress of generative adversarial networks (GANs) in image synthesis. Unfortunately, existing studies that propose various approaches to fighting DeepFake and determining whether a facial image is real or fake are still at an early stage. Current DeepFake detection methods struggle to keep up with the rapid progress of GANs, especially in adversarial scenarios where attackers can evade detection intentionally, such as by adding perturbations to fool DNN-based detectors. While passive detection simply tells whether the image is fake or real, DeepFake provenance provides clues for tracking the sources in DeepFake forensics. Thus, the tracked fake images could be blocked immediately by administrators, preventing further spread in social networks.

Abstract:
Interest in real-time video streaming has reached a peak. To deal with the problem of packet loss and optimize users' Quality of Experience (QoE), forward error correction (FEC) has been studied and applied extensively. The performance of FEC depends on whether the future loss pattern is precisely predicted, yet previous research has not provided a robust packet loss prediction method. In this work, we propose LightFEC to make accurate and fast predictions of the packet loss pattern. By applying long short-term memory (LSTM) networks, clustering algorithms, and model compression methods, LightFEC is able to accurately predict packet loss in various network conditions without consuming too much time. Our experiments show that LightFEC outperforms other schemes in prediction accuracy, which improves the packet recovery ratio while keeping the redundancy ratio at a low level.

Abstract:
Recently, the convolutional neural network (CNN) has been the core ingredient of modern models, triggering the surge of deep learning in super-resolution (SR). Despite the great success of these CNN-based methods, which tend to be deeper and heavier, it is impracticable to directly apply them to some low-budget devices due to the superfluous computational overhead. To alleviate this problem, a novel lightweight SR network named progressive feature fusion network (PFFN) is developed to seek a better balance between performance and running efficiency. Specifically, to fully exploit the feature maps, a novel progressive attention block (PAB) is proposed as the main building block of PFFN. The proposed PAB adopts several parallel but connected paths with pixel attention, which could significantly increase the receptive field of each layer, distill useful information, and finally learn more discriminative feature representations. In PAB, a powerful dual attention module (DAM) is further incorporated to provide the channel and spatial attention mechanism in a fairly lightweight manner. Besides, we construct a concise and effective upsampling module with the help of multi-scale pixel attention, named MPAU. All of the above modules ensure the network can benefit from the attention mechanism while still being lightweight enough. Furthermore, a novel training strategy following the cosine annealing learning scheme is proposed to maximize the representation ability of the model. Comprehensive experiments show that our PFFN achieves the best performance against all existing lightweight state-of-the-art SR methods with fewer parameters, and even performs comparably to computationally expensive networks.
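
For readers unfamiliar with cosine annealing, the following is a minimal sketch of the standard single-cycle schedule. The abstract does not describe how PFFN's training strategy builds on it, so the function name and the default learning-rate bounds below are illustrative assumptions rather than the authors' settings.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=1e-3, lr_min=1e-6):
    """Learning rate at `step` for one cosine-annealing cycle.

    Decays smoothly from lr_max (step 0) to lr_min (step == total_steps).
    lr_max and lr_min are assumed values chosen only for illustration.
    """
    # Clamp so that steps outside the schedule stay at the end points.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 250, 500, 750, 1000):
        print(step, cosine_annealing_lr(step, total_steps=1000))
```

The schedule changes slowly near the start and end of training and fastest in the middle, which is the property usually cited as the reason for choosing it over step decay.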

Abstract:
Visually-aware recommendation on e-commerce platforms aims to leverage the visual information of items, in addition to the historical user-item interaction records, to predict a user's preference for these items. It is commonly observed that a user's attention to visual features does not always reflect the real preference. Although a user may click and view an item because it visually satisfies their expectations, a real purchase does not always occur because other essential features (e.g., brand, material, price) are unsatisfactory. We refer to the reason for such a visually related interaction deviating from the real preference as a visual bias. Existing visually-aware models use the visual features as a separate collaborative signal, similarly to other features, to directly predict the user's preference without considering a potential bias, which gives rise to visually biased recommendations. In this paper, we derive a causal graph to identify and analyze the visual bias of these existing methods. In this causal graph, the visual feature of an item acts as a mediator, which could introduce a spurious relationship between the user and the item. To eliminate this spurious relationship that misleads the prediction of the user's real preference, an intervention and a counterfactual inference are developed over the mediator. In particular, the Total Indirect Effect is applied for a debiased prediction during the testing phase of the model. This causal inference framework is model-agnostic, so it can be integrated into existing methods. Furthermore, we propose a debiased visually-aware recommender system, denoted as CausalRec, to effectively retain the supportive significance of the visual information and remove the visual bias. Extensive experiments are conducted on eight benchmark datasets, which show the state-of-the-art performance of CausalRec and the efficacy of debiasing.

Abstract:
This paper tackles a recently proposed Video Corpus Moment Retrieval task. This task is essential because advanced video retrieval applications should enable users to retrieve a precise moment from a large video corpus. We propose a novel CONtextual QUery-awarE Ranking (CONQUER) model for effective moment localization and ranking. CONQUER explores query context for multi-modal fusion and representation learning in two different steps. The first step derives fusion weights for the adaptive combination of multi-modal video content. The second step performs bi-directional attention to tightly couple video and query as a single joint representation for moment localization. As query context is fully engaged in video representation learning, from feature fusion to transformation, the resulting feature is user-centered and has a larger capacity in capturing multi-modal signals specific to query. We conduct studies on two datasets, TVR for closed-world TV episodes and DiDeMo for open-world user-generated videos, to investigate the potential advantages of fusing video and query online as a joint representation for moment retrieval.

Abstract:
Ghosting artifacts and missing content due to the over-/under-saturated regions caused by misalignments are generally considered the two key challenges in high dynamic range (HDR) imaging for dynamic scenes. However, previous CNN-based methods directly reconstruct the HDR image from the input low dynamic range (LDR) images, with implicit ghost removal and multi-exposure image fusion in an end-to-end network structure. In this paper, we decompose HDR imaging into ghost-free image fusion and ghost-based image restoration, and propose a novel practical Hierarchical Fusion Network (HFNet), which contains three sub-networks: Mask Fusion Network, Mask Compensation Network, and Refine Network. Specifically, LDR images are linearly fused in the Mask Fusion Network, ignoring the misaligned regions. Then the ghost regions of the fused image are restored with mask compensation. Finally, all these results are refined in the third network. This divide-and-conquer strategy makes the proposed method significantly more compact than previous methods. Experiments on different datasets show the superior performance of HFNet, with 9x fewer FLOPs, 4x fewer parameters, and 3x faster inference than existing methods while providing comparable accuracy. It also achieves state-of-the-art quantitative and qualitative results when configured with similar FLOPs.

Abstract:
Scene Graph Generation (SGG) aims to parse the image as a set of semantics, containing objects and their relations. Currently, SGG methods only present the intuitive detections in the image, such as the triplet "logo on board". Intuitively, we humans can further refine these intuitive detections into rational descriptions like "flower painted on surfboard". However, most existing methods formulate SGG as a straightforward task, limited to one-time prediction in a single-pass pipeline that predicts all the semantics at once. To handle this problem, we propose a novel multi-step reasoning manner for SGG. Concretely, we break SGG into two explicit learning stages: an intuitive training stage (ITS) and a rational training stage (RTS). In the first stage, we follow the traditional SGG processing to detect objects and relationships, yielding an intuitive scene graph. In the second stage, we perform multi-step reasoning to refine the intuitive scene graph. Each step of reasoning consists of two kinds of operations: mask and predict. According to the primary predictions and their confidences, we repeatedly select and mask the low-confidence predictions, whose features are then optimized and predicted again. After several iterations, all of the intuitive semantics gradually tend to be revised with high confidence, yielding a rational scene graph. Extensive experiments on Visual Genome prove the superiority of the proposed method. Additional ablation studies and visualization cases further validate its effectiveness.

Abstract:
Music source separation from a sound mixture remains a big challenge because there often exist heavy overlaps and interactions among similar music signals. In order to correctly separate mixed sources, we propose a novel Fine-grained Cycle-Separation Network (FCSN) for vision-guided music source separation. With the guidance of visual features, the proposed FCSN approach preliminarily separates music sources by minimizing the residual spectrogram, which is calculated by removing the preliminarily separated music spectrograms from the original music mixture. The separation is repeated several times until the residual spectrogram becomes empty or contains only noise. Extensive experiments are performed on three large-scale datasets: MUSIC (MUSIC-21), AudioSet, and VGGSound. Our approach outperforms state-of-the-art approaches on all datasets, and both separation accuracies and visualization results demonstrate its effectiveness in solving the problem of overlap and interaction in music source separation.

Abstract:
In fine-grained image recognition (FGIR), the localization and amplification of region attention is an important factor, which has been explored extensively in convolutional neural network (CNN) based approaches. The recently developed vision transformer (ViT) has achieved promising results in computer vision tasks. Compared with CNNs, its image sequentialization is a brand-new approach. However, ViT is limited in its receptive field size and thus lacks local attention like CNNs due to the fixed size of its patches, and is unable to generate multi-scale features to learn discriminative region attention. To facilitate the learning of discriminative region attention without box/part annotations, we use the strength of the attention weights to measure the importance of the patch tokens corresponding to the raw images. We propose the recurrent attention multi-scale transformer (RAMS-Trans), which uses the transformer's self-attention to recursively learn discriminative region attention in a multi-scale manner. Specifically, at the core of our approach lies the dynamic patch proposal module (DPPM), responsible for guiding region amplification to complete the integration of multi-scale image patches. The DPPM starts with the full-size image patches and iteratively scales up the region attention to generate new patches from global to local, using the intensity of the attention weights generated at each scale as an indicator. Our approach requires only the attention weights that come with ViT itself and can be easily trained end-to-end. Extensive experiments demonstrate that RAMS-Trans performs better than existing works, in addition to efficient CNN models, achieving state-of-the-art results on three benchmark datasets.

Abstract:
In this paper, we build on a concept of self-supervision by taking RGB frames as input to learn to predict both action concepts and auxiliary descriptors, e.g., object descriptors. So-called hallucination streams are trained to predict auxiliary cues, simultaneously fed into classification layers, and then hallucinated at the testing stage to aid the network. We design and hallucinate two descriptors, one leveraging four popular object detectors applied to training videos, and the other leveraging image- and video-level saliency detectors. The first descriptor encodes the detector- and ImageNet-wise class prediction scores, confidence scores, and spatial locations of bounding boxes and frame indexes to capture the spatio-temporal distribution of features per video. The other descriptor encodes spatio-angular gradient distributions of saliency maps and intensity patterns. Inspired by the characteristic function of the probability distribution, we capture four statistical moments on the above intermediate descriptors. As the numbers of coefficients in the mean, covariance, coskewness, and cokurtosis grow linearly, quadratically, cubically, and quartically w.r.t. the dimension of the feature vectors, we describe the covariance matrix by its leading n' eigenvectors (the so-called subspace) and we capture skewness/kurtosis rather than the costly coskewness/cokurtosis. We obtain state-of-the-art results on five popular datasets such as Charades and EPIC-Kitchens.
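
The cost argument for the four moments can be made concrete with a short sketch. It counts the entries of the full moment tensors for a d-dimensional feature and computes the cheaper per-dimension skewness and kurtosis instead. The function names and the population (biased) estimators are assumptions made for illustration, and the leading-eigenvector description of the covariance is not shown.

```python
import math


def moment_tensor_sizes(d):
    """Entry counts of the full mean, covariance, coskewness, cokurtosis tensors."""
    return {"mean": d, "covariance": d ** 2, "coskewness": d ** 3, "cokurtosis": d ** 4}


def per_dimension_moments(features):
    """Mean, variance, skewness, and kurtosis for each feature dimension.

    `features` is a non-empty list of equal-length vectors (e.g., one per frame).
    Returns one (mean, variance, skewness, kurtosis) tuple per dimension,
    so the output grows linearly with the dimension d.
    """
    n = len(features)
    d = len(features[0])
    stats = []
    for j in range(d):
        column = [row[j] for row in features]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        std = math.sqrt(var)
        if std == 0.0:
            skew, kurt = 0.0, 0.0  # constant dimension: higher moments set to zero
        else:
            skew = sum(((x - mean) / std) ** 3 for x in column) / n
            kurt = sum(((x - mean) / std) ** 4 for x in column) / n
        stats.append((mean, var, skew, kurt))
    return stats
```

For d = 100, the full cokurtosis tensor would have 100,000,000 entries, whereas the per-dimension kurtosis needs only 100 values.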

Abstract:
The task of skeleton-based action recognition remains a core challenge in human-centred scene understanding due to the multiple granularities and large variation in human motion. Existing approaches typically employ a single neural representation for different motion patterns, which has difficulty in capturing fine-grained action classes given limited training data. To address the aforementioned problems, we propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification that jointly models the coarse- and fine-grained skeleton motion patterns. To this end, we develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions in an effective and efficient manner. Moreover, our network utilises a cross-head communication strategy to mutually enhance the representations of both heads. We conduct extensive experiments on three large-scale datasets, namely NTU RGB+D 60, NTU RGB+D 120, and Kinetics-Skeleton, and achieve state-of-the-art performance on all the benchmarks, which validates the effectiveness of our method.

Abstract:
Improving robustness against data missing has become one of the core challenges in Multimodal Sentiment Analysis (MSA), which aims to judge speaker sentiments from the language, visual, and acoustic signals. In the current research, translation-based methods and tensor regularization methods are proposed for MSA with incomplete modality features. However, both of them fail to cope with random modality feature missing in non-aligned sequences. In this paper, a transformer-based feature reconstruction network (TFR-Net) is proposed to improve the robustness of models for the random missing in non-aligned modality sequences. First, intra-modal and inter-modal attention-based extractors are adopted to learn robust representations for each element in modality sequences. Then, a reconstruction module is proposed to generate the missing modality features. With the supervision of SmoothL1Loss between generated and complete sequences, TFR-Net is expected to learn semantic-level features corresponding to missing features. Extensive experiments on two public benchmark datasets show that our model achieves good results against data missing across various missing modality combinations and various missing degrees.
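The reconstruction supervision mentioned above can be sketched with the usual Smooth L1 definition (quadratic below a threshold `beta`, linear above it). Restricting the loss to positions flagged as missing, and the names `smooth_l1_masked` and `missing`, are assumptions made for illustration rather than details from the paper.

```python
def smooth_l1_masked(pred, target, missing, beta=1.0):
    """Mean Smooth L1 loss over the positions where missing[i] is True."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, missing):
        if not m:
            continue
        d = abs(p - t)
        # Quadratic for small errors, linear for large ones.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
        count += 1
    return total / count if count else 0.0


generated = [0.0, 2.0, 7.0]   # reconstructed features (toy values)
complete = [0.5, 0.0, 7.5]    # ground-truth complete features
missing = [True, True, False] # only the first two were dropped
loss = smooth_l1_masked(generated, complete, missing)
print(loss)
```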

Abstract:
Superimposing visible watermarks on images provides a powerful weapon to cope with the copyright issue. Watermark removal techniques, which can strengthen the robustness of visible watermarks in an adversarial way, have attracted increasing research interest. Modern watermark removal methods perform watermark localization and background restoration simultaneously, which can be viewed as a multi-task learning problem. However, existing approaches suffer from incompletely detected watermarks and degraded texture quality of the restored background. Therefore, we design a two-stage multi-task network to address the above issues. The coarse stage consists of a watermark branch and a background branch, in which the watermark branch self-calibrates the roughly estimated mask and passes the calibrated mask to the background branch to reconstruct the watermarked area. In the refinement stage, we integrate multi-level features to improve the texture quality of the watermarked area. Extensive experiments on two datasets demonstrate the effectiveness of our proposed method.

Abstract:
Contour detection plays an important role in both academic research and real-world applications. As the basic building block of many applications, its accuracy and efficiency strongly influence the subsequent stages. In this work, we propose a novel lightweight system for contour detection that achieves state-of-the-art performance while keeping an ultra-slim model size. The proposed method is built on an efficient encoder in a bottom-up/top-down fashion. Specifically, we propose a novel decoder that compresses side features from an encoder and effectively decodes compact contextual information for highly accurate boundary localization. Besides, we propose a novel loss function that helps a model produce crisp object boundaries.

Abstract:
Pedestrian detection in night surveillance is a challenging yet largely unexplored task. Given the success of detectors in daytime surveillance and the convenient acquisition of all-weather data, we learn knowledge from these data to benefit pedestrian detection in night surveillance. We find two key properties of surveillance: distribution cross-time consistency and background cross-frame constancy. This paper proposes consistency-constancy bi-knowledge learning (CCBL) for pedestrian detection in night surveillance, which simultaneously acquires useful knowledge for night pedestrian detection from both day and night surveillance. Firstly, based on the robustness of the existing detector in day surveillance, we obtain the pedestrians' distribution in the daytime scene from the detector's results on that scene. Based on the consistency of pedestrians' distribution during the day and night in the same scene, the pedestrian distribution from daytime is used as the consistency knowledge for pedestrian detection in night surveillance. Secondly, the background, as constant knowledge of the surveillance scene, is extractable and contributes to the division of the foreground, which contains most of the pedestrian regions and helps in pedestrian detection for night surveillance. Finally, we let the two knowledge representations promote each other and merge them into the final pedestrian representation. In extensive experiments, CCBL significantly outperforms the state-of-the-art methods on public pedestrian detection datasets. On the NightSurveillance dataset, CCBL reduces the average missed detection rate by 3.04% compared to the existing best method.

Abstract:
Visual Question Answering (VQA) is a vital yet challenging task in the field of multimedia comprehension. In order to correctly answer questions about an image, a VQA model needs to sufficiently understand the visual scene, especially the vision-semantic reasoning between the two modalities. Traditional relation-based methods encode the pairwise relations of objects to boost VQA model performance. However, this simple strategy is insufficient to exploit the abundant concepts expressed by the composition of diverse image objects, leading to sub-optimal performance. In this paper, we propose a focal and composed vision-semantic modeling method, which is a trainable end-to-end model, for better vision-semantic redundancy removal and compositionality modeling. Concretely, we first introduce the LENA cell, a plug-and-play reasoning module, which removes redundant semantics by a focal mechanism in the first step, followed by vision-semantic compositionality modeling for better visual reasoning. We then incorporate the cell into a full LENA network, which progressively refines multimodal composed representations and can be leveraged to infer high-order vision-semantics in a multi-step learning way. Extensive experiments on two benchmark datasets, i.e., VQA v2 and VQA-CP v2, verify the superiority of our model as compared with several state-of-the-art baselines.

Abstract:
Despite the success of deep neural networks (DNNs) on sequential data (e.g., scene text and speech) recognition, they suffer from the over-confidence problem, mainly due to overfitting in training with the cross-entropy loss, which may make the decision-making less reliable. Confidence calibration has recently been proposed as one effective solution to this problem. Nevertheless, the majority of existing confidence calibration methods aim at non-sequential data and are limited when directly applied to sequential data, since the intrinsic contextual dependency in sequences or the class-specific statistical prior is seldom exploited. To this end, we propose a Context-Aware Selective Label Smoothing (CASLS) method for calibrating sequential data. The proposed CASLS fully leverages the contextual dependency in sequences to construct confusion matrices of contextual prediction statistics over different classes. Class-specific error rates are then used to adjust the weights of smoothing strength in order to achieve adaptive calibration. Experimental results on sequence recognition tasks, including scene text recognition and speech recognition, demonstrate that our method achieves state-of-the-art performance.
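To make the idea of class-specific smoothing strength concrete, here is a minimal sketch that builds a soft target from one confusion matrix. The scaling rule (smoothing mass proportional to the class error rate, spread over the classes it is confused with) and the names `smoothed_target` and `eps_max` are assumptions for illustration; the context conditioning described in the abstract is omitted.

```python
def smoothed_target(true_cls, confusion, eps_max=0.1):
    """Soft label for true_cls. confusion[c][k] counts how often class c
    was predicted as class k on held-out data (assumed input)."""
    row = confusion[true_cls]
    total = sum(row)
    errors = total - row[true_cls]
    if total == 0 or errors == 0:
        # Never confused: keep the hard one-hot label.
        return [1.0 if k == true_cls else 0.0 for k in range(len(row))]
    eps = eps_max * errors / total          # stronger smoothing for error-prone classes
    target = [eps * c / errors for c in row]  # spread eps over confused classes
    target[true_cls] = 1.0 - eps
    return target


confusion = [
    [8, 2, 0],   # class 0 is mistaken for class 1 in 20% of cases
    [1, 9, 0],
    [0, 0, 10],  # class 2 is never confused
]
soft0 = smoothed_target(0, confusion)
soft2 = smoothed_target(2, confusion)
print(soft0, soft2)
```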

Abstract:
Sarcasm is a sophisticated linguistic act that expresses the incongruity of someone's implied sentiment, and it is a pervasive phenomenon on social media platforms. Compared with sarcasm detection purely on texts, multi-modal sarcasm detection is better suited to the rapidly growing social media platforms, where people are interested in creating multi-modal messages. For multi-modal sarcasm detection on tweets consisting of texts and images on Twitter, the key to improving performance is how to determine the incongruity relations between texts and images. In this paper, we investigate multi-modal sarcasm detection from a novel perspective, determining the sentiment inconsistencies within a certain modality and across different modalities by constructing heterogeneous in-modal and cross-modal graphs (InCrossMGs) for each multi-modal example. Based on these graphs, we explore an interactive graph convolution network (GCN) structure to jointly and interactively learn the incongruity relations of in-modal and cross-modal graphs for determining the significant clues in sarcasm detection. Experimental results demonstrate that our proposed model achieves state-of-the-art performance in multi-modal sarcasm detection.

Abstract:
Most applications of interactive multimedia require the data to arrive within a specific acceptable end-to-end latency (i.e., meeting a deadline). To avoid wasted effort, the content must reach the destination before the deadline. In our work, we propose DAP (Deadline And Priority-aware congestion control) to achieve high throughput within acceptable end-to-end latency, especially to send high-priority packets while meeting deadline requirements. DAP is mainly composed of two modules: i) the scheduler decides which packet should be sent first according to a reward function that fully considers the packets' priority, deadline, and current network conditions; ii) the deadline-sensitive congestion control module transmits packets with high efficiency while guaranteeing the end-to-end latency. Specifically, we propose an improved packet-pair scheme to adjust the congestion window toward the Bandwidth-Delay Product and to update the instantaneous sending rate according to the current queue length. Experimental results demonstrate the strong performance of our scheme, and DAP ranks first in both the training phase and the final phase of the ACM MM 2021 Grand Challenge: Meet Deadline Requirements.
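The relation between a packet-pair bandwidth estimate and a congestion window sized to the Bandwidth-Delay Product can be shown with a small worked example. This is the classic packet-pair estimate, not the improved scheme of the paper; the function names and the example numbers are assumptions.

```python
def packet_pair_bandwidth_bps(packet_bytes, gap_us):
    """Classic packet-pair estimate: two back-to-back packets arrive gap_us
    microseconds apart, so the bottleneck carries packet_bytes per gap."""
    return packet_bytes * 8 * 1_000_000 // gap_us


def bdp_window_packets(bandwidth_bps, rtt_ms, packet_bytes):
    """Congestion window (in packets) that fills the pipe without queuing."""
    bdp_bytes = bandwidth_bps * rtt_ms // (8 * 1000)
    return max(1, bdp_bytes // packet_bytes)


# Hypothetical measurements: 1250-byte packets arriving 1 ms apart, 40 ms RTT.
bandwidth = packet_pair_bandwidth_bps(packet_bytes=1250, gap_us=1000)
cwnd = bdp_window_packets(bandwidth, rtt_ms=40, packet_bytes=1250)
print(bandwidth, cwnd)
```

With these numbers the estimate is 10 Mbit/s and the window is 40 packets; a deadline-sensitive controller would then shrink the sending rate as the observed queue grows.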

Abstract:
This paper describes our bronze-medal solution for the video captioning task of the ACMMM2021 Pre-Training for Video Understanding Challenge. We depart from the Bottom-Up-Top-Down model, with technical improvements on both video content encoding and caption decoding. For encoding, we propose to extract multi-level video features that describe holistic scenes and fine-grained key objects, respectively. The scene-level and object-level features are enhanced separately by multi-head self-attention mechanisms before feeding them into the decoding module. Towards generating content-relevant and human-like captions, we train our network end-to-end by semantic-reinforced learning. Finally, in order to select the best caption from captions produced by distinct models, we perform caption reranking by cross-modal matching between a given video and each candidate caption. Both internal experiments on the MSR-VTT test set and external evaluations by the challenge organizers justify the viability of the proposed solution.

Abstract:
This article introduces the solution of the third-place team, green hand, for the ACM Multimedia 2021 Security AI Challenger Phase 7: Robust defense competition for e-commerce logo detection. In this work, we use the DetectoRS model and four strategies: resampling, equalization loss v2, data augmentation, and weighted boxes fusion. These strategies aim to address the three main problems in the competition: small target detection, long-tail distribution, and adversarial examples. The final results show that our model achieves an evaluation score of 0.654611 in the semi-final, ranking first in the semi-final and third in the final among all 36,489 teams.

Abstract:
Logo detection is an important task in the intellectual property protection in e-commerce. In the paper, we introduce our solution for the ACM MM2021 Robust Logo Detection Grand Challenge. The competition requires the detection of logos (515 categories) in e-commerce images. This competition is challenged by long-tail distribution, small objects, and different types of noises. To overcome these challenges, we built a highly optimized and robust detector. We first tested many effective techniques for general object detection and then focused on data augmentation. We found that data augmentation was effective in improving the performance and robustness of logo detection. Based on the combination of these techniques, we achieved APs of 64.6% and 61.3% on the clean and noisy datasets respectively, which were improved by 8.1% and 19.5% relative to the official baseline. We ranked 5th among 36489 teams in the competition.

Abstract:
Emerging multimedia applications like VR, AR, etc., exhibit unique transmission features, such as block-based transmission, dynamic prioritization for different contents, and deadline-aware delivery, which should be carefully managed but fail to be considered in the design of existing transmission control algorithms. In this work, we propose a delay-sensitive congestion control algorithm with a hybrid of coarse-grained and fine-grained control to improve the QoE scores. The coarse-grained control scheme maintains a low queuing delay and avoids missing the deadline in the steady state. The fine-grained control scheme rapidly reacts to the network dynamics based on our bandwidth estimation model. For the block scheduling, we heuristically model the realistic priority of each block by examining the trade-off among the remaining time, the remaining size, and the priority score of each block. Extensive experiments are conducted to evaluate the performance of our solution, which show that our solution significantly outperforms other baseline algorithms.
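A block scheduler that weighs remaining time, remaining size and priority might look like the sketch below. The scoring formula, the feasibility check and the dictionary fields are all hypothetical choices made for illustration; the abstract does not specify the actual heuristic.

```python
def pick_block(blocks, bandwidth_bytes_per_s):
    """Return the block to send next, or None if no block can still meet
    its deadline at the estimated bandwidth."""
    best, best_score = None, None
    for block in blocks:
        send_time = block["remaining_bytes"] / bandwidth_bytes_per_s
        if send_time > block["remaining_s"]:
            continue  # would miss its deadline anyway, do not waste bandwidth
        slack = block["remaining_s"] - send_time
        score = block["priority"] / (1.0 + slack)  # urgent, important blocks first
        if best_score is None or score > best_score:
            best, best_score = block, score
    return best


blocks = [
    {"id": "A", "priority": 3, "remaining_bytes": 1000, "remaining_s": 0.5},
    {"id": "B", "priority": 1, "remaining_bytes": 1000, "remaining_s": 0.2},
    {"id": "C", "priority": 5, "remaining_bytes": 10000, "remaining_s": 0.5},
]
chosen = pick_block(blocks, bandwidth_bytes_per_s=10000)
print(chosen["id"])
```

Block C has the highest priority but needs a full second to send, so it is skipped in favour of a block that can still arrive in time.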

Abstract:
Delay-sensitive multimedia streaming applications require their data to be delivered before a deadline to be useful. The data transmitted by these applications can usually be partitioned into blocks with different priorities, assigned based on the impact of a block on the Quality of Experience (QoE) if it misses its delivery deadline. Meeting these deadline requirements is challenging due to the dynamics of the network and these applications' high demand on network resources. To encourage the research community to address this challenge, we organize the "Meet Deadline Requirements" Grand Challenge at ACM Multimedia 2021. This grand challenge provides a simulation platform on which the participants can implement their block scheduler and bandwidth estimator and then benchmark against each other using a common set of application traces and network traces.

Abstract:
Playtesting is widely performed in the game industry to gauge the difficulty of a game. A large number of test participants with different skills must be recruited for reliable test results, resulting in high costs. Automated playtesting based on player simulation is expected to reduce playtesting costs. Still, it has not yet been widely applied due to the lack of a method that realistically simulates players' gameplays with different skills. Based on a cognitive model of sensorimotor coordination that explains the human button input process, we propose a novel automated playtesting technique that predicts the game difficulty experienced by players with different skills in moving-target acquisition (MTA) games. The model has free parameters representing the inherent skills of players. Once the parameters are obtained for a specific population (e.g., seniors), it is possible to estimate the game difficulty at the population level in multiple games. We applied the technique to two simple MTA games and showed that it could predict the relative difference in game difficulties experienced by players with different skills.

Abstract:
Visual dialog is a fundamental vision-language task where an AI agent holds a meaningful natural-language dialogue about visual content with humans. However, this task remains challenging, since there is still no consensus on how to capture the rich visual contextual information contained in the environment rather than only focusing on visual objects. Furthermore, conventional methods suffer from the single-answer learning strategy, which accepts only one correct answer without considering the diverse expressions of the language (i.e., one identical meaning but multiple expressions via rephrasing, adopting synonyms, etc.). In this paper, we introduce Contextual-Aware Representation and linguistic-diverse Expression (CARE), a novel plug-and-play framework with contextual-based graph embedding and curriculum contrastive learning to solve the above two issues. Specifically, the contextual-based graph embedding (CGE) module aims to integrate the environmental context information with visual objects to improve the answer quality. In addition, we propose a curriculum contrastive learning (CCL) paradigm to imitate the learning habits of humans when facing a question with multiple correct answers sharing the same meaning but with diverse expressions. To support CCL, a CCL loss is designed to progressively strengthen the model's ability to identify the answers with correct semantics. Extensive experiments are conducted on two benchmark datasets, and our proposed method outperforms the state-of-the-art methods by a considerable margin on VisDial V1.0 (4.63% NDCG) and VisDial V0.9 (1.27% MRR, 1.74% R@1, 0.87% R@5, 1.28% R@10, 0.26 Mean).
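For readers unfamiliar with the retrieval metrics quoted above, the following sketch computes MRR, R@k and mean rank from the rank of the ground-truth answer in each dialog round, using their standard definitions. The function name and the toy ranks are assumptions; NDCG is not covered.

```python
def retrieval_metrics(ranks, ks=(1, 5, 10)):
    """ranks[i] is the 1-based position of the ground-truth answer in the
    model's sorted candidate list for round i."""
    n = len(ranks)
    metrics = {
        "MRR": sum(1.0 / r for r in ranks) / n,  # mean reciprocal rank
        "Mean": sum(ranks) / n,                  # mean rank, lower is better
    }
    for k in ks:
        metrics[f"R@{k}"] = sum(r <= k for r in ranks) / n  # recall at k
    return metrics


scores = retrieval_metrics([1, 2, 4, 10])
print(scores)
```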

Abstract:
The current state-of-the-art image-sentence retrieval methods implicitly align the visual-textual fragments, like regions in images and words in sentences, and adopt attention modules to highlight the relevance of cross-modal semantic correspondences. However, the retrieval performance remains unsatisfactory due to a lack of consistent representation in both semantics and structural spaces. In this work, we propose to address the above issue from two aspects: (i) constructing intrinsic structure (along with relations) among the fragments of respective modalities, e.g., "dog → play → ball" in semantic structure for an image, and (ii) seeking explicit inter-modal structural and semantic correspondence between the visual and textual modalities.

Abstract:
Monocular Depth Estimation (MDE) is a fundamental task in computer vision and multimedia. With the wide application of deep Convolutional Neural Networks (CNNs), learning-based methods have achieved superior performance on MDE tasks in recent years. Because loss functions are important for training an accurate CNN with good generalization performance, nearly all previous efforts propose powerful loss functions with carefully hand-crafted regularizers (e.g., gradient loss and normal loss) added to the basic depth L1 loss. However, hand-crafted regularizers require rich domain knowledge, and their performance still cannot be guaranteed. In this paper, we learn a new regularizer, approximated by a tiny CNN called Regularizer-Net (RN), and train it in an adversarial way. As demonstrated experimentally, our learned regularizer notably outperforms the current state-of-the-art methods in both quantitative evaluation and qualitative visualization on the benchmark NYU-Depth-v2 dataset, and generalizes well to the new ScanNet dataset without any further training. Our code will be released soon.

Abstract:
In urban surveillance systems, finding a specific vehicle in video frames efficiently and accurately has always been an essential part of traffic supervision and criminal investigation. Existing studies focus on vehicle re-identification (re-ID), but vehicle search is still underexplored. These methods depend on the locations of many vehicles (bounding boxes) that are not available in most real-world applications. Therefore, the unsupervised joint study of vehicle location and identification for the observed scene is a pressing need. Inspired by person search, we conduct a study on vehicle search while considering four main discrepancies between them, summarized as: 1) It is challenging to select the candidate regions for the observed vehicle due to the perspective differences (front or side); 2) The sides of vehicles of the same type are almost the same, resulting in smaller inter-class differences; 3) There is no satisfactory dataset for vehicle search that meets practical scenarios; 4) Supervised search methods rely on datasets with expensive annotations. To address these issues, we have established a new vehicle search dataset. We design an unsupervised framework on this benchmark dataset to generate pseudo labels for further training existing vehicle re-ID or person search models. Experimental results reveal that these methods become less effective on vehicle search tasks. Therefore, the vehicle search task needs to be further developed, and this dataset can advance the research of vehicle search. https://github.com/zsl1997/VSW.

Abstract:
Cross-modal retrieval is an important multimedia research area which aims to take one type of data as the query to retrieve relevant data of another type. Most of the existing methods follow the paradigm of pair-wise learning and class-level learning to generate a common embedding space, where the similarity of heterogeneous multimodal samples can be calculated. However, in contrast to large-scale cross-modal retrieval applications which often need to tackle multiple modalities, previous studies on cross-modal retrieval mainly focus on two modalities (i.e., text-image or text-video). In addition, for large-scale cross-modal retrieval with modality diversity, another important problem is that the available training data are considerably modality-imbalanced. In this paper, we focus on the challenging problem of modality-imbalanced cross-modal retrieval, and propose a Multimodal Coordinated Clustering Network (MCCN) which consists of two modules, Multimodal Coordinated Embedding (MCE) module to alleviate the imbalanced training data and Multimodal Contrastive Clustering (MCC) module to tackle the imbalanced optimization. The MCE module develops a data-driven approach to coordinate multiple modalities via multimodal semantic graph for the generation of modality-balanced training samples. The MCC module learns class prototypes as anchors to preserve the pair-wise and class-level similarities across modalities for intra-class compactness and inter-class separation, and further introduces intra-class and inter-class margins to enhance optimization flexibility. We conduct experiments on the benchmark multimodal datasets to verify the effectiveness of our proposed method.

Abstract:
With the rapid development of various multimedia applications, research on image compression technology has become particularly important. Learning-based compression methods have developed rapidly and achieved excellent rate-distortion performance. Most existing research has focused on designing a better entropy model to facilitate probability estimation, without attaching importance to how to extract features from images more effectively. However, information extracted by image compression networks is often not realistic and complete enough, especially when the fixed-shape receptive field of the compression network crosses the texture boundary of an image. In this paper, we propose to extract high-fidelity image features adaptively with local textures as the basic unit, which significantly improves the quality of the extracted information and enhances the compactness of the latent representation of the image. Besides, a cross-information-fusion gate is proposed to fuse the two features extracted from the adaptive image feature extraction branch and the main compression branch, reducing spatial redundancy in the latent representation. Experimental results demonstrate that our proposed method achieves superior performance compared to existing learned image compression methods and traditional codecs, and produces visually pleasing reconstructed images with high-fidelity details.

Abstract:
Zero-shot sketch-based image retrieval (ZS-SBIR) is challenging because of the modal gap between the distributions of sketches and images and the inconsistency of label spaces during training and testing. Previous methods mitigate the modal gap by projecting sketches and images into a joint embedding space. Most of them also bridge seen and unseen classes by leveraging semantic embeddings, i.e., word vectors and hierarchical similarities. In this paper, we propose Relationship-Preserving Knowledge Distillation (RPKD) to study generalizable embeddings from the perspective of knowledge distillation, bypassing the usage of semantic embeddings. In particular, we first distill the instance-level knowledge to preserve inter-class relationships without semantic similarities that require extra effort to collect. We also reconcile the contrastive relationships among instances between different embedding spaces, which is complementary to instance-level relationships. Furthermore, embedding-induced supervision, which measures the similarities of an instance to partial class embedding centers from the teacher, is developed to align the student's classification confidences. Extensive experiments conducted on three benchmark ZS-SBIR datasets, i.e., Sketchy, TU-Berlin, and QuickDraw, demonstrate the superiority of our proposed RPKD approach compared to the state-of-the-art methods.

Abstract:
Wide-baseline image interpolation is useful in many multimedia applications such as virtual street roaming and 3D TV. It is also a challenging problem because the large translations and rotations of image patches make it hard to estimate the motion fields between wide-baseline image pairs. We propose a refinement strategy based on salient error detection to improve the result of existing approaches of wide-baseline image interpolation, where we combine the advantages of methods based on piecewise-linear transformation and methods based on variational model. We first use a lightweight interpolation method to estimate the initial motion field between the input image pair, and synthesize the intermediate image as the initial result. Then we detect regions with noticeable artifacts in the initial image to find areas whose motion vectors should be refined. Finally, we refine the motion field of the detected regions using a variational model based method, and obtain the refined intermediate image. The refinement strategy of our method can be used as the post refinement step for many other image interpolation algorithms. We show the effectiveness and efficiency of our method through experiments on different datasets.

Abstract:
Deep learning has achieved significant success in multimedia fields involving computer vision, natural language processing, and acoustics. However, research in adversarial learning also shows that deep learning models are highly vulnerable to adversarial examples. Extensive works have demonstrated that adversarial examples can easily fool deep neural networks into making wrong predictions, threatening practical deep learning applications in both the digital and physical worlds. Though challenging, discovering and harnessing adversarial attacks is beneficial for diagnosing model blind spots and for further understanding and improving multimedia systems in practice. In this workshop, we aim to bring together researchers from the fields of adversarial machine learning, model robustness, and explainable AI to discuss recent research and future directions for adversarial robustness of deep learning models, with a particular focus on multimedia applications, including computer vision, acoustics, etc. As far as we know, this is the first workshop to focus on adversarial learning of multimedia deep learning systems, which is of great significance, and we hope it will be held annually in conjunction with ACM MM.

Abstract:
The second edition of the International Workshop on Multimodal Conversational AI puts forward a diverse set of contributions that aim to brainstorm this new field. Conversational agents are now becoming a commodity as this technology is being applied to a wide range of domains. Healthcare, assistive technologies, e-commerce, and information seeking are some of the domains where multimodal conversational AI is being explored. The wide use of multimodal conversational agents exposes the many challenges in achieving more natural, human-like, and engaging conversational agents. The research contributions of the Workshop actively address several relevant challenges: How to include assistive technologies in dialog systems? How can agents engage in negotiation in dialogs? How to handle the embodiment of conversational agents?

Abstract:
In this work, we propose a new framework, called Document Image Transformer (DocTr), to address the issue of geometry and illumination distortion of document images. Specifically, DocTr consists of a geometric unwarping transformer and an illumination correction transformer. Using a set of learned query embeddings, the geometric unwarping transformer captures the global context of the document image through the self-attention mechanism and decodes the pixel-wise displacement solution to correct the geometric distortion. After geometric unwarping, our illumination correction transformer further removes the shading artifacts to improve the visual quality and OCR accuracy. Extensive evaluations are conducted on several datasets, and superior results are reported against the state-of-the-art methods. Remarkably, our DocTr achieves 20.02% Character Error Rate (CER), a 15% absolute improvement over the state-of-the-art methods. Moreover, it also shows high efficiency in running time and parameter count.
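The Character Error Rate quoted above is conventionally the character-level edit distance between the OCR output and the reference text, divided by the reference length. The sketch below follows that standard definition; it is not taken from the DocTr evaluation code, and the example strings are arbitrary.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # delete r
                           cur[j - 1] + 1,           # insert h
                           prev[j - 1] + (r != h)))  # substitute or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate of hypothesis hyp against reference ref."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)


rate = cer("kitten", "sitting")
print(rate)
```

Because insertions count as errors, CER can exceed 1.0 when the OCR output is much longer than the reference.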

Abstract:
Over the last few decades, artificial intelligence research has made tremendous strides, but it still heavily relies on fixed datasets in stationary environments. Continual learning is a growing field of research that examines how AI systems can learn sequentially from a continuous stream of linked data in the same way that biological systems do. Simultaneously, fake media such as deepfakes and synthetic face images have emerged as a significant concern for current multimedia technologies. Recently, numerous methods have been proposed that can detect deepfakes with high accuracy. However, they suffer significantly due to their reliance on fixed datasets in limited evaluation settings. Therefore, in this work, we apply continual learning to neural networks' learning dynamics, emphasizing its potential to increase data efficiency significantly. We propose the Continual Representation using Distillation (CoReD) method, which employs the concepts of Continual Learning (CL), Representation Learning (RL), and Knowledge Distillation (KD). We design CoReD to perform sequential domain adaptation tasks on new deepfake and GAN-generated synthetic face datasets, while effectively minimizing catastrophic forgetting in a teacher-student model setting. Our extensive experimental results demonstrate that our method is efficient at domain adaptation to detect low-quality deepfake videos and GAN-generated images from several datasets, outperforming the state-of-the-art baseline methods.
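The role of knowledge distillation in a teacher-student setting can be sketched with the widely used temperature-softened KL divergence. This is the generic distillation loss, shown only as an illustration; the exact loss used by CoReD is not given in the abstract, and the function names and temperature are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Toy real/fake logits: the student drifts from the teacher on the old domain.
same = distillation_loss([2.0, -1.0], [2.0, -1.0])
drift = distillation_loss([0.0, 0.5], [2.0, -1.0])
print(same, drift)
```

Adding such a term to the loss on a new dataset penalizes the student for forgetting what the teacher predicted, which is the intuition behind using distillation against catastrophic forgetting.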

Abstract:
Recent research shows that in dyadic and group interactions individuals' nonverbal behaviours are influenced by the behaviours of their conversational partner(s). Therefore, in this work we hypothesise that during a dyadic interaction, the target subject's facial reactions are driven by two main factors: (i) their internal (person-specific) cognition, and (ii) the externalised nonverbal behaviours of their conversational partner. Subsequently, our novel proposition is to simulate and represent the target subject's (i.e., the listener) cognitive process in the form of a person-specific CNN architecture whose input is the audio-visual non-verbal cues displayed by the conversational partner (i.e., the speaker), and the output is the target subject's (i.e., the listener) facial reactions. We then undertake a search for the optimal CNN architecture whose results are used to create a person-specific graph representation for recognising the target subject's personality. The graph representation, fortified with a novel end-to-end edge feature learning strategy, helps with retaining both the unique parameters of the person-specific CNN and the geometrical relationship between its layers. Consequently, the proposed approach is the first work that aims to recognize the true (self-reported) personality of a target subject (i.e., the listener) from the learned simulation of their cognitive process (i.e., parameters of the person-specific CNN). The experimental results show that the CNN architectures are well associated with target subjects' personality traits and the proposed approach clearly outperforms multiple existing approaches that predict personality directly from non-verbal behaviours. In light of these findings, this work opens up a new avenue of research for predicting and recognizing socio-emotional phenomena (personality, affect, engagement etc.) from simulations of person-specific cognitive processes.

Abstract:
High-level representation-guided pixel denoising and adversarial training are independent solutions to enhance the robustness of CNNs against adversarial attacks by pre-processing input data and re-training models, respectively. Most recently, adversarial training techniques have been widely studied and improved, while pixel denoising-based methods have received less attention. However, it is still questionable whether there exists a more advanced pixel denoising-based method and whether the combination of the two solutions benefits each other. To this end, we first comprehensively investigate two kinds of pixel denoising methods for adversarial robustness enhancement (i.e., existing additive-based and unexplored filtering-based methods) under image-level and semantic-level loss functions, respectively, showing that pixel-wise filtering can obtain much higher image quality (e.g., higher PSNR) as well as higher robustness (e.g., higher accuracy on adversarial examples) than the existing pixel-wise additive-based method. However, we also observe that the robustness results of the filtering-based method rely on the perturbation amplitude of the adversarial examples used for training. To address this problem, we propose predictive perturbation-aware & pixel-wise filtering, where dual-perturbation filtering and an uncertainty-aware fusion module are designed and employed to automatically perceive the perturbation amplitude during the training and testing process. The method is termed AdvFilter. Moreover, we combine adversarial pixel denoising methods with three adversarial training-based methods, hinting that considering data and models jointly is able to achieve more robust CNNs. Experiments conducted on the NeurIPS-2017DEV, SVHN, and CIFAR10 datasets show advantages in enhancing CNNs' robustness, as well as high generalization to different models and noise levels.
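
PSNR, the image-quality measure mentioned above, is conventionally computed from the mean squared error between a reference image and a restored one. The sketch below assumes flattened pixel lists and an 8-bit peak value of 255; both are illustrative assumptions, since the abstract does not give the evaluation details.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the denoised image is closer to the clean reference, which is the sense in which the abstract reports "higher PSNR" for filtering.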

Abstract:
Person re-identification (ReID) aims to re-identify a person from non-overlapping camera views. Since person ReID data contains sensitive personal information, researchers have adopted federated learning, an emerging distributed training method, to mitigate the privacy leakage risks. However, existing studies rely on data labels that are laborious and time-consuming to obtain. We present FedUReID, a federated unsupervised person ReID system to learn person ReID models without any labels while preserving privacy. FedUReID enables in-situ model training on edges with unlabeled data. A cloud server aggregates models from edges instead of centralizing raw data to preserve data privacy. Moreover, to tackle the problem that edges vary in data volumes and distributions, we personalize training in edges with joint optimization of cloud and edge. Specifically, we propose personalized epoch to reassign computation throughout training, personalized clustering to iteratively predict suitable labels for unlabeled data, and personalized update to adapt the server aggregated model to each edge. Extensive experiments on eight person ReID datasets demonstrate that FedUReID not only achieves higher accuracy but also reduces computation cost by 29%. Our FedUReID system with the joint optimization will shed light on applying federated learning to more multimedia tasks without data labels.
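
The server-side aggregation step described above can be illustrated with a data-volume-weighted average of edge model parameters, in the style of federated averaging. This is a minimal sketch under that assumption: the abstract does not specify how FedUReID weights the edges, and the function name and flat parameter lists are hypothetical.

```python
def federated_average(edge_params, edge_sizes):
    """Average per-edge parameter vectors, weighted by each edge's data volume.

    edge_params: list of equally long lists of floats (one list per edge).
    edge_sizes:  number of local samples on each edge (assumed weighting).
    """
    if not edge_params or len(edge_params) != len(edge_sizes):
        raise ValueError("need one sample count per edge model")
    total = sum(edge_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    n_params = len(edge_params[0])
    # Only parameters leave the edges; raw images never reach the server.
    return [
        sum(params[k] * size for params, size in zip(edge_params, edge_sizes)) / total
        for k in range(n_params)
    ]
```

With two edges holding 1 and 3 samples, `federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])` pulls the result toward the larger edge.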

Abstract:
Fine-grained action recognition is attracting increasing attention due to the emerging demand for specific action understanding in real-world applications, whereas the data of rare fine-grained categories is very limited. Therefore, we propose the few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class. Although progress has been made in coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions: the inability to capture subtle action details and the inadequacy in learning from data with low inter-class variance. To tackle the first issue, a human vision inspired bidirectional attention module (BAM) is proposed. Combining top-down task-driven signals with bottom-up salient stimuli, BAM captures subtle action details by accurately highlighting informative spatio-temporal regions. To address the second issue, we introduce contrastive meta-learning (CML). Compared with the widely adopted ProtoNet-based method, CML generates more discriminative video representations for low inter-class variance data, since it makes full use of potential contrastive pairs in each training episode. Furthermore, to fairly compare different models, we establish specific benchmark protocols on two large-scale fine-grained action recognition datasets. Extensive experiments show that our method consistently achieves state-of-the-art performance across evaluated tasks.

Abstract:
Locating lesions is important in the computer-aided diagnosis of X-ray images. However, box-level annotation is time-consuming and laborious. How to locate lesions accurately with few, or even no, careful annotations is an urgent problem. Although several works have approached this problem with weakly-supervised methods, the performance needs to be improved. One obstacle is that general weakly-supervised methods have failed to consider the characteristics of X-ray images, such as the highly-structural attribute. We therefore propose the Cross-chest Graph (CCG), which improves the performance of automatic lesion detection by imitating doctors' training and decision-making process. CCG models the intra-image relationship between different anatomical areas by leveraging the structural information to simulate the doctor's habit of observing different areas. Meanwhile, the relationship between any pair of images is modeled by a knowledge-reasoning module to simulate the doctor's habit of comparing multiple images. We integrate intra-image and inter-image information into a unified end-to-end framework. Experimental results on the NIH Chest-14 database (112,120 frontal-view X-ray images with 14 diseases) demonstrate that the proposed method achieves state-of-the-art performance in weakly-supervised localization of lesions by absorbing professional knowledge in the medical field.

Abstract:
Human pose estimation in crowded scenes remains challenging, as it requires joint comprehension of multiple persons and their keypoints in a highly complex scenario. The top-down mechanism, which is a detect-then-estimate pipeline, has become the mainstream solution for general pose estimation and has made impressive progress. However, simply applying this mechanism to crowded-scene pose estimation results in unsatisfactory performance due to several issues, in particular missing keypoints in crowds and ambiguous labeling during training. To tackle these two issues, we introduce a novel method named Semantic-aware Transfer with Instance-adaptive Parsing (STIP). Specifically, our STIP first enhances the discriminative power of pixel-level representations with a semantic-aware mechanism, where it smartly decides which pixels to enhance and what semantic embeddings to add. In this way, the missing keypoint problem can be alleviated. Secondly, instead of adopting a standard regressor with fixed parameters, we propose a new instance-adaptive parsing method, where it dynamically generates instance-specific parameters for reducing adverse effects caused by ambiguous labeling. Notably, STIP is designed in a plugin fashion and it can be integrated into any top-down model, such as HRNet. Extensive experiments on two challenging benchmarks, i.e., CrowdPose and MS-COCO, demonstrate the superiority and generalizability of our approach.

Abstract:
The Quick Response (QR) code is a popular form of matrix barcode that is widely used to tag online links on print media (e.g., posters, leaflets, and books). However, standard QR codes typically appear as noise-like black/white squares (named modules) which seriously disrupt the attractiveness of their carriers. In this paper, we propose StyleCode-Net, a method to generate novel art-style QR codes which can better match the entire style of their carriers to improve the visual quality. For endowing QR codes with artistic elements, a big challenge is that the scanning-robustness must be preserved after transforming colors and textures. To address this challenge, we propose a module-based deformable convolutional mechanism (MDCM) and a dynamic target mechanism (DTM) in StyleCode-Net. MDCM can extract the features of black and white modules of QR codes respectively. Then, the extracted features are fed to DTM to balance the scanning-robustness and the style representation. Extensive subjective and objective experiments show that our art-style QR codes have reached the state-of-the-art level in both visual quality and scanning-robustness, and these codes have the potential to replace standard QR codes in real-world applications.

Abstract:
Self-supervised learning (SSL) has proved very effective in learning representations from unlabeled data in language and vision domains. Yet, very few instrumental self-supervised approaches exist for 3D skeleton action understanding, and directly applying the existing SSL methods from other domains to skeleton action learning may suffer from misalignment of representations and some limitations. In this paper, we consider that a good representation learning encoder can distinguish the underlying features of different actions, which can pull similar motions closer while pushing dissimilar motions away. There exist, however, some uncertainties in the skeleton actions due to the inherent ambiguity of 3D skeleton pose in different viewpoints or the sampling algorithm in contrastive learning; thus, it is ill-posed to differentiate the action features in the deterministic embedding space. To address these issues, we rethink the distance between action features and propose to model each action representation in the probabilistic embedding space to alleviate the uncertainties upon encountering ambiguous 3D skeleton inputs. To validate the effectiveness of the proposed method, extensive experiments are conducted on the Kinetics, NTU60, NTU120, and PKUMMD datasets with several alternative network architectures. Experimental evaluations demonstrate the superiority of our approach, through which we can gain significant performance improvement without using extra labeled data.

Abstract:
Visible-infrared person re-identification (VI-ReID) aims to search identities of pedestrians across different spectra. In this task, one of the major challenges is the modality discrepancy between the visible (VIS) and infrared (IR) images. Some state-of-the-art methods try to design complex networks or generative methods to mitigate the modality discrepancy while ignoring the highly non-linear relationship between the two modalities of VIS and IR. In this paper, we propose a non-linear middle modality generator (MMG), which helps to reduce the modality discrepancy. Our MMG can effectively project VIS and IR images into a unified middle modality image (UMMI) space to generate middle-modality (M-modality) images. The generated M-modality images and the original images are fed into the backbone network to reduce the modality discrepancy. Furthermore, in order to pull together the two types of M-modality images generated from the VIS and IR images in the UMMI space, we propose a distribution consistency loss (DCL) to make the modality distribution of the generated M-modality images as consistent as possible. Finally, we propose a middle modality network (MMN) to further enhance the discrimination and richness of features in an explicit manner. Extensive experiments have been conducted to validate the superiority of MMN for VI-ReID over some state-of-the-art methods on two challenging datasets. The gain of MMN is more than 11.1% and 8.4% in terms of Rank-1 and mAP, respectively, even compared with the latest state-of-the-art methods on the SYSU-MM01 dataset.

Abstract:
Motivated by resource-limited scenarios, knowledge distillation (KD) has received growing attention, effectively and quickly producing lightweight yet high-performance student models by transferring the dark knowledge from large teacher models. However, many pre-trained teacher models are downloaded from public platforms that lack necessary vetting, posing a possible threat to knowledge distillation tasks. Unfortunately, thus far, there has been little research considering backdoor attacks from the teacher model on student models in KD, which may pose a severe threat to its wide use. In this paper, we, for the first time, propose a novel Anti-Distillation Backdoor Attack (ADBA), in which the backdoor embedded in the public teacher model can survive the knowledge distillation process and thus be transferred to secret distilled student models. We first introduce a shadow to imitate the distillation process and adopt an optimizable trigger to transfer information to help craft the desired teacher model. Our attack is powerful and effective, achieving 95.92%, 94.79%, and 90.19% average success rates of attacks (SRoAs) against student models with several different structures on MNIST, CIFAR-10, and GTSRB, respectively. Our ADBA also performs robustly under different user distillation environments with 91.72% and 92.37% average SRoAs on MNIST and CIFAR-10, respectively. Finally, we show that the ADBA has a low overhead in the injecting process, which converges within 50 and 70 epochs on CIFAR-10 and GTSRB, respectively, while the normal training epochs of these datasets are almost 200.

Abstract:
Facial action unit (AU) recognition has attracted increasing attention due to its indispensable role in affective computing, especially in the field of affective human-computer interaction. Due to the subtle and transient nature of AUs, it is challenging to capture the delicate and ambiguous motions in local facial regions among consecutive frames. Considering that context is essential to resolve ambiguity in the human visual system, modeling context within or among facial images emerges as a promising approach for the AU recognition task. To this end, we propose CaFGraph, a novel context-aware facial multi-graph that can model both morphological & muscular-based region-level local context and region-level temporal context. CaFGraph is the first work to construct a universal facial multi-graph structure that is independent of both task settings and dataset statistics for almost all fine-grained facial behavior analysis tasks, including but not limited to AU recognition. To make full use of the context, we then present CaFNet that learns context-aware facial graph representations via CaFGraph from facial images for multi-label AU recognition. Experiments on two widely used benchmark datasets, BP4D and DISFA, demonstrate the superiority of our CaFNet over the state-of-the-art methods.

Abstract:
Multi-Object Tracking (MOT) and Person Search both require localizing and identifying specific targets from raw image frames. Existing methods can be classified into two categories, namely the two-step strategy and the end-to-end strategy. Two-step approaches have high accuracy but suffer from costly computations, while end-to-end methods show greater efficiency with limited performance. In this paper, we dissect the gap between the two-step and end-to-end strategies and propose a simple yet effective end-to-end framework with knowledge distillation. Our proposed framework is simple in concept and can easily benefit from external datasets. Experimental results demonstrate that our model performs competitively with other sophisticated two-step and end-to-end methods in multi-object tracking and person search.

Abstract:
Diagram question answering (DQA) is an effective way to evaluate the reasoning ability for diagram semantic understanding, which is a very challenging task and largely understudied compared with natural images. Existing separate two-stage methods for DQA are limited by ineffective feedback mechanisms. To address this problem, in this paper, we propose a novel structural parsing-integrated Hierarchical Multi-Task Learning (HMTL) model for diagram question answering based on a multi-modal transformer framework. In the proposed paradigm of multi-task learning, the two tasks of diagram structural parsing and question answering are at different semantic levels and equipped with different transformer blocks, which constitutes a hierarchical architecture. The structural parsing module encodes the information of constituents and their relationships in diagrams, while the diagram question answering module decodes the structural signals and combines question-answers to infer correct answers. Visual diagrams and textual question-answers interact in the multi-modal transformer, which achieves cross-modal semantic comprehension and reasoning. Extensive experiments on the benchmark AI2D and FOODWEBS datasets demonstrate the effectiveness of our proposed HMTL over other state-of-the-art methods.

Abstract:
People talk with diversified styles. For one piece of speech, different talking styles exhibit significant differences in the facial and head pose movements. For example, the "excited" style usually talks with the mouth wide open, while the "solemn" style is more standardized and seldom exhibits exaggerated motions. Due to such huge differences between different styles, it is necessary to incorporate the talking style into the audio-driven talking face synthesis framework. In this paper, we propose to inject style into the talking face synthesis framework by imitating the arbitrary talking style of a particular reference video. Specifically, we systematically investigate talking styles with our collected Ted-HD dataset and construct style codes as several statistics of 3D morphable model (3DMM) parameters. Afterwards, we devise a latent-style-fusion (LSF) model to synthesize stylized talking faces by imitating talking styles from the style codes. We emphasize the following novel characteristics of our framework: (1) It doesn't require any annotation of the style; the talking style is learned in an unsupervised manner from talking videos in the wild. (2) It can imitate arbitrary styles from arbitrary videos, and the style codes can also be interpolated to generate new styles. Extensive experiments demonstrate that the proposed framework has the ability to synthesize more natural and expressive talking styles compared with baseline methods.

Abstract:
Person Re-Identification (Re-Id) in occlusion scenarios is a challenging problem because a pedestrian can be partially occluded. The use of local information for feature extraction and matching is still necessary. Therefore, we propose a Pose-guided inter- and intra-part relational transformer (Pirt) for occluded person Re-Id, which builds part-aware long-term correlations by introducing transformers. In our framework, we first develop a pose-guided feature extraction module with regional grouping and mask construction for robust feature representations. The positions of a pedestrian in the image under surveillance scenarios are relatively fixed; hence, we propose intra-part and inter-part relational transformers. The intra-part module creates local relations with mask-guided features, while the inter-part module builds correlations with transformers to develop cross relationships between part nodes. With the collaborative learning of inter- and intra-part relationships, experiments reveal that our proposed Pirt model achieves a new state of the art on the public occluded dataset, and further extensions on standard non-occluded person Re-Id datasets also show comparable performance.

Abstract:
Cross-modal retrieval has received considerable attention because it enables users to search for desired information in diversified forms. Existing retrieval methods achieve good performance mainly by relying on complex deep neural networks and high-quality supervision signals, which deters them from real-world resource-constrained development and deployment. In this paper, we propose an effective unsupervised learning framework named JOint-teachinG (JOG) to pursue a high-performance yet light-weight cross-modal retrieval model. The key idea is to utilize the knowledge of a pre-trained model (a.k.a. the "teacher") to endow the to-be-learned model (a.k.a. the "student") with strong feature learning ability and predictive power. Considering that a teacher model serving the same task as the student is not always available, we resort to a cross-task teacher to leverage transferable knowledge to guide student learning. To eliminate the inevitable noise in the distilled knowledge resulting from the task discrepancy, an online knowledge-refinement strategy is designed to progressively improve the quality of the cross-task knowledge in a joint-teaching manner, where a peer student is engaged. In addition, the proposed JOG learns to represent the original high-dimensional data with compact binary codes to accelerate the query processing, further facilitating resource-limited retrieval. Through extensive experiments, we demonstrate that in various network structures, the proposed method can yield promising learning results on widely-used benchmarks. The proposed research is a pioneering work for resource-constrained cross-modal retrieval, which has strong potential to be applied to on-device deployment and is hoped to pave the way for further study.
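
To illustrate why compact binary codes speed up query processing, the sketch below ranks database items by Hamming distance to a query code. It is only an illustration: sign thresholding stands in for the learned hashing in JOG, whose actual code-learning procedure is not described in the abstract, and all function names are hypothetical.

```python
def binarize(features):
    """Map a real-valued feature vector to a binary code (assumed: sign threshold)."""
    return [1 if value > 0 else 0 for value in features]


def hamming_distance(code_a, code_b):
    """Number of positions at which two equally long binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


def rank_by_hamming(query_code, database_codes):
    """Return database indices sorted from nearest to farthest code."""
    return sorted(
        range(len(database_codes)),
        key=lambda index: hamming_distance(query_code, database_codes[index]),
    )
```

Comparing short bit strings needs only bit-level comparisons per item, which is far cheaper than computing distances between high-dimensional real-valued features.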

Abstract:
With interactive branched video, the storyline is typically determined by branch choices made by the user during playback. Despite putting users in control of their viewing experiences, prior work has not considered how to best help users that may want to quickly navigate, explore, or skip parts of the branched video. Such functionalities are important for both impatient users and those rewatching the video. To address this void, we present the design, implementation and evaluation of interface solutions that help users effectively navigate the video, and to identify and explore previously unviewed storylines. Our solutions work with large, general video structures and allow users to effectively forward/rewind the branched structures. Our user study demonstrates the added value of our novel designs, presents promising tradeoffs, provides insights into the pros/cons of different design alternatives, and highlights the features that best address specific tasks and design aspects.

Abstract:
Existing light field based works utilize either views or focal stacks for saliency detection. However, since depth information exists implicitly in adjacent views or different focal slices, it is difficult to exploit scene depth information from both. By comparison, Epipolar Plane Images (EPIs) provide explicit accurate scene depth and occlusion information by projected pixel lines. Due to the fact that the depth of an object is often continuous, the distribution of occlusion edges concentrates more on object boundaries compared with traditional color edges, which is more beneficial for improving accuracy and completeness of saliency detection. In this paper, we propose a learning-based network to exploit occlusion features from EPIs and integrate high-level features from the central view for accurate salient object detection. Specifically, a novel Occlusion Extraction Module is proposed to extract occlusion boundary features from horizontal and vertical EPIs. In order to naturally combine occlusion features in EPIs and high-level features in central view, we design a concise Bi-directional Guiding Flow based on cascaded decoders. The flow leverages generated salient edge predictions and salient object predictions to refine features in mutual encoding processes. Experimental results demonstrate that our approach achieves state-of-the-art performance in both segmentation accuracy and edge clarity.

Abstract:
We present a generative adversarial network to synthesize 3D pose sequences of co-speech upper-body gestures with appropriate affective expressions. Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish between the synthesized pose sequences and real 3D pose sequences. We leverage the Mel-frequency cepstral coefficients and the text transcript computed from the input speech in separate encoders in our generator to learn the desired sentiments and the associated affective cues. We design an affective encoder using multi-scale spatial-temporal graph convolutions to transform 3D pose sequences into latent, pose-based affective features. We use our affective encoder in both our generator, where it learns affective features from the seed poses to guide the gesture synthesis, and our discriminator, where it enforces the synthesized gestures to contain the appropriate affective expressions. We perform extensive evaluations on two benchmark datasets for gesture synthesis from the speech, the TED Gesture Dataset and the GENEA Challenge 2020 Dataset. Compared to the best baselines, we improve the mean absolute joint error by 10-33%, the mean acceleration difference by 8-58%, and the Fréchet Gesture Distance by 21-34%. We also conduct a user study and observe that compared to the best current baselines, around 15.28% of participants indicated our synthesized gestures appear more plausible, and around 16.32% of participants felt the gestures had more appropriate affective expressions aligned with the speech.

Abstract:
Crowd counting has drawn much attention due to its importance in safety-critical surveillance systems. In particular, deep neural network (DNN) methods have significantly reduced estimation errors for crowd counting tasks. Recent studies have demonstrated that DNNs are vulnerable to adversarial attacks, i.e., normal images with human-imperceptible perturbations could mislead DNNs to make false predictions. In this work, we propose a robust attack strategy called Adversarial Patch Attack with Momentum (APAM) to systematically evaluate the robustness of crowd counting models, where the attacker's goal is to create an adversarial perturbation that severely degrades their performance, thus leading to public safety accidents (e.g., stampede accidents). Specifically, the proposed attack leverages the extreme-density background information of input images to generate robust adversarial patches via a series of transformations (e.g., interpolation, rotation, etc.). We observe that by perturbing less than 6% of image pixels, our attacks severely degrade the performance of crowd counting systems, both digitally and physically. To better enhance the adversarial robustness of crowd counting models, we propose the first regression model-based Randomized Ablation (RA), which is more effective than Adversarial Training (ADT) (the Mean Absolute Error of RA is 5 lower than ADT on clean samples and 30 lower than ADT on adversarial examples). Extensive experiments on five crowd counting models demonstrate the effectiveness and generality of the proposed method.

Abstract:
Conventional exemplar-based image colorization tends to transfer colors only from the reference image to the grayscale image based on the semantic correspondence between them. However, its practical capability is limited when semantic correspondence can hardly be found. To overcome this issue, additional information, such as colors from a database, is normally introduced. However, it is a great challenge to consider color information from the reference image and the database simultaneously because there is no unified framework to model different color information and the multi-modal ambiguity in the database cannot be removed easily. Also, it is difficult to fuse different color information effectively. Thus, a general attention-based colorization framework is proposed in this work, where the color histogram of the reference image is adopted as a prior to eliminate the ambiguity in the database. Moreover, a sparse loss is designed to guarantee the success of information fusion. Both qualitative and quantitative experimental results show that the proposed approach achieves better colorization performance compared with the state-of-the-art methods on public databases with different quality metrics.

Abstract:
Recently, deep learning-based depth estimation has shown promising results, especially with the help of sparse depth reference samples. Existing works focus on directly inferring the depth information from sparse samples with high confidence. In this paper, we propose a Heuristic Depth Estimation Network (HDEN) with progressive depth reconstruction and confidence-aware loss. The HDEN leverages the reference samples with low confidence to distill the spatial geometric and local semantic information for dense depth prediction. Specifically, we first train a U-Net network to generate a coarse-level dense reference map. Second, the progressive depth reconstruction module successively reconstructs the fine-level dense depth map from different scales, where a multi-level upsampling block is designed to recover the local structure of objects. Finally, the confidence-aware loss is proposed to trigger the reference samples with low confidence, which forces the model to focus on estimating the depth of tiny structures. Extensive experiments on the NYU-Depth-v2 and KITTI-Odometry datasets show the effectiveness of our method. Visualization results demonstrate that the dense depth maps generated by HDEN have better consistency at entity edges with the RGB image.

Abstract:
Existing methods for outfit compatibility modeling seldom explicitly consider multimodal correlations. In this work, we explore the consistent and complementary correlations for better compatibility modeling. This is, however, non-trivial due to the following challenges: 1) how to separate and model these two kinds of correlations; 2) how to leverage the derived complementary cues to strengthen the text and vision-oriented representations of the given item; and 3) how to reinforce the compatibility modeling with text and vision-oriented representations. To address these challenges, we present a comprehensive multimodal outfit compatibility modeling scheme. It first nonlinearly projects each modality into separable consistent and complementary spaces via multi-layer perceptrons, and then models the consistent and complementary correlations between two modalities by parallel and orthogonal regularization. Thereafter, we strengthen the visual and textual representation of items with complementary information, and further induct both the text-oriented and vision-oriented outfit compatibility modeling. We ultimately employ the mutual learning strategy to reinforce the final performance of compatibility modeling. Extensive experiments demonstrate the superiority of our scheme.

Abstract:
Multi-label image classification aims to predict multiple labels for a single image. However, the difficulty of predicting different labels may vary dramatically due to semantic variations of the label as well as the image context. Directly learning multi-label classification models risks bias and overfitting to the difficult labels; e.g., deep network-based classifiers are over-trained on the difficult labels and therefore produce false-positive errors on those labels during testing. To handle difficult labels in multi-label image classification, we propose to calibrate the model, which not only predicts the labels but also estimates the uncertainty of the prediction. With the new calibration branch of the network, the classification model is trained with the pick-all-labels normalized loss and optimized with respect to the number of positive labels. Moreover, to improve performance on difficult labels, instead of annotating them, we leverage the calibrated model as the teacher network and teach the student network to handle difficult labels via uncertainty distillation. Our proposed uncertainty distillation teaches the student network which labels are highly uncertain through prediction distribution distillation, and locates the image regions that cause such uncertain predictions through uncertainty attention distillation. Through extensive evaluations on benchmark datasets, we demonstrate that our proposed uncertainty distillation is valuable for handling difficult labels in multi-label image classification.

Abstract:
Augmented Reality (AR) offers new capabilities for blurring the boundaries between physical reality and digital media. However, the integration of web content and AR remains underexplored. This paper presents an AR web browser with an integrated context-aware AR-to-Web content recommendation service, named the A2W browser, to provide continuous, user-centric web browsing experiences driven by AR headsets. We implement the A2W browser on an AR headset as our demonstration application, demonstrating the features and performance of the A2W framework. The A2W browser visualizes the AR-driven web content to the user, which is suggested by the content-based filtering model in our recommendation system. In our experiments, 20 participants using the adaptive UIs and recommendation system in the A2W browser achieve up to 30.69% time saving compared to smartphone conditions. Accordingly, A2W-supported web browsing on workstations, aided by the recommended information, reaches the target information 41.67% faster than typical web browsing.

Abstract:
Head pose estimation is a crucial problem that involves the prediction of the Euler angles of a human head in an image. Previous approaches predict head poses through landmark detection, which can be applied to multiple downstream tasks. However, previous landmark-based methods cannot achieve performance comparable to the current landmark-free methods because they do not model the complex nonlinear relationships between the geometric distribution of landmarks and head poses. Another reason for the performance bottleneck is that the underlying distribution of the 3D pose angles in the current head pose benchmarks is biased. In this work, we propose OsGG-Net, a One-step Graph Generation Network for estimating head poses from a single image by generating a landmark-connection graph to robustly model the 3D angle associated with the landmark distribution. To further ease the angle-bias issues caused by the biased data distribution in learning the graph structure, we propose the UnBiased Head Pose Dataset, called UBHPD, and a new unbiased metric, namely UBMAE, for unbiased head pose estimation. We conduct extensive experiments on various benchmarks and UBHPD, where our method achieves state-of-the-art results in terms of the commonly used MAE metric and our proposed UBMAE. Comprehensive ablation studies also demonstrate the effectiveness of each part of our approach.

Abstract:
The latest advances in full-reference image quality assessment (IQA) involve unifying structure and texture similarity based on deep representations. The resulting Deep Image Structure and Texture Similarity (DISTS) metric, however, makes rather global quality measurements, ignoring the fact that natural photographic images are locally structured and textured across space and scale. In this paper, we describe a locally adaptive structure and texture similarity index for full-reference IQA, which we term A-DISTS. Specifically, we rely on a single statistical feature, namely the dispersion index, to localize texture regions at different scales. The estimated probability (of one patch being texture) is in turn used to adaptively pool local structure and texture measurements. The resulting A-DISTS is adapted to local image content, and is free of expensive human perceptual scores for supervised training. We demonstrate the advantages of A-DISTS in terms of correlation with human data on ten IQA databases and optimization of single image super-resolution methods.
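
The dispersion index is the variance-to-mean ratio of a set of values. As a minimal sketch of how it can flag texture-like regions, the snippet below computes it over non-overlapping patches of a small 2D grid. Working on raw non-negative values, the patch layout, and the function names are assumptions made for illustration; the abstract does not specify which representations the index is computed on or how it is mapped to a texture probability.

```python
from statistics import mean, pvariance

def dispersion_index(values, eps=1e-12):
    """Variance-to-mean ratio of a sequence of non-negative values."""
    m = mean(values)
    return pvariance(values, m) / (m + eps)

def patch_dispersion(image, patch):
    """Dispersion index of each non-overlapping patch x patch block of a 2D list."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - patch + 1, patch):
        row_out = []
        for c in range(0, cols - patch + 1, patch):
            block = [image[r + i][c + j] for i in range(patch) for j in range(patch)]
            row_out.append(dispersion_index(block))
        out.append(row_out)
    return out

# A flat region next to a high-variance, texture-like region (toy values).
image = [
    [5, 5, 1, 9],
    [5, 5, 9, 1],
]
scores = patch_dispersion(image, 2)
```

Here the flat left patch scores zero while the alternating right patch scores well above zero, which is the kind of contrast a texture localizer could threshold or convert into a soft weight.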

Abstract:
Nowadays, cooking recipe sharing sites on the Web are widely used, and play a major role in everyday home cooking. Since cooking recipes consist of dish photos and recipe texts, cross-modal recipe search is being actively explored. To enable cross-modal search, both food image features and recipe text features are generally embedded into the same shared space. However, most existing studies assume a one-to-one correspondence between a recipe text and a dish image in the embedding space, although an unlimited number of photos with different serving styles and different plates can be associated with the same recipe. In this paper, we propose RDE-GAN (Recipe Disentangled Embedding GAN), which separates food image information into a recipe image feature and a non-recipe shape feature. In addition, we generate a food image by integrating both the recipe embedding and a shape feature. Since the proposed embedding is free from serving and plate styles, which are unrelated to cooking recipes, the experimental results showed that it outperformed the existing methods on cross-modal recipe search. We also confirmed that either the shape elements or the recipe elements alone can be changed at the time of food image generation.

Abstract:
Though convolutional neural networks are widely used in different tasks, lack of generalization capability in the absence of sufficient and representative data is one of the challenges that hinders their practical application. In this paper, we propose a simple, effective, and plug-and-play training strategy named Knowledge Distillation for Domain Generalization (KDDG), which is built upon a knowledge distillation framework with the gradient filter as a novel regularization term. We find that both the "richer dark knowledge" from the teacher network and the gradient filter we propose can reduce the difficulty of learning the mapping, which further improves the generalization ability of the model. We also conduct extensive experiments to show that our framework can significantly improve the generalization capability of deep neural networks in different tasks, including image classification, segmentation, and reinforcement learning, by comparing our method with existing state-of-the-art domain generalization techniques. Last but not least, we propose to adopt two metrics to analyze our proposed method in order to better understand how it benefits the generalization capability of deep neural networks.
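
For readers unfamiliar with the "dark knowledge" term, the snippet below sketches the standard soft-target distillation loss that knowledge distillation frameworks commonly build on: the KL divergence between temperature-softened teacher and student distributions. It is a generic illustration, not the KDDG objective; the gradient filter is not described in the abstract and is not shown, and the temperature value and function names are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Toy logits: the loss vanishes when the student matches the teacher.
matched = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mismatched = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

A higher temperature flattens the teacher distribution, so the relative scores of the non-target classes carry more weight in the loss.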

Abstract:
In the business domain, bundling is one of the most important marketing strategies for product promotion, and is commonly used by online e-commerce platforms and offline retailers. Existing recommender systems mostly focus on recommending individual items that users may be interested in, such as the considerable research work on collaborative filtering that directly models the interaction between users and items. In this paper, we target a practical but less explored recommendation problem named personalized bundle composition, which aims to offer an optimal bundle (i.e., a combination of items) to the target user. To tackle this specific recommendation problem, we formalize it as a combinatorial optimization problem on a set of candidate items and solve it within a neural combinatorial optimization framework. Extensive experiments on public datasets are conducted to demonstrate the superiority of the proposed method.

Abstract:
In 6D object pose estimation task, object models are usually available and represented as the point cloud set in canonical object frame, which are important references for estimating object poses to the camera frame. However, directly introducing object models as the prior knowledge (i.e., object model point cloud) will cause potential perturbations and even degenerate pose estimation performance. To make the most of object model priors and eliminate the problem, we present an end-to-end deep learning approach called the Geometric Constraint Co-attention Network (GCCN) for 6D object pose estimation. GCCN is designed to explicitly leverage the object model priors effectively with the co-attention mechanism. We add explicit geometric constraints to a co-attention module to inform the geometric correspondence relationships between points in the scene and object model priors and develop a novel geometric constraint loss to guide the training. In this manner, our method effectively eliminates the side effect of directly introducing the object model priors into the network. Experiments on the YCB-Video and LineMOD datasets demonstrate that our GCCN substantially improves the performance of pose estimation and is robust against heavy occlusions. We also demonstrate that GCCN is accurate and robust enough to be deployed in real-world robotic tasks.

Abstract:
One of the challenges of non-face-to-face communication is the absence of the haptic dimension. To solve this, a haptic communication system via the Internet has been proposed. The system has to be designed in such a way that it does not create discomfort during general use. The "Sync Glass" that we have developed transmits and presents the feeling of pouring a drink and making a toast accompanied by haptic, sound and visual effects. The device is designed to resemble a glass cup and, moreover, each action, including drinking and making a toast is performed in the customary way, making its use more acceptable to users. In the internal user demonstrations we performed, the experience has been reviewed with participants saying that "the feeling of pouring is so realistic", "so enjoyable!", and similar affirmative statements.

Abstract:
An increasing number of real-world applications are using graph-structured datasets, posing challenges to existing machine learning algorithms. Graph Convolutional Networks (GCNs) are deep learning models specifically designed to operate on graphs. One of the most tedious steps in training GCNs is the choice of the hyperparameters, especially since they exhibit unique properties compared to other neural models. Not only machine learning beginners but also experienced practitioners often have difficulty properly tuning their models. We hypothesize that a tool that visualizes the effect of hyperparameter choices on performance can accelerate model development and improve the understanding of these black-box models. Additionally, observing clusters of certain nodes helps to empirically understand how a given prediction was made due to the feature propagation step of GCNs. Therefore, this demo introduces GCNIllustrator - a web-based visual analytics tool for illustrating the effect of hyperparameters on the predictions in a citation graph.

Abstract:
In this paper, we present a disentangled representation learning and enhancement network (DRLE-Net) to address the challenging single image de-raining problems, i.e., raindrop and rain streak removal. Specifically, the DRLE-Net is formulated as a multi-task learning framework, and an elegant knowledge transfer strategy is designed to train the encoder of DRLE-Net to embed a rainy image into two separated latent spaces representing the task (clean image reconstruction in this paper) relevant and irrelevant variations respectively, such that only the essential task-relevant factors will be used by the decoder of DRLE-Net to generate high-quality de-raining results. Furthermore, visual attention information is modeled and fed into the disentangled representation learning network to enhance the task-relevant factor learning. To facilitate the optimization of the hierarchical network, a new adversarial loss formulation is proposed and used together with the reconstruction loss to train the proposed DRLE-Net. Extensive experiments are carried out for removing raindrops or rainstreaks from both synthetic and real rainy images, and DRLE-Net is demonstrated to produce significantly better results than state-of-the-art models.

Abstract:
Incremental learning of semantic segmentation has emerged as a promising strategy for visual scene interpretation in the open-world setting. However, it remains challenging to acquire novel classes in an online fashion for the segmentation task, mainly due to its continuously evolving semantic label space, partial pixelwise ground-truth annotations, and constrained data availability. To address this, we propose an incremental learning strategy that can quickly adapt deep segmentation models without catastrophic forgetting, using streaming input data with pixel annotations on the novel classes only. To this end, we develop a unified learning strategy based on the Expectation-Maximization (EM) framework, which integrates an iterative relabeling strategy that fills in the missing labels and a rehearsal-based incremental learning step that balances the stability and plasticity of the model. Moreover, our EM algorithm adopts an adaptive sampling method to select informative training data and a class-balancing training strategy in the incremental model updates, both improving the efficacy of model learning. We validate our approach on the PASCAL VOC 2012 and ADE20K datasets, and the results demonstrate its superior performance over existing incremental methods.

Abstract:
Deep learning has achieved notable performance in both the denoising of low-quality medical images and the detection of lesions. However, existing low-quality medical image denoising approaches are disconnected from the lesion detection task. Intuitively, the quality of denoised images will influence the lesion detection accuracy, which in turn can be used to improve the denoising performance. To this end, we propose a plug-and-play medical image denoising framework, namely Lesion-Inspired Denoising Network (LIDnet), to collaboratively improve both the denoising performance and the detection accuracy on denoised medical images. Specifically, we propose to insert the feedback of the downstream detection task into an existing denoising framework by jointly learning a multi-loss objective. Instead of using a perceptual loss calculated on the entire feature map, a novel region-of-interest (ROI) perceptual loss induced by the lesion detection task is proposed to further connect these two tasks. To achieve better optimization of the overall framework, we propose a customized collaborative training strategy for LIDnet. In consideration of clinical usability and imaging characteristics, three low-dose CT image datasets are used to evaluate the effectiveness of the proposed LIDnet. Experiments show that, when equipped with LIDnet, both the denoising and lesion detection performance of baseline methods can be significantly improved.

Abstract:
In this paper, we focus on the composed query image retrieval task, namely retrieving the target images that are similar to a composed query, in which a modification text is combined with a query image to describe a user's precise search intention. Previous methods usually focus on learning joint image-text representations, but rarely consider the intrinsic relationship among the query image, the target image, and the modification text. To address this problem, we propose a new cross-modal joint prediction and alignment framework for composed query image retrieval. In our framework, the modification text is regarded as an implicit transformation between the query image and the target image. Motivated by this, not only should the combination of the query image and modification text be similar to the target image, but the modification text should also be predictable from the query image and the target image. We align this relationship with a novel Joint Prediction Module (JPM). Our proposed framework can seamlessly incorporate the JPM into existing methods to effectively improve the discrimination and robustness of visual and textual representations. The experiments on three public datasets demonstrate the effectiveness of our proposed framework, showing that our proposed JPM can be easily incorporated into existing methods while effectively improving their performance.

Abstract:
In computer-assisted vascular surgery, real-time multi-instrument segmentation serves as a prerequisite step. However, to this day, most effort in computer-assisted intervention research has been dedicated to single-instrument rather than multi-instrument segmentation. To fill this overlooked gap, this study introduces a Light-Weight Deep Feature Refinement Network (DFR-Net) based on multi-task learning for real-time multi-instrument segmentation. In this network, the proposed feature refinement module (FRM) can capture long-term dependencies while retaining precise positional information, which helps the model locate the foreground objects of interest. The designed channel calibration module (CCM) can re-calibrate the fusion weights of multi-level features, which helps the model balance the importance of semantic information and appearance information. Besides, a connectivity loss function is developed to address fractures in the segmentation results of wire-like structures. Extensive experiments on two different types of datasets consistently demonstrate that DFR-Net can achieve state-of-the-art segmentation performance while meeting real-time requirements.

Abstract:
Recognizing the emotional state of people is a basic but challenging task in video understanding. In this paper, we propose a new task in this field, named Pairwise Emotional Relationship Recognition (PERR). This task aims to recognize the emotional relationship between the two interactive characters in a given video clip. It is different from the traditional emotion and social relation recognition task. Varieties of information, consisting of character appearance, behaviors, facial emotions, dialogues, background music as well as subtitles contribute differently to the final results, which makes the task more challenging but meaningful in developing more advanced multi-modal models. To facilitate the task, we develop a new dataset called Emotional RelAtionship of inTeractiOn (ERATO) based on dramas and movies. ERATO is a large-scale multi-modal dataset for PERR task, which has 31,182 video clips, lasting about 203 video hours. Different from the existing datasets, ERATO contains interaction-centric videos with multi-shots, varied video length, and multiple modalities including visual, audio and text. As a minor contribution, we propose a baseline model composed of Synchronous Modal-Temporal Attention (SMTA) unit to fuse the multi-modal information for the PERR task. In contrast to other prevailing attention mechanisms, our proposed SMTA can steadily improve the performance by about 1%. We expect the ERATO as well as our proposed SMTA to open up a new way for PERR task in video understanding and further improve the research of multi-modal fusion methodology.

Abstract:
Cycling is on the rise as a relevant alternative to car-based mobility, and even though there are mobile applications specifically designed for cyclists to support this development, many still face unresolved challenges in terms of safe user interaction with complex data while riding. We present the design, development, and evaluation of VeloCity - an application for reporting traffic incidents and structures relevant to cyclists. In a case study, we compared its three input methods (touch, in-app speech recognition, the voice assistant of the operating system) to evaluate which attributes make for safe interaction while cycling. We found that participants prefer to use the voice assistant over the other modalities as it was least distracting due to its hands- and eyes-free interaction design. Furthermore, they chose short commands over conversational phrases. Based on our results, we present five guidelines for designing voice user interfaces for cyclists and argue for moving away from touch-based interfaces in this domain, which still make up most of the applied interaction techniques today.

Abstract:
Perceptual encryption is an efficient way of protecting image content by selectively encrypting only a portion of the significant data in plain images. Existing security analysis of perceptual encryption usually resorts to traditional cryptanalysis techniques, which require heavy manual work and strict prior knowledge of encryption schemes. In this paper, we introduce a new end-to-end method of analyzing the visual security of perceptually encrypted images, without any manual work or prior knowledge of the encryption scheme. Specifically, by leveraging convolutional neural networks (CNNs), we propose a progressive recovery network (PRNet) to recover visual content from perceptually encrypted images. Our PRNet is a stack of several dense attention recovery blocks (DARBs), where each DARB contains two branches: a feature extraction branch and an image recovery branch. These two branches cooperate to recover more detailed visual information and generate efficient feature representations via a densely connected structure and a dual-saliency mechanism. We conduct extensive experiments to demonstrate that PRNet works on different perceptual encryption schemes with different settings, and the results show that PRNet significantly outperforms the state-of-the-art CNN-based image restoration methods.

Abstract:
In recent years, virtual makeup applications have become more and more popular. However, it is still challenging to propose a robust makeup transfer method in the real-world environment. Current makeup transfer methods mostly work well on good-conditioned clean makeup images, but transferring makeup that exhibits shadow and occlusion is not satisfying. To alleviate it, we propose a novel makeup transfer method, called 3D-Aware Shadow and Occlusion Robust GAN (SOGAN). Given the source and the reference faces, we first fit a 3D face model and then disentangle the faces into shape and texture. In the texture branch, we map the texture to the UV space and design a UV texture generator to transfer the makeup. Since human faces are symmetrical in the UV space, we can conveniently remove the undesired shadow and occlusion from the reference image by carefully designing a Flip Attention Module (FAM). After obtaining cleaner makeup features from the reference image, a Makeup Transfer Module (MTM) is introduced to perform accurate makeup transfer. The qualitative and quantitative experiments demonstrate that our SOGAN not only achieves superior results in shadow and occlusion situations but also performs well in large pose and expression variations.

Abstract:
We propose ArtSAGENet, a novel multimodal architecture that integrates Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), to jointly learn visual and semantic-based artistic representations. First, we illustrate the significant advantages of multi-task learning for fine art analysis and argue that it is conceptually a much more appropriate setting in the fine art domain than the single-task alternatives. We further demonstrate that several GNN architectures can outperform strong CNN baselines in a range of fine art analysis tasks, such as style classification, artist attribution, creation period estimation, and tag prediction, while training them requires an order of magnitude less computational time and only a small amount of labeled data. Finally, through extensive experimentation we show that our proposed ArtSAGENet captures and encodes valuable relational dependencies between the artists and the artworks, surpassing the performance of traditional methods that rely solely on the analysis of visual content. Our findings underline a great potential of integrating visual content and semantics for fine art analysis and curation.

Abstract:
High-fidelity multi-singer singing voice synthesis is challenging for neural vocoders due to the shortage of singing voice data, limited singer generalization, and large computational cost. Existing open corpora cannot meet the requirements of high-fidelity singing voice synthesis because of their weaknesses in scale and quality. Previous vocoders have difficulty in multi-singer modeling, and a distinct degradation emerges when generating singing voices for unseen singers. To accelerate singing voice research in the community, we release a large-scale, multi-singer Chinese singing voice dataset, OpenSinger. To tackle the difficulty in unseen singer modeling, we propose Multi-Singer, a fast multi-singer vocoder with generative adversarial networks. Specifically, 1) Multi-Singer uses a multi-band generator to speed up both the training and inference procedures; 2) to capture and rebuild singer identity from the acoustic feature (i.e., mel-spectrogram), Multi-Singer adopts a singer conditional discriminator and a conditional adversarial training objective; 3) to supervise the reconstruction of singer identity in the spectrum envelopes in the frequency domain, we propose an auxiliary singer perceptual loss. The joint training approach works effectively in GANs for multi-singer voice modeling. Experimental results verify the effectiveness of OpenSinger and show that Multi-Singer improves unseen-singer singing voice modeling in both speed and quality over previous methods. A further experiment shows that, combined with FastSpeech 2 as the acoustic model, Multi-Singer achieves strong robustness in the multi-singer singing voice synthesis pipeline.

Abstract:
Point cloud video provides a more immersive holographic virtual experience than conventional video services such as 360-degree video and virtual reality (VR) video. However, existing network bandwidth and transmission technology cannot carry real-time point cloud video streaming due to its massive data volume, high processing overhead, and extreme bandwidth consumption. Unlike previous approaches that extend VR video streaming, we propose AITransfer, an AI-powered bandwidth-aware and adaptive transmission technique driven by extracting and transferring key point cloud features to reduce the bandwidth consumption and alleviate the computational pressure. AITransfer makes two main contributions: (1) incorporating the dynamic network bandwidth into the design of an end-to-end architecture with two fundamental components, feature extraction and reconstruction, and (2) employing an online adapter to sense the network bandwidth and match the optimal inference model. We conduct extensive experiments on a typical dataset and develop a case study to demonstrate the efficiency and effectiveness. The results show that AITransfer can provide a compression ratio of more than 30.72 times under existing network environments.

Abstract:
The tile-based approach is widely adopted in adaptive 360-degree video streaming systems. Existing QoE-driven streaming approaches usually select tiles and adjust the bitrate based on viewport prediction with a fixed tiling, and thus fail to consider unstable prediction performance. However, varying the tiling of the video can produce different numbers of tiles with different sizes, and thus can have distinct impacts on error tolerance for viewport prediction and on decoding complexity for resource-constrained mobile clients. In this work, we introduce adaptive tiling into the conventional bitrate adaptation for mobile 360-degree video streaming. We first analyze the impacts of tilings on tile selection and decoding time, which verifies the benefit of tiling adaptation in various practical aspects. We then formulate the QoE optimization problem for adaptive tiling and bitrate streaming and discuss the design details of our adaptation algorithm, which can adapt to the performance of viewport prediction and the decoding capabilities of mobile clients in addition to the conventional influencing factors. Finally, the superiority of our proposed approach over state-of-the-art methods is evaluated through extensive trace-driven simulations.

Abstract:
Layout representation, which models visual elements and their inter-relations in a canvas, plays a crucial role in graphic design intelligence. With a large variety of layout designs and the unique characteristic of layouts that visual elements are defined as a list of categorical (e.g., type) and numerical (e.g., position and size) properties, it is challenging to learn general and compact representations with limited data. Inspired by the recent success of self-supervised pre-training techniques in various natural language processing tasks, in this paper, we propose CanvasEmb (Canvas Embedding), which pre-trains deep representations from unlabeled graphic designs by jointly conditioning on all the context elements in a canvas, with a multi-dimensional feature encoder and a multi-task learning objective. The pre-trained CanvasEmb model can be fine-tuned with just one additional output layer and with a small size of training data to create models for a wide range of downstream tasks. We verify our approach with presentation slides data. We construct a large-scale dataset with more than one million slides and propose two layout understanding tasks with human-labeled sets, namely element role labeling and image captioning. Evaluation results on these two tasks show that our model with fine-tuning achieves state-of-the-art performance. Furthermore, we conduct a deep analysis aiming to understand the modeling mechanism of CanvasEmb and demonstrate its great potential with two extended applications: layout auto completion and layout retrieval.

Abstract:
Recently, GAN-based methods have demonstrated strong effectiveness in generating augmentation data for person re-identification (ReID), on account of their ability to bridge the gap between domains and enrich the data variety in feature space. However, most ReID works pick all the GAN-generated data as additional training samples or evaluate the quality of GAN generation at the level of the entire data set, ignoring the essential image-level features of data in the ReID task. In this paper, we analyze the in-depth characteristics of ReID samples and address the question of "What makes a GAN-generated image good for ReID". Specifically, we propose to examine each data sample with id-consistency and diversity constraints by mapping the image onto different spaces. With a metric-based sampling method, we demonstrate that not all GAN-generated data is beneficial for augmentation. Models trained with data filtered by our quality evaluation outperform those trained with the full augmentation set by a large margin. Extensive experiments show the effectiveness of our method on both the supervised ReID task and the unsupervised domain adaptation ReID task.

Abstract:
In this work, we study the problem of separating the global camera motion and the local dynamic motion from an optical flow. Previous methods either estimate global motions by a parametric model, such as a homography, or estimate both of them by an optical flow field. However, none of these methods can directly estimate global and local motions in an end-to-end manner. In addition, separating the two motions accurately from a hybrid flow field is challenging, because one motion can easily confuse the estimate of the other when they are compounded together. To this end, we propose GLM-Net, an end-to-end global and local motion estimation network. We design two encoder-decoder structures for the motion separation in the optical flow based on different task orientations. One structure adopts a mask autoencoder to extract the global motion, while the other uses an attention U-Net for the local motion refinement. We further design two effective training methods to overcome the lack of supervision. We apply our method to the action recognition datasets NCAA and UCF-101 to verify the accuracy of the local motion, and to the homography estimation dataset DHE to verify the accuracy of the global motion. Experimental results show that our method can achieve competitive performance in both tasks at the same time, validating the effectiveness of the motion separation.

Abstract:
Most existing cluster-based cross-domain person re-identification (re-id) methods only pre-train the re-id model on the source domain. Unfortunately, the pre-trained model may not perform well on the target domain due to the large gap between the source and target domains, which is harmful to the subsequent optimization. In this paper, we propose a novel Self-supervised Pre-training method on the Target Domain (SPTD), which pre-trains the model on both the source and target domains in a self-supervised manner. Specifically, SPTD uses different kinds of data augmentation to simulate different intra-class changes and constrains the consistency between the augmented data distribution and the original data distribution. As a result, the pre-trained model captures discriminative knowledge specific to the target domain and benefits the subsequent optimization. The proposed SPTD is easy to combine with other cluster-based cross-domain re-id methods, simply by replacing the original pre-trained model with ours. Comprehensive experiments on three widely used datasets, i.e., Market1501, DukeMTMC-ReID and MSMT17, demonstrate the effectiveness of SPTD. In particular, the final results surpass previous state-of-the-art methods by a large margin.
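The consistency constraint between augmented and original predictions can be illustrated with a minimal sketch. The abstract does not specify the divergence used, so the KL divergence below, along with the function names `softmax`, `kl_divergence` and `consistency_loss`, is an assumption made purely for illustration and is not the authors' implementation.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def consistency_loss(orig_logits: Sequence[float], aug_logits: Sequence[float]) -> float:
    """Assumed consistency term: divergence between the prediction on the
    original image and the prediction on its augmented copy."""
    return kl_divergence(softmax(orig_logits), softmax(aug_logits))
```

In this sketch the loss is zero when an augmentation leaves the prediction unchanged and grows as the two predictions drift apart, which is the behaviour a consistency constraint is meant to penalize.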

Abstract:
We propose an end-to-end solution for recognizing fingerspelling using multi-scale attention with fixed-queries. Fingerspelling recognition in the wild gets challenging because of the multiple sub-problems involved - detecting the signing hand, tracking it across frames, and recognizing subtle variations in a hand gesture. While the current state-of-the-art handles these with external face/hand detectors, optical flow features, and iteratively refining the attention maps, our work proposes a deep learning model that takes in the RGB videos and recognizes fingerspelling with a single forward pass. Without any frame-level supervision, our proposed model learns to pay attention to informative regions in each frame, such as fingers, hand, and face, to recognize signs. Multi-scale features from these attended regions are then processed using a recurrent neural network to recognize the alphabet sequentially. We train our model using a curriculum learning strategy with simpler samples at the beginning, followed by challenging samples at a later stage. We have evaluated our approach on Chicago Fingerspelling Wild and WildPlus datasets and have achieved about 8% and 4% improvements, respectively, compared to the current state-of-the-art methods. Further analysis of our method shows that our attention mechanism is intuitive from a human perspective, and visualizing it offers useful insights into the working of the model.

Abstract:
Deep learning-based models have achieved unprecedented performance in single image super-resolution (SISR). However, existing deep learning-based models usually require high computational complexity to generate high-quality images, which limits their application on edge devices, e.g., mobile phones. To address this issue, we propose a dynamic, channel-agnostic filtering method in this paper. The proposed method not only adaptively generates convolutional kernels based on the local information at each position, but also significantly reduces the cost of computing the inter-channel redundancy. Based on this, we further propose a simple yet effective deep lightweight model for SISR. Experimental results show that our proposed model outperforms other state-of-the-art deep lightweight SISR models, leading to the best trade-off between performance and the number of model parameters.
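The idea of a dynamic, channel-agnostic filter can be sketched as follows: one kernel is generated per spatial position from local content, and that same kernel is applied to every channel. The kernel generator here (a softmax over similarity of channel-averaged intensities) is a hypothetical stand-in chosen for simplicity; the abstract does not describe the actual generator, and all names are illustrative.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def channel_mean(x: FeatureMap, r: int, c: int) -> float:
    """Average over channels at one spatial position."""
    return sum(ch[r][c] for ch in x) / len(x)


def dynamic_filter(x: FeatureMap, k: int = 3, temperature: float = 1.0) -> FeatureMap:
    """Apply a position-specific k x k kernel that is shared by all channels."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    pad = k // 2
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for r in range(H):
        for c in range(W):
            center = channel_mean(x, r, c)
            offsets, logits = [], []
            # Build one kernel for this position from its local neighbourhood.
            for dr in range(-pad, pad + 1):
                for dc in range(-pad, pad + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < H and 0 <= cc < W:
                        offsets.append((rr, cc))
                        diff = abs(channel_mean(x, rr, cc) - center)
                        logits.append(-diff / temperature)
            m = max(logits)
            exps = [math.exp(v - m) for v in logits]
            total = sum(exps)
            weights = [e / total for e in exps]
            # Channel-agnostic step: reuse the same weights for every channel.
            for ch in range(C):
                out[ch][r][c] = sum(
                    wt * x[ch][rr][cc] for wt, (rr, cc) in zip(weights, offsets)
                )
    return out
```

Because the kernel has k x k weights per position rather than k x k per position per channel, the number of generated weights does not grow with the channel count, which is the cost saving the sketch is meant to make visible.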

Abstract:
In this paper, a novel quality caption model is developed to assess image quality with hierarchical semantics. Existing image quality assessment (IQA) methods usually represent image quality with a quantitative value, resulting in inconsistency with human cognition. Generally, human beings are good at perceiving image quality in terms of semantic description rather than quantitative value. Moreover, cognition is a needs-oriented task where hierarchical semantics are extracted. The mediocre quality value fails to reflect degradations on hierarchical semantics. Therefore, a new IQA framework is proposed to describe quality for needs-oriented cognition. A novel quality caption procedure is first introduced, in which quality is represented as patterns of activation distributed across the diverse degradations on hierarchical semantics. Then, an attentive and recurrent semantic attractor network (ARSANet) is designed to activate the distributed patterns for image quality description. Experiments demonstrate that our method achieves superior performance and is highly consistent with human cognition.

Abstract:
Image captioning aims to generate a sentence consisting of sequential linguistic words to describe visual units (i.e., objects, relationships, and attributes) in a given image. Most existing methods rely on the prevalent supervised learning with a cross-entropy (XE) function to transfer visual units into a sequence of linguistic words. However, we argue that the XE objective is not sensitive to visual-linguistic alignment, so it cannot discriminately penalize semantic inconsistency or shrink the context gap. To solve these problems, we propose the Triangle-Reward Reinforcement Learning (TRRL) method. TRRL uses the scene graph (G), with objects as nodes and relationships as edges, to represent images, generated sentences, and ground truth sentences individually, and mutually aligns them during the training process. Specifically, TRRL formulates image captioning as a problem of cooperative agents, where the first agent aims to extract the visual scene graph (Gimg) from the image (I) and the second agent translates this graph into a sentence (S). To discriminately penalize visual-linguistic inconsistency, TRRL proposes a novel triangle-reward function: 1) the generated sentence and its corresponding ground truth are decomposed into the linguistic scene graph (Gsen) and the ground-truth scene graph (Ggt), respectively; 2) Gimg, Gsen, and Ggt are paired to calculate semantic similarity scores, which are proportionally assigned to reward each agent. Meanwhile, to make the training objective sensitive to context changes, we propose node-level and triplet-level scoring methods to jointly measure the visual-linguistic graph correlations. Extensive experiments on the MSCOCO dataset demonstrate the superiority of TRRL. Additional ablation studies further validate its effectiveness.
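The pairing of Gimg, Gsen and Ggt with node-level and triplet-level scores can be sketched as below. The abstract does not give the scoring formulas, so the F1-style set overlap, the equal weighting of the two levels, and every function name here are assumptions for illustration only.

```python
from typing import Dict, Set, Tuple

Triplet = Tuple[str, str, str]  # (subject, relationship, object)


def f1_overlap(a: Set, b: Set) -> float:
    """F1-style overlap between two sets; 0.0 when they share nothing."""
    common = len(a & b)
    if common == 0:
        return 0.0
    precision = common / len(a)
    recall = common / len(b)
    return 2 * precision * recall / (precision + recall)


def nodes(graph: Set[Triplet]) -> Set[str]:
    """Collect the object nodes that appear in any triplet."""
    out: Set[str] = set()
    for subj, _, obj in graph:
        out.add(subj)
        out.add(obj)
    return out


def graph_similarity(g1: Set[Triplet], g2: Set[Triplet], node_weight: float = 0.5) -> float:
    """Blend a node-level score with a triplet-level score (assumed weighting)."""
    node_score = f1_overlap(nodes(g1), nodes(g2))
    triplet_score = f1_overlap(g1, g2)
    return node_weight * node_score + (1.0 - node_weight) * triplet_score


def triangle_scores(g_img: Set[Triplet], g_sen: Set[Triplet], g_gt: Set[Triplet]) -> Dict[str, float]:
    """Score each edge of the triangle formed by the three scene graphs."""
    return {
        "img_sen": graph_similarity(g_img, g_sen),
        "sen_gt": graph_similarity(g_sen, g_gt),
        "img_gt": graph_similarity(g_img, g_gt),
    }
```

A graph that names the right objects but misses a relationship still earns partial credit from the node-level term, while the triplet-level term only rewards fully matching (subject, relationship, object) structures.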

Abstract:
Occluded person re-identification (ReID) aims to match person images with occlusion. It is fundamentally challenging because serious occlusion aggravates the misalignment problem between images. At the cost of incorporating a pose estimator, many works introduce pose information to alleviate the misalignment in both training and testing. To achieve high accuracy while preserving low inference complexity, we propose a network named Pose-Guided Feature Learning with Knowledge Distillation (PGFL-KD), where the pose information is exploited to regularize the learning of semantics-aligned features but is discarded in testing. PGFL-KD consists of a main branch (MB) and two pose-guided branches, i.e., a foreground-enhanced branch (FEB) and a body part semantics aligned branch (SAB). The FEB is intended to emphasize the features of visible body parts while excluding the interference of obstructions and background (i.e., foreground feature alignment). The SAB encourages different channel groups to focus on different body parts so as to obtain a body part semantics aligned representation. To remove the dependency on pose information when testing, we regularize the MB to learn the merits of the FEB and SAB through knowledge distillation and interaction-based training. Extensive experiments on occluded, partial, and holistic ReID tasks show the effectiveness of our proposed network.

Abstract:
Generating "bullet-time" effects of human free-viewpoint videos is critical for immersive visual effects and VR/AR experience. Recent neural advances still lack the controllable and interactive bullet-time design ability for human free-viewpoint rendering, especially under the real-time, dynamic and general setting for our trajectory-aware task. To fill this gap, in this paper we propose a neural interactive bullet-time generator (iButter) for photo-realistic human free-viewpoint rendering from dense RGB streams, which enables flexible and interactive design for human bullet-time visual effects. Our iButter approach consists of a real-time preview and design stage as well as a trajectory-aware refinement stage. During preview, we propose an interactive bullet-time design approach by extending NeRF rendering to a real-time and dynamic setting and removing the tedious per-scene training. To this end, our bullet-time design stage utilizes a hybrid training set, a light-weight network design and an efficient silhouette-based sampling strategy. During refinement, we introduce an efficient trajectory-aware scheme that completes within 20 minutes and jointly encodes the spatial and temporal consistency and semantic cues along the designed trajectory, achieving a photo-realistic bullet-time viewing experience of human activities. Extensive experiments demonstrate the effectiveness of our approach for convenient interactive bullet-time design and photo-realistic human free-viewpoint video generation.

Abstract:
Multi-view logo classification is a challenging task because logo images are misaligned across views as the viewpoint changes, and because logo appearance exhibits large intra-class and small inter-class variation. Cross-view data can represent objects from different views and thus provide complementary information for data analysis. However, most existing multi-view algorithms maximize the correlation between different views for consistency. Those methods ignore the interaction among different views and may cause semantic bias during the process of common feature learning. In this paper, we apply the information bottleneck (IB) to multi-view learning to extract the features of one category that are common across views, and name the resulting model the Dual-View Information Bottleneck representation (Dual-view IB). To the best of our knowledge, this is the first cross-view learning method for logo classification. Specifically, we maximize the mutual information between the representations of the two views to preserve the key features for the classification task, while eliminating the redundant information that is not shared between the two views. In addition, due to the imbalance of samples and limited computing resources, we further introduce a novel Pair Batch Data Augmentation (PB) algorithm for the Dual-view IB model, which applies augmentations from a learned policy based on replicated instances of two samples within the same batch. Comprehensive experiments on three existing benchmark datasets demonstrate the effectiveness of the proposed method, which outperforms state-of-the-art methods. The proposed method is expected to further the development of cross-view representation learning.

Abstract:
Recent years have witnessed the boom of online video platforms. Along this line, a graph illustrating the social relations among characters has long been expected, not only to help audiences better understand the story, but also to support fine-grained video analysis tasks in a semantic way. Unfortunately, though we humans can easily infer the social relations among characters, it is still an extremely challenging task for intelligent systems to automatically capture social relations by absorbing multi-modal cues. Besides, they fail to describe the relations among multiple characters from a graph-generation perspective. To that end, inspired by the human ability to infer social relationships, we propose a novel Hierarchical-Cumulative Graph Convolutional Network (HC-GCN) to generate the social relation graph for multiple characters in a video. Specifically, we first integrate the short-term multi-modal cues, including visual, textual and audio information, to generate frame-level graphs for a subset of the characters via a multimodal graph convolution technique. For the video-level aggregation task, we design an end-to-end framework to aggregate all frame-level subgraphs along the temporal trajectory, which results in a global video-level social graph with various social relationships among multiple characters. Extensive validations on two real-world large-scale datasets demonstrate the effectiveness of our proposed method compared with SOTA baselines.

Abstract:
The ACM Multimedia 2021 Video Relation Understanding Challenge is the third grand challenge that aims at exploring the relationships between subjects and objects appearing in videos for fine-grained and high-level video understanding. Given a video, the video relation detection model should output a series of relation triplets (subject, predicate, object) and the corresponding trajectories of the subject and object. The goal of this task is to promote research on developing video semantic understanding models, so as to perform complex inference and mining of visual knowledge in videos. In this paper, we give a comprehensive and detailed introduction to this task, summarize the algorithms proposed in the last few years, and propose future directions for research on this task.

Abstract:
This paper proposes an automatic micro-expression spotting method with high accuracy and high robustness. Because micro-expressions have small amplitude and short duration, accurately capturing their subtle movements is a complex problem. The optical flow method is applied to estimate the motion trend of the facial regions. Because head shaking is a major cause of the high false-positive rate in micro-expression spotting, a reliable face alignment method becomes crucial. According to the optical flow of the nose tip region, the cropping box is adjusted several times to keep the relative position between the face and the cropping box stable. On this basis, the optical flow features from 14 regions of interest on the face are used to build a feature matrix, and a wave peak location technique is proposed to accurately locate the moment when the micro-expression occurs on the time-domain curve of the features. The experimental results on the CAS(ME)2-cropped and SAMM Long Videos datasets show that our method performs significantly better than the baseline method and has high application value in various scenarios.
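Locating wave peaks on a time-domain feature curve can be sketched with a simple thresholded local-maximum search. The abstract does not describe the authors' peak location procedure, so the threshold rule (mean plus k standard deviations), the minimum gap between peaks, and the name `spot_peaks` are all assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def spot_peaks(curve: Sequence[float], k: float = 1.0, min_gap: int = 3) -> List[int]:
    """Return frame indices of local maxima that exceed mean + k * std.

    When two candidate peaks are closer than min_gap frames, only the
    higher one is kept.
    """
    if len(curve) < 3:
        return []
    threshold = mean(curve) + k * pstdev(curve)
    peaks: List[int] = []
    for i in range(1, len(curve) - 1):
        is_local_max = curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]
        if not (is_local_max and curve[i] > threshold):
            continue
        if peaks and i - peaks[-1] < min_gap:
            if curve[i] > curve[peaks[-1]]:
                peaks[-1] = i  # replace the nearby lower peak
        else:
            peaks.append(i)
    return peaks
```

Here `curve` would hold one motion magnitude per frame, for example an aggregate of the optical flow features over the facial regions of interest; the returned indices are the candidate frames where a micro-expression peaks.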

Abstract:
In this work we address the Next Speaker Prediction sub-challenge of the ACM '21 MultiMediate Grand Challenge. This challenge poses the problem of turn-taking prediction in physically situated multiparty interaction. Solving this problem is essential for enabling fluent real-time multiparty human-machine interaction. This problem is made more difficult by the need for a robust solution that can perform effectively across a wide variety of settings and contexts. Prior work has shown that current state-of-the-art methods rely on machine learning approaches that do not generalize well to new settings and feature distributions. To address this problem, we propose the use of group-level focus of visual attention as additional information. We show that a simple combination of group-level focus of visual attention features and publicly available audio-video synchronizer models is competitive with state-of-the-art methods fine-tuned for the challenge dataset.

Abstract:
The natural language processing community has had a major interest in auto-regressive [4, 13] and span-prediction based language models [7] recently, while knowledge graphs are often referenced for common-sense based reasoning and fact-checking models. In this paper, we present an equivalence representation of span-prediction based language models and knowledge-graphs to better leverage recent developments of language modelling for multi-modal problem statements. Our method performed well, especially with sentiment understanding for multi-modal inputs, and discovered potential bias in naturally occurring videos when compared with movie-data interaction-understanding. We also release a dataset of an auto-generated questionnaire with ground-truths consisting of labels spanning across 120 relationships, 99 sentiments, and 116 interactions, among other labels for finer-grained analysis of model comparisons in the community.

Abstract:
There has been an increasing emphasis on building large-scale datasets as the driver of deep learning-based trackers' success. However, accurately annotating tracking data is highly labor-intensive and expensive, making it infeasible in real-world applications. In this study, we investigate whether large-scale training data is necessary to ensure the performance of tracking algorithms. To this end, we introduce a FAT (Few-Annotation Tracking) benchmark constructed by sampling one or a few frames per video from existing tracking datasets. The proposed dataset can be used to evaluate the data efficiency of tracking algorithms and new data augmentation approaches for object tracking. We further present AMMC (Augmentation by Mimicking Motion Change), a data augmentation strategy that enables learning high-performing trackers from small-scale datasets. AMMC first cuts out the tracked targets and performs a sequence of transformations to simulate the possible changes caused by object motion. Then the transformed targets are pasted onto the inpainted background images, and the results are further jointly augmented to mimic the variability caused by camera motion. Compared with standard augmentation methods, AMMC explicitly considers the characteristics of tracking data and thus synthesizes more valid data for object tracking. We extensively evaluate our approach with two popular trackers on the FAT datasets. Experiments show that our method allows these trackers, even when trained on a dataset requiring much less annotation, to achieve performance comparable to or even better than that obtained on the full-annotation dataset. The results imply that complete video annotation might not be necessary for object tracking if motion-driven data augmentations are leveraged during training.
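The cut, transform and paste steps described above can be sketched on toy images stored as nested lists. This is a simplified illustration, not the AMMC implementation: inpainting is assumed to have been done already, the only transformations are a shift and a nearest-neighbour rescale, and all names (`crop`, `resize_nearest`, `paste`, `mimic_motion`) are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]
Box = Tuple[int, int, int, int]  # (top, left, height, width)


def crop(img: Image, box: Box) -> Image:
    """Cut the target region out of the image."""
    t, l, h, w = box
    return [row[l:l + w] for row in img[t:t + h]]


def resize_nearest(patch: Image, new_h: int, new_w: int) -> Image:
    """Nearest-neighbour resize, standing in for a scale change."""
    h, w = len(patch), len(patch[0])
    return [
        [patch[r * h // new_h][c * w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def paste(background: Image, patch: Image, top: int, left: int) -> Image:
    """Paste the patch onto a copy of the background, clipping at borders."""
    out = [row[:] for row in background]
    H, W = len(out), len(out[0])
    for r, row in enumerate(patch):
        for c, v in enumerate(row):
            y, x = top + r, left + c
            if 0 <= y < H and 0 <= x < W:
                out[y][x] = v
    return out


def mimic_motion(img: Image, background: Image, box: Box,
                 shift: Tuple[int, int] = (0, 0),
                 scale: float = 1.0) -> Tuple[Image, Box]:
    """Cut the target, rescale and shift it, then paste it on the background.

    Returns the synthesized frame and the (unclipped) new bounding box.
    """
    t, l, h, w = box
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    patch = resize_nearest(crop(img, box), new_h, new_w)
    new_t, new_l = t + shift[0], l + shift[1]
    return paste(background, patch, new_t, new_l), (new_t, new_l, new_h, new_w)
```

Since the transformation parameters are known, the new bounding box comes for free, so each synthesized frame is a labelled training pair without extra annotation.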

Abstract:
Salient object detection (SOD) has made great progress, but most existing SOD methods focus more on performance than on efficiency. Besides, the U-shape structure has some drawbacks, and there is still considerable room for improvement. Therefore, we propose a novel framework that treats semantic context, spatial detail and boundary information separately in the decoder. Specifically, we propose an efficient and effective Complementary Trilateral Decoder (CTD) for saliency detection with three branches: Semantic Path, Spatial Path and Boundary Path. These three branches are designed to address the dilution of semantic information, the loss of spatial information and the absence of boundary information, respectively. The three branches are complementary to each other, and we design three distinctive fusion modules to gradually merge them according to a "coarse-fine-finer" strategy, which significantly improves region accuracy and boundary quality. To facilitate practical application in different environments, we provide two versions: CTDNet-18 (11.82M, 180FPS) and CTDNet-50 (24.63M, 110FPS). Experiments show that our model performs better than state-of-the-art approaches on five benchmarks and achieves a favorable balance between speed and accuracy.

Abstract:
The RGB-infrared cross-modality person re-identification (ReID) task aims to recognize images of the same identity across the visible modality and the infrared modality. Existing methods mainly use a two-stream architecture to eliminate the discrepancy between the two modalities in the final common feature space, which ignores the single space of each modality in the shallow layers. To address this, we present a novel multi-feature space joint optimization (MSO) network, which can learn modality-sharable features in both the single-modality space and the common space. Firstly, based on the observation that edge information is modality-invariant, we propose an edge features enhancement module to enhance the modality-sharable features in each single-modality space. Specifically, we design a perceptual edge features (PEF) loss after the edge fusion strategy analysis. To the best of our knowledge, this is the first work that proposes explicit optimization in the single-modality feature space for the cross-modality ReID task. Moreover, to increase the difference between cross-modality distance and class distance, we introduce a novel cross-modality contrastive-center (CMCC) loss into the modality-joint constraints in the common feature space. The PEF loss and CMCC loss jointly optimize the model in an end-to-end manner, which markedly improves the network's performance. Extensive experiments demonstrate that the proposed model significantly outperforms state-of-the-art methods on both the SYSU-MM01 and RegDB datasets.

Abstract:
This paper considers the problem of generating an HDR image of a scene from its LDR images. Recent studies employ deep learning and solve the problem in an end-to-end fashion, leading to significant performance improvements. However, it is still hard to generate a good-quality image from LDR images of a dynamic scene captured by a hand-held camera; for example, occlusion due to the large motion of foreground objects causes ghosting artifacts. The key to success lies in how well we can fuse the input images in their feature space, where we wish to remove the factors leading to low-quality image generation while performing the fundamental computations for HDR image generation, e.g., selecting the best-exposed image/region. We propose a novel method that can better fuse the features based on two ideas. One is multi-step feature fusion; our network gradually fuses the features in a stack of blocks having the same structure. The other is the design of the component block that effectively performs two operations essential to the problem, i.e., comparing and selecting appropriate images/regions. Experimental results show that the proposed method outperforms the previous state-of-the-art methods on the standard benchmark tests.

Abstract:
Composed image retrieval aims at retrieving images given a reference image and a complementary piece of text. Since composing image and text information can accurately model the user's search intent, composed image retrieval can perform target-specific image retrieval and can potentially be applied to many scenarios such as interactive product search. However, two key challenges must be addressed in composed image retrieval. One is how to fuse the heterogeneous image and text piece in the query into a complementary feature space. The other is how to bridge the heterogeneous gap between text pieces in the query and images in the database. To address these issues, we propose an end-to-end framework for composed image retrieval, which consists of three key components: Multi-modal Complementary Fusion (MCF), Cross-modal Guided Pooling (CGP), and Relative Caption-aware Consistency (RCC). By incorporating the MCF and CGP modules, we can fully integrate the complementary information of the image and text piece in the query through multiple deep interactions and aggregate the obtained local features into an embedding vector. To bridge the heterogeneous gap, we introduce the RCC constraint to align text pieces in the query and images in the database. Extensive experiments on four public benchmark datasets show that the proposed composed image retrieval framework achieves outstanding performance against state-of-the-art methods.

Abstract:
Old photos are an important carrier for preserving the past. Usually, the degradation of old photos is rather diverse and complex. Therefore, existing methods designed for conventional restoration tasks are difficult to generalize. To solve this problem, we propose a novel method based on a generative adversarial network. Our method utilizes the class-attributes of old photos to complete restoration in latent space. Specifically, we divide the process of restoring old photos into two stages: a global defect restoration stage and a local detail restoration stage. In the global defect restoration stage, we extract the latent representations of four classes of high-level attributes: smoothness, clarity, connectivity and completeness. We use latent class-attribute information to restore global defects in latent space, and we obtain a conditional control vector through a condition network to guide the subsequent local detail restoration stage. In the local detail restoration stage, we propose a dynamic condition-guided restoration module that selects the most suitable combination of features to further restore local details through a dynamic network. In addition, we propose a dual discriminator to pay more attention to style and defect restoration. Because we bypass the complex degradation of old photos and directly restore high-level class-attributes, our method has better generalization performance. Experiments show that our method is superior to other existing image restoration methods in terms of visual quality and numerical metrics.

Abstract:
Facial action unit (AU) detection is a challenging task due to the variety and subtlety of individuals' facial behavior. Facial muscle characteristics such as temporal dependencies and action correlations make AU detection differ from general multi-label classification tasks, and capturing these two characteristics is the key to accurate AU detection. However, there is little work to date taking both of them into consideration concurrently. To capture the AU correlations in an image, we first disentangle the global (image) feature into multiple AU-specific features with an AU contrastive loss, and then we compute the feature for each AU by aggregating the features from the other AUs with a self-attention based transformer. Different from the original transformer, we embed the AU semantic dependency matrix into it to weakly guide the attention learning. We then fuse the AU-wise features by weighted fusion to obtain the frame-wise features. We further capture the temporal dependencies among frames by using another attention-based transformer, which aggregates information from the prior frames. Extensive experiments on two benchmark datasets (i.e., BP4D and DISFA) demonstrate that the proposed framework outperforms the state-of-the-art approaches.

Abstract:
Previous work has demonstrated that incorrect white balance (WB) in the camera image signal processing pipeline has a negative impact on the performance of deep neural networks (DNNs) in high-level vision tasks, and traditional image augmentation approaches are not well suited for modeling WB errors. However, it is still unclear when this impact will occur for which kinds of images and objects. In this paper, we manually labeled 2304 images from the RECommended dataset and NUS dataset and discovered that the effect of WB on DNNs is greatly associated with object size and occlusion level among objects. In images with incorrect WB, small objects and objects with heavily occluded backgrounds are the main factors resulting in the bad performance of DNNs, indicating that the effect of WB is clearly associated with the shape of objects. Our findings may support that the functional role of some neurons in the visual cortex (e.g., V1 or V4 areas) realizing color constancy (CC) and encoding object attributes such as color and shape dependently is to contribute to high-level vision. Furthermore, based on this scientific finding, we proposed a novel augmentation strategy to address the negative impact of incorrect WB by expanding the training datasets in both color transformation and synthetic occlusion. We compared our proposed strategy with the current augmentation strategies and showed that our approach clearly improves the performance of DNNs in detection and segmentation tasks with small objects and objects with heavily occluded backgrounds.

Abstract:
Among all solutions of emotion recognition tasks, electroencephalogram (EEG) is a very effective tool and has received broad attention from researchers. In addition, information across multimedia in EEG often provides a more complete picture of emotions. However, few of the existing studies concurrently incorporate EEG information from temporal domain, frequency domain and functional brain connectivity. In this paper, we propose a Multi-Domain Adaptive Graph Convolutional Network (MD-AGCN), fusing the knowledge of both the frequency domain and the temporal domain to fully utilize the complementary information of EEG signals. MD-AGCN also considers the topology of EEG channels by combining the inter-channel correlations with the intra-channel information, from which the functional brain connectivity can be learned in an adaptive manner. Extensive experimental results demonstrate that our model exceeds state-of-the-art methods in most experimental settings. At the same time, the results show that MD-AGCN could extract complementary domain information and exploit channel relationships for EEG-based emotion recognition effectively.

Abstract:
BERT-type structure has led to the revolution of vision-language pre-training and the achievement of state-of-the-art results on numerous vision-language downstream tasks. Existing solutions dominantly capitalize on the multi-modal inputs with mask tokens to trigger mask-based proxy pre-training tasks (e.g., masked language modeling and masked object/frame prediction). In this work, we argue that such masked inputs would inevitably introduce noise for cross-modal matching proxy task, and thus leave the inherent vision-language association under-explored. As an alternative, we derive a particular form of cross-modal proxy objective for video-language pre-training, i.e., Contrastive Cross-modal matching and denoising (CoCo). By viewing the masked frame/word sequences as the noisy augmentation of primary unmasked ones, CoCo strengthens video-language association by simultaneously pursuing inter-modal matching and intra-modal denoising between masked and unmasked inputs in a contrastive manner. Our CoCo proxy objective can be further integrated into any BERT-type encoder-decoder structure for video-language pre-training, named as Contrastive Cross-modal BERT (CoCo-BERT). We pre-train CoCo-BERT on TV dataset and a newly collected large-scale GIF video dataset (ACTION). Through extensive experiments over a wide range of downstream tasks (e.g., cross-modal retrieval, video question answering, and video captioning), we demonstrate the superiority of CoCo-BERT as a pre-trained structure.

Abstract:
The fourth ACM International Workshop on Multimedia Content Analysis in Sports (ACM MMSports'21) is part of the ACM International Conference on Multimedia 2021 (ACM Multimedia 2021). Exceptionally, due to the corona pandemic, the workshop is held virtually. The goal of this workshop is to bring together researchers and practitioners from academia and industry to address challenges and report progress in mining, analyzing, understanding and visualizing the multimedia/multimodal data in sports, sports broadcasts, sports games and sports medicine. The combination of sports and modern technology offers a novel and intriguing field of research with promising approaches for visual broadcast augmentation and understanding, statistical analysis and evaluation, and sensor fusion during workouts as well as competitions. There is a lack of research communities focusing on the fusion of multiple modalities. We are helping to close this research gap with this workshop series on multimedia content analysis in sports.

Abstract:
Understanding complex processes that give cities their form traditionally relied primarily on the analysis of various open data statistics in relation to e.g. neighbourhood demographics, economy and mobility. However, recent years have seen an unprecedented increase in the availability and use of city-related sensors, participatory data and social multimedia. As the valuable information about urban challenges is usually encoded across multiple modalities, such as visual (e.g. panoramic, satellite and user-contributed images), text (e.g. social media and participatory data) and open data statistics, extracting this information requires effective multimedia analysis tools. This Workshop will showcase the power of multimedia computing in addressing various urban challenges, ranging from event detection and analysis, location recommendation and crowdedness estimation to more efficient handling of citizen reports and modelling and improving city liveability. In addition, it will serve as an impulse for the multimedia community to intensify research on these interesting, challenging and truly multimodal problems.

Abstract:
With the advent of deep neural networks, many multimedia tasks have improved significantly. However, deep neural networks still lack the ability to learn from limited labeling, e.g., from a few exemplars, or to generalize quickly to new tasks. To address this inefficiency, there is a pressing need for methods that drastically reduce the requirements for labeled training data. This workshop aims to provide a platform for discussing the challenges and corresponding innovative approaches in multimedia with less labeling. We hope that more advanced technologies will be proposed or inspired, and we have invited several domain experts to share their insights and research progress on the topic of MULL.

Abstract:
The 2nd Multimodal Sentiment Analysis (MuSe) 2021 Challenge-based Workshop is held in conjunction with ACM Multimedia'21. Two datasets are provided as part of the challenge. Firstly, the MuSe-CaR dataset, which focuses on user-generated, emotional vehicle reviews from YouTube, and secondly, the novel Ulm-Trier Social Stress (Ulm-TSST) dataset, which shows people in stressful circumstances. Participants are faced with four sub-challenges: predicting arousal and valence in a time- and value-continuous manner on a) MuSe-CaR (MuSe-Wilder) and b) Ulm-TSST (MuSe-Stress); c) predicting unsupervised created emotion classes on MuSe-CaR (MuSe-Sent); d) predicting a fusion of human-annotated arousal and measured galvanic skin response also as a continuous target on Ulm-TSST (MuSe-Physio). In this summary, we describe the motivation, the sub-challenges, the challenge conditions, the participation, and the most successful approaches.

Abstract:
Single image dehazing is a challenging task, for which the domain shift between synthetic training data and real-world testing images usually leads to degradation of existing methods. To address this issue, we propose a novel image dehazing framework collaborating with unlabeled real data. First, we develop a disentangled image dehazing network (DID-Net), which disentangles the feature representations into three component maps, i.e., the latent haze-free image, the transmission map, and the global atmospheric light estimate, respecting the physical model of the haze process. Our DID-Net predicts the three component maps by progressively integrating features across scales, and refines each map by passing it through an independent refinement network. Then a disentangled-consistency mean-teacher network (DMT-Net) is employed to exploit unlabeled real data for boosting single image dehazing. Specifically, we encourage the coarse predictions and refinements of each disentangled component to be consistent between the student and teacher networks by using a consistency loss on unlabeled real data. We compare against 13 state-of-the-art dehazing methods on a newly collected dataset (Haze4K) and two widely used dehazing datasets (i.e., SOTS and HazeRD), as well as on real-world hazy images. Experimental results demonstrate that our method achieves clear quantitative and qualitative improvements over the existing methods.
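The three components named above (haze-free image, transmission map, atmospheric light) are the terms of the widely used atmospheric scattering model, I = J * t + A * (1 - t). The sketch below assumes that this is the physical model the abstract refers to, which the abstract itself does not state. The function names, the flat list-of-pixels representation, and the `t_min` clamp are illustrative choices, not part of DID-Net.

```python
def synthesize_haze(clear, transmission, airlight):
    """Compose hazy pixels with I = J * t + A * (1 - t).

    clear and transmission are equal-length lists of per-pixel values in
    [0, 1]; airlight is a single global scalar (an assumption of this sketch).
    """
    return [j * t + airlight * (1.0 - t) for j, t in zip(clear, transmission)]


def recover_clear(hazy, transmission, airlight, t_min=0.1):
    """Invert the model: J = (I - A) / max(t, t_min) + A.

    The clamp t_min avoids dividing by a near-zero transmission.
    """
    return [
        (i - airlight) / max(t, t_min) + airlight
        for i, t in zip(hazy, transmission)
    ]


clear_pixels = [0.2, 0.5, 0.9]
transmission_map = [0.8, 0.5, 0.3]
atmospheric_light = 1.0

hazy_pixels = synthesize_haze(clear_pixels, transmission_map, atmospheric_light)
restored_pixels = recover_clear(hazy_pixels, transmission_map, atmospheric_light)
```

Under this model, once the transmission map and atmospheric light are known, the haze-free image follows in closed form, so the consistency of the three predicted maps can be checked by recomposing the hazy input.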

Abstract:
Deep Convolutional Neural Networks (CNNs) have achieved great success in image classification. While conventional CNNs optimized with iterative gradient descent algorithms on large data have been widely used and investigated, there is also research focusing on learning CNNs with non-iterative optimization methods such as the principal component analysis network (PCANet). It is very simple and efficient but achieves competitive performance on some image classification tasks, especially on tasks with only a small amount of data available. This paper further extends this line of research and proposes a deep Marginal Fisher Analysis (MFA) based CNN, termed DMNet. It addresses the limitation of PCANet-like CNNs when the samples do not follow a Gaussian distribution, by using a local MFA for CNN filter optimization. It uses a graph embedding framework for convolution filter optimization by maximizing the inter-class discriminability among marginal points while minimizing intra-class distance. Cascaded MFA convolution layers can be used to construct a deep network. Moreover, a binary stochastic hashing is developed by randomly selecting features with a probability based on the importance of feature maps for binary hashing. Experimental results demonstrate that the proposed method achieves state-of-the-art results among non-iteratively optimized CNN methods, and ablation studies have been conducted to verify the effectiveness of the proposed modules in our DMNet.

Abstract:
Encouraging progress has been made towards Visual Question Answering (VQA) in recent years, but it is still challenging to enable VQA models to adaptively generalize to out-of-distribution (OOD) samples. Intuitively, recompositions of existing visual concepts (i.e., attributes and objects) can generate unseen compositions in the training set, which will promote VQA models to generalize to OOD samples. In this paper, we formulate OOD generalization in VQA as a compositional generalization problem and propose a graph generative modeling-based training scheme (X-GGM) to handle the problem implicitly. X-GGM leverages graph generative modeling to iteratively generate a relation matrix and node representations for the predefined graph that utilizes attribute-object pairs as nodes. Furthermore, to alleviate the unstable training issue in graph generative modeling, we propose a gradient distribution consistency loss to constrain the data distribution with adversarial perturbations and the generated distribution. The baseline VQA model (LXMERT) trained with the X-GGM scheme achieves state-of-the-art OOD performance on two standard VQA OOD benchmarks, i.e., VQA-CP v2 and GQA-OOD. Extensive ablation studies demonstrate the effectiveness of X-GGM components.

Abstract:
Pattern images are artificially designed images which are discriminative in aspects of elements, styles, arrangements and so on. Pattern images are widely used in fields like textile, clothing, art, fashion and graphic design. With the growth of image numbers, pattern image retrieval has great potential in commercial applications and industrial production. However, most of existing content-based image retrieval works mainly focus on describing simple attributes with clear conceptual boundaries, which are not suitable for pattern image retrieval. It is difficult to accurately represent and retrieve pattern images which include complex details and multiple elements. Therefore, in this paper, we collect a new pattern image dataset with multiple labels per image for the pattern image retrieval task. To extract discriminative semantic features of multi-label pattern images and construct high-level topology relationships between features, we further propose an Attention Mechanism Driven Graph Convolutional Network (AMD-GCN). Different layers of the multi-semantic attention module activate regions of interest corresponding to multiple labels, respectively. By embedding the learned labels from attention module into the graph convolutional network, which can capture the dependency of labels on the graph manifold, the AMD-GCN builds an end-to-end framework to extract high-level semantic features with label semantics and inner relationships for retrieval. Experiments on the pattern image dataset show that the proposed method highlights the relevant semantic regions of multiple labels, and achieves higher accuracy than state-of-the-art image retrieval methods.

Abstract:
We introduce MeronymNet, a novel hierarchical approach for controllable, part-based generation of multi-category objects using a single unified model. We adopt a guided coarse-to-fine strategy involving semantically conditioned generation of bounding box layouts, pixel-level part layouts and ultimately, the object depictions themselves. We use Graph Convolutional Networks, Deep Recurrent Networks along with custom-designed Conditional Variational Autoencoders to enable flexible, diverse and category-aware generation of 2-D objects in a controlled manner. The performance scores for generated objects reflect MeronymNet's superior performance compared to multiple strong baselines and ablative variants. We also showcase MeronymNet's suitability for controllable object generation and interactive object editing at various levels of structural and semantic granularity.

Abstract:
The task of instance segmentation in videos aims to consistently identify objects at pixel level throughout the entire video sequence. Existing state-of-the-art methods either follow the tracking-by-detection paradigm and employ multi-stage pipelines, or directly train a complex deep model to process entire video clips as 3D volumes. However, these methods are typically slow and resource-consuming, so they are often limited to offline processing. In this paper, we propose SRNet, a simple and efficient framework for joint segmentation and tracking of object instances in videos. The key to achieving both high efficiency and accuracy in our framework is to formulate the instance segmentation and tracking problem as a unified spatial-relation learning task where each pixel in the current frame relates to its object center, and each object center relates to its location in the previous frame. This unified learning formulation allows our framework to perform joint instance segmentation and tracking in a single stage while maintaining low overheads among different learning tasks. Our proposed framework can handle two different task settings and demonstrates comparable performance with state-of-the-art methods on two different benchmarks while running significantly faster.

Abstract:
Today's mainstream shadow detection methods are manually designed via a case-by-case approach. Accordingly, these methods may only be able to detect shadows for specific scenes. Given the complex and diverse shadow scenes in reality, none of the existing methods can provide a one-size-fits-all solution with satisfactory performance. To address this problem, this paper introduces a new concept, named shadow detection confidence, which can be used to evaluate the effect of any shadow detection method for any given scene. The best detection effect for a scene is achieved by combining prediction results by multiple methods. To measure the shadow detection confidence characteristics of an image, a novel relative confidence map prediction network (RCMPNet) is proposed. Experimental results show that the proposed method outperforms multiple state-of-the-art shadow detection methods on four shadow detection benchmark datasets.

Abstract:
Vision-and-language pretraining (VLP) aims to learn generic multimodal representations from massive image-text pairs. While various successful attempts have been proposed, learning fine-grained semantic alignments between image-text pairs plays a key role in their approaches. Nevertheless, most existing VLP approaches have not fully utilized the intrinsic knowledge within the image-text pairs, which limits the effectiveness of the learned alignments and further restricts the performance of their models. To this end, we introduce a new VLP method called ROSITA, which integrates the cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments. Specifically, we introduce a novel structural knowledge masking (SKM) strategy that uses the scene graph structure as a prior to perform masked language (region) modeling, which enhances the semantic alignments by eliminating the interference information within and across modalities. Extensive ablation studies and comprehensive analysis verify the effectiveness of ROSITA in semantic alignments. Pretrained with both in-domain and out-of-domain datasets, ROSITA significantly outperforms existing state-of-the-art VLP methods on three typical vision-and-language tasks over six benchmark datasets.

Abstract:
Joint learning of scene parsing and depth estimation remains a challenging task due to the rivalry between the two tasks. In this paper, we revisit the mutual enhancement for joint semantic segmentation and depth estimation. Inspired by the observation that the competition and cooperation could be reflected in the feature frequency components of different tasks, we propose a Frequency Aware Feature Enhancement (FAFE) network that can effectively enhance the reciprocal relationship while avoiding the competition. In FAFE, a frequency disentanglement module is proposed to fetch the favorable frequency component sets for each task and resolve the discordance between the two tasks. For task cooperation, we introduce a re-calibration unit to aggregate features of the two tasks, so that each task complements the other's information. Accordingly, the learning of each task can be appropriately boosted by the complementary task. Besides, a novel local-aware consistency loss function is imposed on the predicted segmentation and depth so as to strengthen the cooperation. With the FAFE network and the new local-aware consistency loss encapsulated into the multi-task learning network, the proposed approach achieves superior performance over previous state-of-the-art methods. Extensive experiments and ablation studies on multi-task datasets demonstrate the effectiveness of our proposed approach.

Abstract:
The research on human emotion under multimedia stimulation based on physiological signals is an emerging field and important progress has been achieved for emotion recognition based on multi-modal signals. However, it is challenging to make full use of the complementarity among spatial-spectral-temporal domain features for emotion recognition, as well as model the heterogeneity and correlation among multi-modal signals. In this paper, we propose a novel two-stream heterogeneous graph recurrent neural network, named HetEmotionNet, fusing multi-modal physiological signals for emotion recognition. Specifically, HetEmotionNet consists of the spatial-temporal stream and the spatial-spectral stream, which can fuse spatial-spectral-temporal domain features in a unified framework. Each stream is composed of the graph transformer network for modeling the heterogeneity, the graph convolutional network for modeling the correlation, and the gated recurrent unit for capturing the temporal domain or spectral domain dependency. Extensive experiments on two real-world datasets demonstrate that our proposed model achieves better performance than state-of-the-art baselines.

Abstract:
Automatic facial action unit (AU) recognition is a challenging task due to the scarcity of manual annotations. To alleviate this problem, a large amount of effort has been dedicated to methods that leverage abundant unlabeled data. However, many aspects of the unique properties of AUs, such as their regional and relational characteristics, are not sufficiently explored in previous works. Motivated by this, we take the AU properties into consideration and propose two auxiliary AU-related tasks to bridge the gap between limited annotations and the model performance in a self-supervised manner via the unlabeled data. Specifically, to enhance the discrimination of regional features with AU relation embedding, we design a task of RoI inpainting to recover the randomly cropped AU patches. Meanwhile, a single-image-based optical flow estimation task is proposed to leverage the dynamic change of facial muscles and encode the motion information into the global feature representation. Based on these two self-supervised auxiliary tasks, local features, mutual relations and motion cues of AUs are better captured in the backbone network with the proposed regional and temporal based auxiliary task learning (RTATL) framework. Extensive experiments on BP4D and DISFA demonstrate the superiority of our method, and new state-of-the-art performance is achieved.

Abstract:
Tag inference is an important task in the business of video platforms with wide applications such as recommendation, interpretation, and more. Existing works are mainly based on extracting video information from multiple modalities such as frames or music, and then infer tags through classification or object detection. This, however, does not apply to inferring generic tags or taxonomy that are less relevant to video contents, such as video originality or its broader category, which are important in practice. In this paper, we claim that these generic tags can be modeled through the semantic relations between videos and tags, and can be utilized simultaneously with the multi-modal features to achieve better video tagging. We propose TransFusion, an end-to-end supervised learning framework that fuses multi-modal embeddings (e.g., vision, audio, texts, etc.) with the knowledge embedding to derive the video representation. To infer the diverse tags following heterogeneous relations, TransFusion adopts a dual attentive approach to learn both the modality importance in fusion and relation importance in inference. Besides, it is general enough and can be used with the existing translation-based knowledge embedding approaches. Extensive experiments show that TransFusion outperforms the baseline methods with lowered mean rank and at least 9.59% improvement in HITS@10 on the real-world video knowledge graph.
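The evaluation terms used above (translation-based knowledge embedding, mean rank, HITS@10) can be made concrete with a small sketch. It assumes a TransE-style score, where a triple (head, relation, tail) is plausible when head + relation is close to tail. The abstract does not say which translation-based model TransFusion is paired with, so the L1 distance, the function names, and the toy vectors are all hypothetical.

```python
def translation_distance(head, relation, tail):
    """L1 distance between (head + relation) and tail; lower is more plausible."""
    return sum(abs(h + r - t) for h, r, t in zip(head, relation, tail))


def rank_of_true_tail(head, relation, candidate_tails, true_index):
    """1-based rank of the true tail among all candidates (ties favor the truth)."""
    distances = [translation_distance(head, relation, t) for t in candidate_tails]
    true_distance = distances[true_index]
    return 1 + sum(1 for d in distances if d < true_distance)


def mean_rank(ranks):
    """Average rank of the true entities; lower is better."""
    return sum(ranks) / len(ranks)


def hits_at_k(ranks, k=10):
    """Fraction of queries whose true entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy query: a video embedding, a "has tag" relation, and three candidate tags.
video_embedding = [0.0, 0.0]
has_tag_relation = [1.0, 0.0]
tag_embeddings = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]

best_rank = rank_of_true_tail(video_embedding, has_tag_relation, tag_embeddings, 0)
worst_rank = rank_of_true_tail(video_embedding, has_tag_relation, tag_embeddings, 2)
```

With ranks collected over a test set, `mean_rank` and `hits_at_k` give the two metrics reported in the abstract: a lower mean rank and a higher HITS@10 both indicate better tag inference.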

Abstract:
We investigate the challenging problem of table structure recognition in this work. Many recent methods adopt a graph-based context aggregator with a strong inductive bias to reason about the sparse contextual relationships of table elements. However, the strong constraints may be too restrictive to represent complicated table relationships. In order to learn a more appropriate inductive bias from data, we introduce the Transformer as a context aggregator in this work. Nevertheless, a Transformer taking dense context as input requires larger-scale data and may suffer from an unstable training procedure due to the weakening of the inductive bias. To overcome the above limitations, we design FLAG (FLexible context AGgregator), which marries the Transformer with a graph-based context aggregator in an adaptive way. Based on FLAG, an end-to-end framework requiring no extra meta-data or OCR information, termed FLAG-Net, is proposed to flexibly modulate the aggregation of dense and sparse context for the relational reasoning of table elements. We investigate the modulation pattern in FLAG and show which contextual information it focuses on, which is vital for recognizing table structure. Extensive experimental results on benchmarks demonstrate that our proposed FLAG-Net surpasses other compared methods by a large margin.

Abstract:
In industry, there are plenty of scenarios where old gray photos need to be automatically colored, such as video sites and archives. In this paper, we present HistoryNet, which focuses on diverse, high-fidelity clothing colorization of historical persons based on fine-grained semantic understanding and priors. Colorization of historical persons is realistic and practical; however, existing methods do not perform well in this regard. HistoryNet consists of three parts, namely classification, fine-grained semantic parsing and colorization. The classification sub-module classifies images according to eras, nationalities and garment types; the parsing sub-network supplies the semantics of person contours, clothing and background in the image to achieve more accurate colorization of clothes and persons and to prevent color overflow. In the training process, we integrate classification and semantic parsing features into the coloring generation network to improve colorization. Through the design of the classification and parsing sub-networks, the accuracy of image colorization can be improved and the boundaries of each part of the image can be made clearer. Moreover, we also propose a novel Modern Historical Movies Dataset (MHMD) containing 1,353,166 images and 42 labels of eras, nationalities, and garment types for automatic colorization, collected from 147 historical movies or TV series made in modern times. Various quantitative and qualitative comparisons demonstrate that our method outperforms the state-of-the-art colorization methods, especially on military uniforms, for which it produces correct colors according to the historical literature.

Abstract:
Virtual characters have been widely adopted in many areas, such as virtual assistants, virtual customer service, and robotics. In this paper, we focus on their application in e-commerce live streaming. In particular, we propose a virtual character generation and animation system that supports e-commerce live streaming with virtual characters as anchors. The system offers a virtual character face generation tool based on a weakly supervised 3D face reconstruction method. The method takes a single photo as input and generates a 3D face model with both similarity and aesthetics considered. It does not require 3D face annotation data, thanks to a differentiable neural rendering technique that seamlessly integrates rendering into a deep learning based 3D face reconstruction framework. Moreover, the system provides two animation approaches, which support two different ways of live streaming. The first approach is based on real-time motion capture. An actor's performance is captured in real time via a monocular camera and then used to animate a virtual anchor. The second approach is text-driven animation, in which human-like animation is automatically generated from a text script. The relationship between text script and animation is learned from training data that can be accumulated via the motion capture based animation. To the best of our knowledge, the presented work is the first sophisticated virtual character generation and animation system that is designed for e-commerce live streaming and actually deployed on an online shopping platform with millions of daily viewers.

Abstract:
Template matching of multi-modal images has been a challenge in image matching, and it is difficult to balance speed and accuracy, especially for large images. To address this, we propose a stepwise image matching method that achieves precise localization through coarse-to-fine image matching with cascaded networks. In the proposed method, a coarse-grained matching network is first constructed to locate a rough matching position based on cross-correlating features of optical and SAR images. Specifically, to enhance the credibility of the matching position, a suppression network is designed to evaluate the obtained cross-correlation feature and is added into the coarse-grained network as feedback. Second, a fine-grained matching network is constructed based on the obtained rough matching result to obtain a more precise match. In this part, ternary groups are utilized to construct the training samples. Interestingly, we use regions offset by a few pixels as the negative class, which effectively distinguishes similar neighbourhoods of the rough matching position. Moreover, a modified Siamese network is used to extract features of SAR and optical images, respectively. Finally, experimental results illustrate that the proposed method obtains more precise matching than the state-of-the-art methods.

Abstract:
Prevailing Multiple Object Tracking (MOT) works following the Tracking-by-Detection (TBD) paradigm pay most attention to either object detection in a first step or data association in a second step. In this paper, we approach the MOT problem from a different perspective by directly obtaining the embedded spatial-temporal information of trajectories from raw video data. For this purpose, we propose a joint trajectory locating and attributes encoding framework for real-time, online MOT. We first introduce a trajectory attribute representation scheme designed for each tracked target (instead of object), where the extracted Trajectory Map (TM) encodes the spatial-temporal attributes of a trajectory across a window of consecutive video frames. Next, we present a Temporal Priors Embedding (TPE) methodology to infer these attributes with a logical reasoning strategy based on long-term feature dynamics. The proposed MOT framework projects multiple attributes of tracked targets, e.g., presence, enter/exit, location, scale, motion, etc., into a continuous TM to perform one-shot regression for real-time MOT. Experimental results show that our proposed video-based method runs at 33 FPS and is more accurate and robust than the detection-based tracking methods and a few other State-of-the-Art (SOTA) approaches on the MOT16/17/20 benchmarks.

Abstract:
Hashing learns compact binary codes to store and retrieve massive data efficiently. In particular, unsupervised deep hashing is supported by powerful deep neural networks and has the desirable advantage of label independence. It is a promising technique for scalable image retrieval. However, deep models introduce a large number of parameters, which are hard to optimize due to the lack of explicit semantic labels and which bring considerable training cost. As a result, the retrieval accuracy and training efficiency of existing unsupervised deep hashing methods are still limited. To tackle these problems, in this paper, we propose a simple and efficient Lightweight Augmented Graph Network Hashing (LAGNH) method with a two-pronged strategy. On the one hand, we extract the inner structure of the image as auxiliary semantics to enhance the semantic supervision of the unsupervised hash learning process. On the other hand, we design a lightweight network structure with the assistance of the auxiliary semantics, which greatly reduces the number of network parameters that need to be optimized and thus greatly accelerates the training process. Specifically, we design a cross-modal attention module based on the auxiliary semantic information to adaptively mitigate the adverse effects in the deep image features. Besides, the hash codes are learned by multi-layer message passing within an adversarially regularized graph convolutional network. Simultaneously, the semantic representation capability of the hash codes is further enhanced by reconstructing the similarity graph. Experimental results show that our method achieves significant performance improvement compared with the state-of-the-art unsupervised deep hashing methods in terms of both retrieval accuracy and efficiency. Notably, on the MS-COCO dataset, our method achieves more than 10% improvement in retrieval precision and a 2.7x speedup in training time compared with the second-best result.
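The opening claim, that compact binary codes make storage and retrieval efficient, can be illustrated with a short sketch. It is a generic illustration of hash-based retrieval, not the LAGNH method: sign thresholding stands in for the learned hash function, and the function names and toy features are hypothetical.

```python
def binarize(features):
    """Pack the signs of a real-valued feature vector into one integer code.

    Assumption: bit = 1 when the feature is positive, else 0, with the first
    feature as the most significant bit.
    """
    code = 0
    for value in features:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(code_a, code_b):
    """Number of differing bits, computed with XOR and a bit count."""
    return bin(code_a ^ code_b).count("1")


def retrieve(query_code, database_codes, top_k=2):
    """Indices of the top_k database codes closest to the query in Hamming space."""
    order = sorted(
        range(len(database_codes)),
        key=lambda index: hamming_distance(query_code, database_codes[index]),
    )
    return order[:top_k]


# Toy database of three 4-bit codes and one query feature vector.
database = [0b1010, 0b0101, 0b1000]
query = binarize([0.3, -1.2, 0.8, -0.1])
nearest = retrieve(query, database, top_k=2)
```

Each image is stored as a handful of bits, and comparing two images costs one XOR plus a bit count, which is why hashing scales to large retrieval databases.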

Abstract:
Shadow puppet play is a representative Chinese intangible cultural heritage, which has a history of more than two thousand years. However, with the popularity of digital media, this traditional art form has become desolate. "Reconstruction" is an interactive digital artwork inspired by the production and performance of Chinese shadow puppet. The scenes and characters are designed based on the art style of shadow puppet. The participant's motion is captured with a Kinect and used to control the motion of the character.

Abstract:
Without appealing to exhaustive labeled data, self-supervised monocular depth estimation (MDE) plays a fundamental role in computer vision. Previous methods usually adopt a one-stage MDE network, which is insufficient to achieve high performance. In this paper, we dig deep into this task to propose an aggressive framework termed AggNet. The framework is based on a training-only progressive two-stage module to perform pseudo counter-surveillance as well as a simple yet effective dual-warp loss function between image pairs. In particular, we first propose a residual module, which follows the MDE network to learn a refined depth. The residual module takes both the initial depth generated from MDE and the initial color image as input to generate refined depth with residual depth learning. Then, the refined depth is leveraged to supervise the initial depth simultaneously during the training period. For inference, only the MDE network is retained to regress depth from a single image, which gains better performance without introducing extra computation. In addition to self-distillation loss, a simple yet effective dual-warp consistency loss is introduced to encourage the MDE network to keep depth consistency between stereo image pairs. Extensive experiments show that our AggNet achieves state-of-the-art performance on the KITTI and Make3D datasets.

Abstract:
Generative adversarial network (GAN)-based models possess superior capability of high-fidelity image synthesis. There are a wide range of semantically meaningful directions in the latent representation space of well-trained GANs, and the corresponding latent space walks are meaningful for semantic controllability in the synthesized images. To explore the underlying organization of a latent space, we propose an unsupervised Density-Preserving Latent Semantics Exploration model (DP-LaSE). The important latent directions are determined by maximizing the variations in intermediate features, while the correlation between the directions is minimized. Considering that latent codes are sampled from a prior distribution, we adopt a density-preserving regularization approach to ensure latent space walks are maintained in iso-density regions, since moving to a higher/lower density region tends to cause unexpected transformations. To further refine semantics-specific transformations, we perform subspace learning over intermediate feature channels, such that the transformations are limited to the most relevant subspaces. Extensive experiments on a variety of benchmark datasets demonstrate that DP-LaSE is able to discover interpretable latent space walks, and specific properties of synthesized images can thus be precisely controlled.

Abstract:
Multi-modality image fusion refers to generating a complementary image that integrates typical characteristics from source images. In recent years, we have witnessed the remarkable progress of deep learning models for multi-modality fusion. Existing CNN-based approaches strain every nerve to design various architectures for realizing these tasks in an end-to-end manner. However, these handcrafted designs are unable to cope with the high demanding fusion tasks, resulting in blurred targets and lost textural details. To alleviate these issues, in this paper, we propose a novel approach, aiming at searching effective architectures according to various modality principles and fusion mechanisms.

Abstract:
With the explosive growth of web videos in recent years, large-scale Content-Based Video Retrieval (CBVR) becomes increasingly essential in video filtering, recommendation, and copyright protection. Segment-level CBVR (S-CBVR) locates the start and end time of similar segments in finer granularity, which is beneficial for user browsing efficiency and infringement detection especially in long video scenarios. The challenge of S-CBVR task is how to achieve high temporal alignment accuracy with efficient computation and low storage consumption. In this paper, we propose a Segment Similarity and Alignment Network (SSAN) in dealing with the challenge which is firstly trained end-to-end in S-CBVR. SSAN is based on two newly proposed modules in video retrieval: (1) An efficient Self-supervised Keyframe Extraction (SKE) module to reduce redundant frame features, (2) A robust Similarity Pattern Detection (SPD) module for temporal alignment. In comparison with uniform frame extraction, SKE not only saves feature storage and search time, but also introduces comparable accuracy and limited extra computation time. In terms of temporal alignment, SPD localizes similar segments with higher accuracy and efficiency than existing deep learning methods. Furthermore, we jointly train SSAN with SKE and SPD and achieve an end-to-end improvement. Meanwhile, the two key modules SKE and SPD can also be effectively inserted into other video retrieval pipelines and gain considerable performance improvements. Experimental results on public datasets show that SSAN can obtain higher alignment accuracy while saving storage and online query computational cost compared to existing methods.

Abstract:
This paper lays the foundation for a new 3D content market by establishing a content security framework using databases and benchmarks for in-depth research on source identification of 3D printed objects. The proposed benchmark, the SI3DP dataset, offers a more generalized multimedia forensic technique. Assuming that the source of a 3D printed object can be identified from various invisible traces left by the printing process, we obtain close-up images and full object images of 252 printed objects from 18 different printing setups. We then propose a benchmark with five challenging tasks, such as device-level identification and scan-and-reprint detection, using the provided dataset. Our baseline shows that the printer type and its attributes can be identified based on microscopic differences in surface texture. Contrary to the conventional belief that only microscopic views such as close-up images are useful for identifying the printer model, we also achieve a certain level of performance even from a relatively macroscopic point of view. We then propose a multitask-multimodal architecture for the device-level identification task to exploit rich knowledge from different image modalities and tasks. The SI3DP dataset can promote future in-depth research related to digital forensics and intellectual property protection.

Abstract:
3D human pose and shape recovery from a monocular RGB image is a challenging task. Existing learning-based methods depend heavily on weak supervision signals, e.g., 2D and 3D joint locations, due to the lack of in-the-wild paired 3D supervision. However, given the 2D-to-3D ambiguities in these weak supervision labels, the network easily gets stuck in local optima when trained with such labels. In this paper, we reduce the ambiguity by optimizing multiple initializations. Specifically, we propose a three-stage framework named Multi-Initialization Optimization Network (MION). In the first stage, we strategically select different coarse 3D reconstruction candidates that are compatible with the 2D keypoints of the input sample. Each coarse reconstruction can be regarded as an initialization that leads to one optimization branch. In the second stage, we design a mesh refinement transformer (MRT) to refine each coarse reconstruction result separately via a self-attention mechanism. Finally, a Consistency Estimation Network (CEN) is proposed to find the best result from multiple candidates by evaluating whether the visual evidence in the RGB image matches a given 3D reconstruction. Experiments demonstrate that our Multi-Initialization Optimization Network outperforms existing 3D mesh based methods on multiple public benchmarks.

Abstract:
Under stereo settings, the problems of image super-resolution (SR) and disparity estimation are interrelated in that the result of each problem can help to solve the other. The effective exploitation of correspondence between different views improves SR performance, while the high-resolution (HR) features with richer details benefit correspondence estimation. Based on this motivation, we propose a Stereo Super-Resolution and Disparity Estimation Feedback Network (SSRDE-FNet), which handles stereo image super-resolution and disparity estimation simultaneously in a unified framework and lets the two tasks interact with each other to further improve their performance. Specifically, the SSRDE-FNet is composed of two dual recursive sub-networks for left and right views. Besides the cross-view information exploitation in the low-resolution (LR) space, HR representations produced by the SR process are utilized to perform HR disparity estimation with higher accuracy, through which the HR features can be aggregated to generate a finer SR result. Afterward, the proposed HR Disparity Information Feedback (HRDIF) mechanism delivers information carried by HR disparity back to previous layers to further refine the SR image reconstruction. Extensive experiments demonstrate the effectiveness and advancement of SSRDE-FNet.

Abstract:
Nowadays, scene text recognition has attracted more and more attention due to its various applications. Most state-of-the-art methods adopt an encoder-decoder framework with attention mechanism, which generates text autoregressively from left to right. Despite the convincing performance, the speed is limited because of the one-by-one decoding strategy. As opposed to autoregressive models, non-autoregressive models predict the results in parallel with a much shorter inference time, but the accuracy falls behind the autoregressive counterpart considerably. In this paper, we propose a Parallel, Iterative and Mimicking Network (PIMNet) to balance accuracy and efficiency. Specifically, PIMNet adopts a parallel attention mechanism to predict the text faster and an iterative generation mechanism to make the predictions more accurate. In each iteration, the context information is fully explored. To improve learning of the hidden layer, we exploit the mimicking learning in the training phase, where an additional autoregressive decoder is adopted and the parallel decoder mimics the autoregressive decoder with fitting outputs of the hidden layer. With the shared backbone between the two decoders, the proposed PIMNet can be trained end-to-end without pre-training. During inference, the branch of the autoregressive decoder is removed for a faster speed. Extensive experiments on public benchmarks demonstrate the effectiveness and efficiency of PIMNet. Our code is available in the supplementary material.

Abstract:
As an important field of multimedia, 3D shape recognition has attracted much research attention in recent years. Many deep learning models have been proposed for effective 3D shape representation. View-based methods show their superiority due to the comprehensive exploration of visual characteristics with the help of established 2D CNN architectures. Generally, the current approaches have the following disadvantages. First, the vast majority of methods do not consider the sequential information among the multiple views, which can provide descriptive characteristics for shape representation. Second, the incomplete exploration of multi-view correlations directly affects the discrimination of shape descriptors. Finally, roughly aggregating multi-view features leads to the loss of descriptive information, which limits the effectiveness of the shape representation. To handle these issues, we propose a novel sequential view based hierarchical attention network (SVHAN) for 3D shape recognition. Specifically, we first divide the view sequence into several view blocks. Then, we introduce a novel hierarchical feature aggregation module (HFAM), which hierarchically exploits the view-level, block-level, and shape-level features; the intra- and inter-view-block correlations are also captured to improve the discrimination of the learned features. Subsequently, a novel selective fusion module (SFM) is designed for feature aggregation, considering the correlations between different levels and preserving effective information. Finally, discriminative and informative shape descriptors are generated for the recognition task. We validate the effectiveness of our proposed method on two public databases. The experimental results show the superiority of SVHAN over the current state-of-the-art approaches.

Abstract:
Recent deep networks have convincingly demonstrated high capability in crowd counting, which is a critical task attracting widespread attention due to its various industrial applications. Despite such progress, trained data-dependent models usually cannot generalize well to unseen scenarios because of the inherent domain shift. To address this issue, this paper proposes a novel adversarial scoring network (ASNet) to gradually bridge the gap across domains from coarse to fine granularity. Specifically, at the coarse-grained stage, we design a dual-discriminator strategy to adapt the source domain to be close to the target from the perspectives of both global and local feature space via adversarial learning. The distributions of the two domains can thus be roughly aligned. At the fine-grained stage, we explore the transferability of source characteristics by scoring how similar the source samples are to target ones at multiple levels, based on the generative probability derived from the coarse stage. Guided by these hierarchical scores, the transferable source features are properly selected to enhance the knowledge transfer during the adaptation process. With the coarse-to-fine design, the generalization bottleneck induced by the domain discrepancy can be effectively alleviated. Three sets of migration experiments show that the proposed methods achieve state-of-the-art counting performance compared with major unsupervised methods.

Abstract:
Image-Text Matching (ITM) is a fundamental and emerging task, which plays a key role in cross-modal understanding. It remains a challenge because prior works mainly focus on learning fine-grained (i.e., coarse and/or phrase) correspondence, without considering the syntactical correspondence. In theory, a sentence is not only a set of words or phrases but also a syntactic structure, consisting of a set of basic syntactic tuples (i.e., (attribute) object - predicate - (attribute) subject). Inspired by this, we propose a Conceptual and Syntactical Cross-modal Alignment with Cross-level Consistency (CSCC) for Image-Text Matching, which simultaneously explores multiple-level cross-modal alignments at the conceptual and syntactic levels with a consistency constraint. Specifically, a conceptual-level cross-modal alignment is introduced for exploring the fine-grained correspondence, while a syntactical-level cross-modal alignment is proposed to explicitly learn a high-level syntactic similarity function. Moreover, an empirical cross-level consistent attention loss is introduced to maintain the consistency between cross-modal attentions obtained from the above two cross-modal alignments. To justify our method, comprehensive experiments are conducted on two public benchmark datasets, i.e., MS-COCO (1K and 5K) and Flickr30K, which show that our CSCC outperforms state-of-the-art methods with fairly competitive improvements.

Abstract:
Brightening low-light images of diverse scenes is a challenging but widely studied task in the multimedia community. Approaches based on Convolutional Neural Networks (CNNs) mostly acquire the enhancement model by learning the data distribution from specific scenes. However, these works adapt poorly (or even fail) when facing real-world scenarios never encountered before. To overcome this, we develop a novel bilevel learning scheme for fast adaptation to bridge the gap between low-light scenes. Concretely, we construct a Retinex-induced encoder-decoder with an adaptive denoising mechanism, aiming at covering more practical cases. Different from existing works that directly learn model parameters from massive data, we provide a new hyperparameter optimization perspective to formulate a bilevel learning scheme towards general low-light scenarios. This scheme depicts the latent correspondence (i.e., scene-irrelevant encoder) and the respective characteristic (i.e., scene-specific decoder) among different data distributions. Because the inner optimization is expensive, estimating the hyper-parameter gradient exactly can be prohibitive; we therefore develop an approximate hyper-parameter gradient method by introducing the one-step forward approximation and finite difference approximation to ensure highly efficient inference. Extensive experiments are conducted to reveal our superiority against other state-of-the-art methods. A series of analytical experiments are also executed to verify our effectiveness.

Abstract:
Recent scene text spotters that integrate text detection module and recognition module have made significant progress. However, existing methods encounter two problems. 1). The data imbalance issue between text detection module and text recognition module limits the performance of text spotters. 2). The default left-to-right reading direction leads to errors in unconventional text spotting. In this paper, we propose a novel scene text spotter TDI to solve these problems. Firstly, in order to solve the data imbalance problem, a sample generation algorithm is proposed to generate plenty of samples online for training the text recognition module by using character features and character labels. Secondly, a weakly supervised character generation algorithm is designed to generate character-level labels from word-level labels for the sample generation algorithm and the training of the text detection module. Finally, in order to spot arbitrarily arranged text correctly, a direction perception module is proposed to perceive the reading direction of text instance. Experiments on several benchmarks show that these designs can significantly improve the performance of text spotter. Specifically, our method outperforms state-of-the-art methods on three public datasets in both text detection and end-to-end text recognition, which fully proves the effectiveness and robustness of our method.

Abstract:
In this talk we present recent progress on large-scale learning of multimodal video representations. We start by presenting VideoBert, a joint model for video and language, repurposing the Bert model for multimodal data. This model achieves state-of-the-art results on zero-shot prediction and video captioning. Next we show how to extend learning from instruction videos to general movies based on cross-modal supervision. We use movie screenplays to learn speech-to-action classifiers and use these classifiers to mine video clips from thousands of hours of movies. We demonstrate performance comparable to or better than fully supervised approaches for action classification. Next we present an approach for video question answering which relies on training from instruction videos and cross-modal supervision with a textual question answer module. We show state-of-the-art results for video question answering without any supervision (zero-shot VQA) and demonstrate that our approach obtains competitive results for pre-training and then fine-tuning on video question answering datasets. We conclude our talk by presenting a recent video feature which is fully transformer based. Our Video Vision Transformer (ViViT) is shown to outperform the state-of-the-art on video classification. Furthermore, it is flexible and allows for a performance / accuracy trade-off based on several different architectures.

Abstract:
In industrial enterprises, effective retrieval of three-dimensional (3D) computer-aided design (CAD) models can greatly save time and cost in new product development and manufacturing; thus, many researchers have focused on it. Recently, many view-based 3D model retrieval methods have been proposed and have achieved state-of-the-art performance. However, most of these methods focus on extracting more discriminative view-level features and effectively aggregating the multi-view images of a 3D model, and the latent relationship among these multi-view images is not fully explored. Thus, we tackle this problem from the perspective of exploiting the relationships between patch features to capture long-range associations among multi-view images. To capture associations among views, in this work, we propose a novel patch convolutional neural network (PCNN) for view-based 3D model retrieval. Specifically, we first employ a CNN to extract patch features of each view image separately. Second, a novel neural network module named PatchConv is designed to exploit intrinsic relationships between neighboring patches in the feature space to capture long-range associations among multi-view images. Then, an adaptive weighted view layer is further embedded into PCNN to automatically assign a weight to each view according to the similarity between each view feature and the view-pooling feature. Finally, a discrimination loss function is employed to extract the discriminative 3D model feature, which consists of softmax loss values generated by the fusion classifier and the specific classifier. Extensive experimental results on two public 3D model retrieval benchmarks, namely ModelNet40 and ModelNet10, demonstrate that our proposed PCNN can outperform state-of-the-art approaches, with mAP values of 93.67% and 96.23%, respectively.
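The adaptive view weighting idea described above can be illustrated with a minimal sketch. The abstract only states that each view is weighted by its similarity to the view-pooling feature; the choices below (element-wise max pooling, cosine similarity, softmax normalization) and the function names `adaptive_view_weights` and `fuse_views` are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def adaptive_view_weights(view_features):
    """Return one weight per view, summing to 1.

    Assumption: the view-pooling feature is the element-wise max over views,
    and weights are a softmax over cosine similarities to that pooled feature.
    """
    pooled = [max(column) for column in zip(*view_features)]
    sims = [cosine(feature, pooled) for feature in view_features]
    peak = max(sims)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features):
    """Weighted sum of the view features using the adaptive weights."""
    weights = adaptive_view_weights(view_features)
    dim = len(view_features[0])
    return [sum(w * f[d] for w, f in zip(weights, view_features)) for d in range(dim)]


if __name__ == "__main__":
    views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(adaptive_view_weights(views))
    print(fuse_views(views))
```

In this toy input the third view coincides with the pooled feature, so it receives the largest weight, while the first two views are equally similar to it and receive equal weights.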

Abstract:
In this work, we present the Text to Scene (T2S) system, which can configure a 3D indoor scene from natural language. Given a text, the system organizes the semantic information it contains into a graph template, completes the graph with a novel graph-based contextual completion method, Contextual ConvE (CConvE), and visualizes the graph by arranging 3D models under an object location protocol. In the experiments, we report qualitative results obtained by the T2S system and a quantitative evaluation of CConvE compared with other state-of-the-art approaches.

Abstract:
Reconstructing point clouds from images would extremely benefit many practical CV applications, such as robotics, automated vehicles, and Augmented Reality. Fueled by the advances of deep neural network, many deep learning frameworks are proposed to address this problem recently. However, these frameworks generally rely on a large amount of labeled training data (e.g., image and point cloud pairs). Although we usually have numerous 2D images, corresponding 3D shapes are insufficient in practice. In addition, most available 3D data covers only a limited amount of classes, which further restricts the models' generalization ability to novel classes. To mitigate these issues, we propose a novel few-shot single-view point cloud generation framework by considering both class-specific and class-agnostic 3D shape priors. Specifically, we abstract each class by a prototype vector that embeds class-specific shape priors. Class-agnostic shape priors are modeled by a set of learnable shape primitives that encode universal 3D shape information shared across classes. Later, we combine the input image with class-specific prototypes and class-agnostic shape primitives to guide the point cloud generation process. Experiments on the popular ModelNet and ShapeNet datasets demonstrate that our method outperforms state-of-the-art methods in the few-shot setting.

Abstract:
Vehicle counting aims to calculate the number of vehicles in congested traffic scenes. Although object detection and crowd counting have made tremendous progress with the development of deep learning, vehicle counting remains a challenging task due to scale variations, viewpoint changes, inconsistent location distributions, diverse visual appearances and severe occlusions. In this paper, we propose a Vehicle Counting Network (VCNet) to alleviate the problems of scale variation and inconsistent spatial distribution in congested traffic scenes. Specifically, VCNet is composed of two major components: (i) To capture multi-scale vehicles across different types and camera viewpoints, an effective multi-scale density map estimation structure is designed by building an attention-based mask refinement module. A multi-branch structure with hybrid dilated convolution blocks is proposed to assign receptive fields to generate multi-scale density maps. To efficiently aggregate multi-scale density maps, the attention-based mask refinement is designed to highlight the vehicle regions, which enables each branch to suppress the scale interference from other branches. (ii) In order to capture the inconsistent spatial distributions, a spatial-awareness block loss (SBL) based on the region-weighted reward strategy is proposed to calculate the loss of different spatial regions, including sparse, congested and occluded regions, independently by dividing the density map into different regions. Extensive experiments conducted on three benchmark datasets, TRANCOS, VisDrone2019 Vehicle and CVCSet, demonstrate that the proposed VCNet outperforms the state-of-the-art approaches in vehicle counting. Moreover, the proposed idea is also applicable to crowd counting, producing competitive results on the ShanghaiTech crowd counting dataset.

Abstract:
Various video understanding tasks (classification, tracking, action detection, etc.) have been extensively studied in the multimedia and computer vision communities in recent years. While these tasks are important, we think that bridging video and language is a more natural and intuitive way to interact with videos. Caption generation and sentence localization are two representative tasks for connecting video and language, and my research is focused on these two tasks. In this extended abstract, I present approaches for tackling each of these tasks by exploiting fine-grained information in videos, together with ideas about how these two tasks can be connected. So far, my work has demonstrated that these two tasks share a common foundation, and by connecting them to form a cycle, video and language can be more closely bridged. Finally, several challenges and future directions will be discussed.

Abstract:
Point clouds have attracted increasing attention. Significant progress has been made in methods for point cloud analysis, which often requires costly human annotation as supervision. To address this issue, we propose a novel self-contrastive learning for self-supervised point cloud representation learning, aiming to capture both local geometric patterns and nonlocal semantic primitives based on the nonlocal self-similarity of point clouds. The contributions are two-fold: on the one hand, instead of contrasting among different point clouds as commonly employed in contrastive learning, we exploit self-similar point cloud patches within a single point cloud as positive samples and otherwise negative ones to facilitate the task of contrastive learning. On the other hand, we actively learn hard negative samples that are close to positive samples for discriminative feature learning, which are sampled conditional on each anchor patch leveraging on the degree of self-similarity. Experimental results show that the proposed method achieves state-of-the-art performance on widely used benchmark datasets for self-supervised point cloud segmentation and transfer learning for classification.
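To make the contrastive setup above concrete, the following is a minimal sketch of an InfoNCE-style loss for one anchor patch. The abstract does not state the exact loss, so InfoNCE, the cosine similarity, the temperature value, and the function name `info_nce_loss` are all assumptions; here the positive would be a self-similar patch from the same point cloud and the negatives would be the remaining (ideally hard) patches.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor embedding.

    anchor, positive: patch embeddings that should be pulled together.
    negatives: list of patch embeddings that should be pushed away.
    Returns -log(exp(s_pos / T) / sum_j exp(s_j / T)), computed stably.
    """
    logits = [_cosine(anchor, positive) / temperature]
    logits += [_cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    # A well-aligned positive with a dissimilar negative gives a small loss.
    print(info_nce_loss(anchor, [1.0, 0.0], [[0.0, 1.0]]))
    # A negative that looks like the anchor (a "hard" negative) raises the loss.
    print(info_nce_loss(anchor, [0.0, 1.0], [[1.0, 0.0]]))
```

The second call shows why hard negatives matter: a negative that is close to the anchor dominates the denominator and produces a much larger loss, and therefore a stronger training signal.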

Abstract:
Predicting block popularity is of crucial importance for data placement in multi-tiered multimedia storage systems. Traditional methods, such as least recently used and exponential smoothing, are commonly employed to predict future block access frequencies and fail to achieve good performance for complex and changing access patterns. Recently, deep neural networks have brought great success to pattern recognition and prediction, which motivates us to introduce deep learning to solve the problem of block popularity prediction. In this paper, we first analyze and verify the temporal and spatial correlations among the multimedia I/O traces. Then, we design a multi-dimension feature to capture such correlations, which serves as the input of the designed deep neural network. A spatial-temporal-sequential neural network (STSNN) and its variants that capture the locality information, time dependency information, and block sequential information are proposed to predict the block popularity. We systematically evaluate our STSNN models against six baseline models from three different categories, i.e., heuristic methods, regression methods and neural network-based methods. Experiment results show that our proposed STSNN models are very promising for predicting block access frequencies under some of Huawei and Microsoft datasets and particularly achieve 2-6 times better performance compared with the baselines in terms of the I/O hit ratio, I/O recall rate and I/O prediction ratio under the Microsoft 64 MB-block dataset.
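For context, the exponential smoothing baseline mentioned above can be sketched in a few lines. This illustrates only the traditional heuristic, not the proposed STSNN; the function names, the smoothing factor `alpha`, and the toy access history are hypothetical.

```python
def exponential_smoothing_forecast(counts, alpha=0.5):
    """Forecast the next access count of a block from its past per-interval counts.

    Applies simple exponential smoothing: level = alpha * x + (1 - alpha) * level.
    """
    if not counts:
        raise ValueError("counts must contain at least one observation")
    level = float(counts[0])
    for count in counts[1:]:
        level = alpha * count + (1 - alpha) * level
    return level


def predict_hot_blocks(history, top_k, alpha=0.5):
    """Rank blocks by forecast popularity and return the top_k block ids.

    history maps a block id to its list of per-interval access counts.
    Ties are broken by block id so the result is deterministic.
    """
    forecasts = {
        block: exponential_smoothing_forecast(counts, alpha)
        for block, counts in history.items()
    }
    ranked = sorted(forecasts, key=lambda block: (-forecasts[block], block))
    return ranked[:top_k]


if __name__ == "__main__":
    history = {"a": [10, 0, 0], "b": [0, 0, 10], "c": [1, 1, 1]}
    print(predict_hot_blocks(history, top_k=2))
```

Because the forecast depends only on each block's own history, a heuristic of this kind cannot use the spatial correlations between blocks that the abstract identifies, which is part of the motivation for a learned model.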

Abstract:
Fashion editing has drawn increasing research interest with its extensive application prospect. Instead of directly manipulating the real fashion item image, it is more intuitive for designers to modify it via the design draft. In this paper, we model design workflows for a novel task of unaligned fashion editing, allowing the user to edit a fashion item through manipulating its corresponding design draft. The challenge lies in the large misalignment between the real fashion item and the design draft, which could severely degrade the quality of editing results. To address this issue, we propose an Unaligned Fashion Editing Network (UFE-Net). A coarsely rendered fashion item is firstly generated from the edited design draft via a translation module. With this as guidance, we align and manipulate the original unedited fashion item via a novel alignment-driven fashion editing module, and then optimize the details and shape via a reference-guided refinement module. Furthermore, a joint training strategy is introduced to exploit the synergy between the alignment and editing tasks. Our UFE-Net enables the edited fashion item to have semantically consistent geometric shape and realistic details to the edited draft in the edited region, as well as to keep the unedited region intact. Experiments demonstrate our superiority over the competing methods on unaligned fashion editing.

Abstract:
Datasets for training crowd counting deep networks are typically heavy-tailed in count distribution and exhibit discontinuities across the count range. As a result, the de facto statistical measures (MSE, MAE) exhibit large variance and tend to be unreliable indicators of performance across the count range. To address these concerns in a holistic manner, we revise processes at various stages of the standard crowd counting pipeline. To enable principled and balanced minibatch sampling, we propose a novel smoothed Bayesian sample stratification approach. We propose a novel cost function which can be readily incorporated into existing crowd counting deep networks to encourage strata-aware optimization. We analyze the performance of representative crowd counting approaches across standard datasets at per strata level and in aggregate. We analyze the performance of crowd counting approaches across standard datasets and demonstrate that our proposed modifications noticeably reduce error standard deviation. Our contributions represent a nuanced, statistically balanced and fine-grained characterization of performance for crowd counting approaches.
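The motivation for per-stratum reporting can be illustrated with a small sketch that computes the mean and standard deviation of absolute error within each count stratum. The fixed bin edges used here are a simplifying assumption; the abstract's smoothed Bayesian stratification is not reproduced, and the function name `per_stratum_errors` is hypothetical.

```python
import bisect
import math


def per_stratum_errors(true_counts, pred_counts, edges):
    """Group images by ground-truth count and report error statistics per stratum.

    edges: sorted count thresholds; stratum i holds counts in [edges[i-1], edges[i]).
    Returns {stratum_index: (mean_abs_error, std_abs_error, num_images)}.
    """
    strata = {}
    for truth, pred in zip(true_counts, pred_counts):
        index = bisect.bisect_right(edges, truth)
        strata.setdefault(index, []).append(abs(truth - pred))
    report = {}
    for index, errors in sorted(strata.items()):
        mean = sum(errors) / len(errors)
        variance = sum((e - mean) ** 2 for e in errors) / len(errors)
        report[index] = (mean, math.sqrt(variance), len(errors))
    return report


if __name__ == "__main__":
    truth = [10, 20, 500, 1000]
    preds = [12, 18, 400, 1300]
    overall_mae = sum(abs(t - p) for t, p in zip(truth, preds)) / len(truth)
    print("overall MAE:", overall_mae)
    print(per_stratum_errors(truth, preds, edges=[100]))
```

In this toy example the overall MAE is dominated by the two dense images, while the sparse stratum has a small and stable error; a single aggregate number hides that difference.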

Abstract:
In this paper we reproduce experimental results presented in our earlier work titled "Describing Subjective Experiment Consistency by p-Value P-P Plot" that was presented in the course of the 28th ACM International Conference on Multimedia. The paper aims at verifying the soundness of our prior results and helping others understand our software framework. We present artifacts that help reproduce tables, figures and all the data derived from raw subjective responses that were included in our earlier work. Using the artifacts we show that our results are reproducible. We invite everyone to use our software framework for subjective responses analyses going beyond reproducibility efforts.

Abstract:
Interpreting model knowledge is an essential topic for improving human understanding of deep black-box models. Traditional methods provide intuitive instance-wise explanations, which allocate importance scores to low-level features (e.g., pixels for images). To adapt to the human way of thinking, one strand of recent research has shifted its spotlight to mining important concepts. However, these concept-based interpretation methods focus on computing the contribution of each discovered concept at the class level and cannot precisely give instance-wise explanations. Besides, they consider each concept as an independent unit and ignore the interactions among concepts. To this end, in this paper, we propose a novel COncept-based NEighbor Shapley approach (dubbed CONE-SHAP) to evaluate the importance of each concept by considering its physical and semantic neighbors, and to interpret model knowledge with both instance-wise and class-wise explanations. Thanks to this design, the interactions among concepts in the same image are fully considered. Meanwhile, the computational complexity of the Shapley Value is reduced from exponential to polynomial. Moreover, for a more comprehensive evaluation, we further propose three criteria to quantify the rationality of the allocated contributions for the concepts, including coherency, complexity, and faithfulness. Extensive experiments and ablations demonstrate that our CONE-SHAP algorithm outperforms existing concept-based methods and simultaneously provides precise explanations for each instance and class.
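To show why the exact Shapley Value is exponential in the number of concepts, here is a minimal sketch of the standard exact computation over all coalitions. It is the textbook definition, not the CONE-SHAP approximation; the toy value function and the names `shapley_values` and `toy_value` are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values by enumerating every coalition of the other players.

    players: list of hashable player (concept) ids.
    value: function mapping a frozenset of players to a real-valued score.
    Cost grows as 2^(n-1) value evaluations per player.
    """
    n = len(players)
    phi = {}
    for player in players:
        others = [q for q in players if q != player]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {player}) - value(coalition))
        phi[player] = total
    return phi


def toy_value(coalition):
    """Toy model score: 1 only when concepts 'a' and 'b' are both present."""
    return 1.0 if {"a", "b"} <= coalition else 0.0


if __name__ == "__main__":
    print(shapley_values(["a", "b", "c"], toy_value))
```

In the toy game the interacting concepts "a" and "b" share the credit equally and "c" receives none. Restricting the coalitions considered for each concept to a bounded set of neighbors, as the abstract describes, is one way to avoid enumerating all subsets.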

Abstract:
Most of the current supervised automatic music transcription (AMT) models lack the ability to generalize. This means that they have trouble transcribing real-world music recordings from diverse musical genres that are not presented in the labelled training data. In this paper, we propose a semi-supervised framework, ReconVAT, which solves this issue by leveraging the huge amount of available unlabelled music recordings. The proposed ReconVAT uses reconstruction loss and virtual adversarial training. When combined with existing U-net models for AMT, ReconVAT achieves competitive results on common benchmark datasets such as MAPS and MusicNet. For example, in the few-shot setting for the string part version of MusicNet, ReconVAT achieves F1-scores of 61.0% and 41.6% for the note-wise and note-with-offset-wise metrics respectively, which translates into an improvement of 22.2% and 62.5% compared to the supervised baseline model. Our proposed framework also demonstrates the potential of continual learning on new data, which could be useful in real-world applications whereby new data is constantly available.

Abstract:
Research on smart device privacy has consistently highlighted that privacy is an important concern for users, yet users fail to act on their concerns. While this discrepancy between user perceptions and actions has been consistently reported, there is currently a limited understanding of why this is the case or how the situation can be ameliorated. This paper systematically studies how visualizations in privacy assistants can improve the situation, reporting on two studies that explore users' privacy perceptions in smart device ecosystems. The first study shows that displaying device location and data type reduces the users' privacy perceptions. Participants also weigh the use of media such as online news as a source to inform users about the possible inferences. The second study analyzes participants' preferences for visualizing smart device information and privacy policies using augmented reality. Through these two studies, we derive insights and guidelines on how to design effective privacy assistants and how to improve users' knowledge of the risks associated with data disclosure in smart home scenarios.

Abstract:
Open-domain dialogue generation in natural language processing (NLP) is by default a pure-language task, which aims to satisfy the human need for daily communication on open-ended topics by producing related and informative responses. In this paper, we point out that hidden images, named visual impressions (VIs), can be explored from text-only data to enhance dialogue understanding and help generate better responses. Besides, the semantic dependency between a dialogue post and its response is complicated, e.g., few word alignments and some topic transitions. Therefore, their visual impressions are not shared, and it is more reasonable to integrate the response visual impressions (RVIs) into the decoder, rather than the post visual impressions (PVIs). However, neither the response nor its RVIs are given directly at test time. To handle the above issues, we propose a framework to explicitly construct VIs based on pure-language dialogue datasets and utilize them for better dialogue understanding and generation. Specifically, we obtain a group of images (PVIs) for each post based on a pre-trained word-image mapping model. These PVIs are used in a co-attention encoder to get a post representation with both visual and textual information. Since the RVIs are not provided during testing, we design a cascade decoder that consists of two sub-decoders. The first sub-decoder predicts the content words in the response and applies the word-image mapping model to get the corresponding RVIs. Then, the second sub-decoder generates the response based on the post and RVIs. Experimental results on two open-domain dialogue datasets show that our proposed approach achieves superior performance over competitive baselines in terms of fluency, relatedness, and diversity.

Abstract:
Most deep unsupervised hashing methods first construct pairwise semantic similarity information and then learn to map images into compact hash codes while preserving the similarity structure, which implies that the quality of hash codes highly depends on the constructed semantic similarity structure. However, since the features of images for each kind of semantics usually scatter in high-dimensional space with unknown distribution, previous methods could introduce a large number of false positives and negatives for boundary points of distributions in the local semantic structure based on pairwise cosine distances. To address this limitation, we propose a general distribution-based metric to depict the pairwise distance between images. Specifically, each image is characterized by its random augmentations, which can be viewed as samples from the corresponding latent semantic distribution. Then we estimate the distances between images by calculating the sample distribution divergence of their semantics. By applying this new metric to deep unsupervised hashing, we come up with Distribution-based similArity sTructure rEconstruction (DATE). DATE can generate more accurate semantic similarity information by using non-parametric ball divergence. Moreover, DATE explores both semantic-preserving learning and contrastive learning to obtain high-quality hash codes. Extensive experiments on several widely-used datasets validate the superiority of DATE.

Abstract:
Salient object detection is a pixel-level dense prediction task that highlights the prominent objects in a scene. Recently, the U-Net framework has been widely used, in which successive convolution and pooling operations generate multi-level features that are complementary to each other. Since high-level features contribute more to performance, we propose a triplet transformer embedding module to enhance them by learning long-range dependencies across layers. It is the first to use three transformer encoders with shared weights to enhance multi-level features. By further designing a scale adjustment module to process the input, devising a three-stream decoder to process the output, and attaching depth features to color features for multi-modal fusion, the proposed triplet transformer embedding network (TriTransNet) achieves state-of-the-art performance in RGB-D salient object detection and pushes the performance to a new level. Experimental results demonstrate the effectiveness of the proposed modules and the competitiveness of TriTransNet.

Abstract:
Facial micro-expression (FME) refers to a brief spontaneous facial movement that can reveal a person's genuine emotion. One challenge in facial micro-expression research is the lack of data. Fortunately, generative deep neural network models can assist in the creation of desired images. However, the issues for micro-expressions are that the facial variations are too subtle to capture and that the limited training data may make feature extraction difficult. To address these issues, we developed a deep motion retargeting and transfer learning based facial micro-expression generation model (DMT-FMEG). First, to capture subtle variations, we employed a deep motion retargeting (DMR) network that can learn keypoints in an unsupervised manner, estimate motions, and generate desired images. Second, to enhance the feature extraction ability, we applied deep transfer learning (DTL) by borrowing knowledge from macro-expression images. We evaluated our method on three datasets, CASME II, SMIC, and SAMM, and found that it showed satisfactory results on all of them. With this method, we won second place in the generation task of the FME 2021 challenge.

Abstract:
Micro-expressions (MEs) are significant and effective clues to reveal the true feelings and emotions of human beings, and thus ME analysis is widely used in different fields such as medical diagnosis, interrogation and security. However, it is extremely difficult to elicit and label MEs, resulting in a lack of sufficient ME data for ME analysis. To address this challenge, and inspired by current face generation technology, in this paper we introduce a Generative Adversarial Network based on fine-grained Action Unit (AU) modulation to generate ME sequences (FAMGAN). Specifically, after comprehensively analyzing the factors that lead to inaccurate AU value detection, we perform fine-grained AU modulation, which includes carefully eliminating the various noises and dealing with the asymmetry of AU intensity. Additionally, we incorporate super-resolution into our model to enhance the quality of the generated images. Through experiments, we show that the system achieves very competitive results on the Micro-Expression Grand Challenge (MEGC2021).

Abstract:
The quality of video representation directly determines the performance of video-related tasks, for both understanding and generation. In this paper, we propose a single-modality pretrained feature fusion technique composed of a multi-view feature extraction method and a multi-modality feature fusion strategy. We conduct comprehensive ablation studies on the MSR-VTT dataset to demonstrate the effectiveness of the proposed method, which surpasses the state-of-the-art methods on both the MSR-VTT and VATEX datasets. We further propose a multi-modality pretrained model finetuning technique and a dataset augmentation scheme to improve the model's generalization capability. Based on these two pretraining techniques and the dataset augmentation scheme, we won first place in the video captioning track of the MM21 pretraining for video understanding challenge.

Abstract:
Eye contact detection in group conversations is the key to developing artificial mediators that can understand and interact with a group. In this paper, we propose to model a group's appearance and behavioral features to perform eye contact detection for each participant in the conversation. Specifically, we extract the participants' appearance features at the detection moment, and extract the participants' behavioral features based on their motion history image, which encodes the participants' body movements within a small time window before the detection moment. To obtain powerful representative features from these images, we propose to train a Convolutional Neural Network (CNN) to model them. A set of relevant features is obtained from the network, which achieves an accuracy of 0.60 on the validation set in the eye contact detection challenge in ACM MM 2021. Furthermore, our experimental results also demonstrate that making use of both the participants' appearance and behavior features can lead to higher accuracy at eye contact detection than using only one of them.
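
The motion history image mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame-differencing rule, the function names `update_mhi` and `motion_history_image`, and the parameters `tau` and `threshold` are assumptions chosen for the example, and frames are plain nested lists of grayscale values.

```python
def update_mhi(mhi, prev_frame, frame, tau=5, threshold=10):
    """Return the motion history image after observing one new frame.

    A pixel whose intensity changed by more than `threshold` is set to
    `tau`; every other pixel decays by 1 and is clamped at 0. Both
    parameters are illustrative assumptions.
    """
    updated = []
    for y, row in enumerate(frame):
        new_row = []
        for x, value in enumerate(row):
            moved = abs(value - prev_frame[y][x]) > threshold
            new_row.append(tau if moved else max(0, mhi[y][x] - 1))
        updated.append(new_row)
    return updated


def motion_history_image(frames, tau=5, threshold=10):
    """Fold a short window of grayscale frames into a single image.

    Recent motion ends up brighter than older motion, so one image
    summarizes the movement inside the window.
    """
    height, width = len(frames[0]), len(frames[0][0])
    mhi = [[0] * width for _ in range(height)]
    for prev_frame, frame in zip(frames, frames[1:]):
        mhi = update_mhi(mhi, prev_frame, frame, tau, threshold)
    return mhi
```

In this sketch, a pixel that moved in the most recent frame holds `tau`, one that moved a frame earlier holds `tau - 1`, and so on, which is what lets a single image carry a short movement history into a CNN.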

Abstract:
Data hiding is a widely used approach for proving ownership through blind watermarking. Deep learning has been widely used in data hiding, for which inserting an attack simulation layer (ASL) after the watermarked image has been widely recognized as the most effective approach for improving the pipeline's robustness against distortions. Despite its wide usage, the gain in robustness from ASL is usually interpreted through the lens of augmentation, while our work explores this gain from a new perspective by disentangling the forward and backward propagation of such an ASL. We find that the main influential component is forward propagation rather than backward propagation. This observation motivates us to use a forward ASL to make the pipeline compatible with non-differentiable and/or black-box distortions, such as lossy (JPEG) compression and Photoshop effects. Extensive experiments demonstrate the efficacy of our simple approach.

Abstract:
Feature pyramid networks (FPN) are widely exploited for multi-scale feature fusion in existing advanced object detection frameworks. Numerous previous works have developed various structures for bidirectional feature fusion, all of which are shown to improve detection performance effectively. We observe that these complicated network structures require feature pyramids to be stacked in a fixed order, which introduces longer pipelines and reduces inference speed. Moreover, semantics from non-adjacent levels are diluted in the feature pyramid, since the local fusion operation merges only features at adjacent pyramid levels, one level at a time. To address these issues, we propose a novel architecture named RCNet, which consists of a Reverse Feature Pyramid (RevFP) and a Cross-scale Shift Network (CSN). RevFP utilizes local bidirectional feature fusion to simplify the bidirectional pyramid inference pipeline. CSN directly propagates representations to both adjacent and non-adjacent levels to make multi-scale features more correlated. Extensive experiments on the MS COCO dataset demonstrate that RCNet consistently brings significant improvements over both one-stage and two-stage detectors with only slight extra computational overhead. In particular, RetinaNet is boosted to 40.2 AP, which is 3.7 points higher than the baseline, by replacing FPN with our proposed model. On COCO test-dev, RCNet achieves very competitive performance with a single-model single-scale 50.5 AP.

Abstract:
SUMAC 2021 is the third edition of the workshop on Structuring and Understanding of Multimedia heritAge Contents. It is held in Chengdu, China on October 20th, 2021 and is co-located with the 29th ACM International Conference on Multimedia. Its objective is to present and discuss the latest and most significant trends and challenges in the analysis, structuring and understanding of multimedia contents dedicated to the valorization of heritage, with the emphasis on the unlocking of and access to the big data of the past. A representative scope of Computer Science methodologies dedicated to the processing of multimedia heritage contents and their exploitation is covered by the works presented, with the ambition of advancing and raising awareness about this fully developing research field.

Abstract:
In this workshop, we address trustworthy AI issues for Multimedia Computing. We aim to bring together researchers working on the trustworthy aspects of Multimedia Computing and to facilitate discussions on injecting trust into multimedia, in order to develop trustworthy AI techniques that are reliable and acceptable to multimedia researchers and practitioners. Our scope lies at the intersection of multimedia, computer vision and trustworthy AI, including Explainability, Robustness and Safety, Data Privacy, Accountability and Transparency, and Fairness.

Abstract:
Chinese character style transfer is a very challenging problem because, compared with English letters, the glyph shapes and underlying structures are complex and the number of existing characters is large. Moreover, the handwriting of calligraphy masters has more irregular strokes and is difficult to obtain in real-world scenarios. Recently, several GAN-based methods have been proposed for font synthesis, but some of them require numerous reference data and others have cumbersome preprocessing steps that divide the character into different parts to be learned and transferred separately. In this paper, we propose a simple but powerful end-to-end Chinese calligraphy font generation framework, ZiGAN, which does not require any manual operation or redundant preprocessing to generate fine-grained target-style characters with few-shot references. To be specific, a few paired samples from different character styles are leveraged to attain fine-grained correlation between the structures underlying different glyphs. To capture valuable style knowledge in the target and strengthen the coarse-grained understanding of character content, we utilize multiple unpaired samples to align the feature distributions belonging to different character styles. By doing so, only a few target Chinese calligraphy characters are needed to generate the expected style-transferred characters. Experiments demonstrate that our method has state-of-the-art generalization ability in few-shot Chinese character style transfer.

Abstract:
Video has grown to be one of the largest media on the Internet. E-commerce platforms like Alibaba need to process millions of videos across multiple modalities (e.g., visual, audio, image, and text) and for a variety of tasks (e.g., retrieval, tagging, and summarization) every day. In this work, we aim to develop a once-and-for-all pretraining technique for diverse modalities and downstream tasks. To achieve this, we make the following contributions: (1) We propose a self-supervised multi-modal co-training framework. It takes cross-modal pseudo-label consistency as the supervision and can jointly learn representations of multiple modalities. (2) We introduce several novel techniques (e.g., sliding-window subset sampling, coarse-to-fine clustering, fast spatial-temporal convolution, and parallel data transmission and processing) to optimize the training process, making stable billion-scale training feasible. (3) We construct a large-scale multi-modal dataset consisting of 1.4 billion videos (~0.5 PB) and train our framework on it. The training takes only 4.6 days on an in-house cluster of 256 GPUs, and it simultaneously produces pretrained video, audio, image, motion, and text networks. (4) Finetuning from our pretrained models, we obtain significant performance gains and faster convergence on diverse multimedia tasks at Alibaba. Furthermore, we also validate the learned representation on public datasets. Despite the domain gap between our commodity-centric pretraining and the action-centric evaluation data, we show superior results against state-of-the-art methods.

Abstract:
Comprehensive understanding of key players and actions in multiplayer sports broadcast videos is a challenging problem. Unlike news or finance videos, sports videos have limited text. While both action recognition for multiplayer sports and detection of players have seen robust research, understanding contextual text in video frames remains one of the most impactful avenues of sports video understanding. In this work, we study extremely accurate semantic text detection and recognition in sports clocks, and the challenges therein. We observe unique properties of sports clocks that make it hard to use general-purpose pre-trained detectors and recognizers to understand the text accurately enough for it to be aligned to external knowledge. We propose a novel distant supervision technique to automatically build sports clock datasets. With suitable data augmentations, combined with any state-of-the-art text detection and recognition model architectures, we extract extremely accurate semantic text. Finally, we share our computational architecture pipeline for scaling this system in an industrial setting and propose a robust dataset to validate our results.

Abstract:
Recently, smart devices equipped with microphones have become increasingly popular in people's lives. However, when users type on a keyboard near devices with microphones, the acoustic signals generated by different keystrokes may leak the user's private information. This paper proposes a robust side-channel attack scheme to infer keystrokes on a nearby keyboard, leveraging the smart devices' microphones. To address the challenge of non-cooperative attacking environments, we propose an efficient scheme to estimate the relative position between the microphones and the keyboard, and extract two robust features from the acoustic signals to alleviate the impact of different victims and keyboards. As a result, we can realize the side-channel attack through acoustic signals, regardless of the exact location of the microphones, the victims, and the type of keyboard. We implement the proposed scheme on a commercial smartphone and conduct extensive experiments to evaluate its performance. Experimental results show that the proposed scheme achieves good performance in predicting keyboard input under various conditions. Overall, we can correctly identify 91.2% of keystrokes with 10-fold cross-validation. When predicting keystrokes from unknown victims, the attack can obtain a Top-5 accuracy of 91.52%. Furthermore, the Top-5 accuracy of predicting keystrokes can reach 72.25% when the victims and keyboards are both unknown. When predicting meaningful contents, we can obtain a Top-5 accuracy of 96.67% for the words entered by the victim.
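
For readers unfamiliar with the Top-5 figures quoted above, the metric can be computed as in the following sketch. It assumes the usual definition (a prediction counts as correct when the true key appears among the k highest-ranked candidates); the function name `top_k_accuracy` and the input layout are hypothetical and are not taken from the paper.

```python
def top_k_accuracy(ranked_predictions, true_keys, k=5):
    """Fraction of keystrokes whose true key is among the top-k guesses.

    ranked_predictions: one list of candidate keys per keystroke,
        ordered from most to least likely (assumed layout).
    true_keys: the key that was actually pressed for each keystroke.
    """
    if not true_keys or len(ranked_predictions) != len(true_keys):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_keys)
        if truth in ranked[:k]
    )
    return hits / len(true_keys)
```

With k=1 this reduces to ordinary accuracy, so Top-5 results are always at least as high as Top-1 results on the same predictions.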

Abstract:
We address the problem of reconstructing a 3D human face from multi-view facial images using Structure-from-Motion (SfM) based on deep neural networks. While recent learning-based monocular methods have shown impressive results for 3D facial reconstruction, the single-view setting is easily affected by depth ambiguities and poor face poses. In this paper, we propose a novel unsupervised 3D face reconstruction architecture that leverages multi-view geometry constraints to train accurate face pose and depth maps. Facial images from multiple perspectives of each 3D face model are input to train the network. Multi-view geometry constraints are fused into the unsupervised network by establishing loss constraints from spatial and spectral perspectives. To give the reconstructed 3D face more detail, a facial landmark detector is used to acquire massive facial information to constrain face pose and depth estimation. By minimizing the massive landmark displacement distance through bundle adjustment, an accurate 3D face model can be reconstructed. Extensive experiments demonstrate the superiority of our proposed approach over other methods.

Abstract:
Kandinsky Mobile is a mobile device-based interactive artwork that generates and displays the social discussion landscape associated with a social media anchor post using a collection of colorful circles and concentric circles. It draws inspiration from the famous abstract geometric art forms of Russian painter Wassily Kandinsky (1866-1944). Intuitively, a circle and a concentric circle represent a social comment and a collection of comments in a discussion thread, respectively. The artwork aims to facilitate user-friendly and effective understanding and visualization of large volumes of comments associated with an anchor post.

Abstract:
Recently, performing semantic editing of an image by modifying a scene graph has been proposed to support high-level image manipulation, and it plays an important role in image generation. However, existing methods are all based on bounding boxes, and they suffer from the bounding box constraint. First, a bounding box often involves other instances (e.g., objects or environments) which do not need to be modified, but existing methods manipulate all the contents included in the bounding box. Second, prior methods fail to support adding instances when the bounding box of the target instance cannot be provided. To address these two issues, we propose a novel bounding box free approach, which consists of two parts: a Local Bounding Box Free (Local-BBox-Free) Mask Generation and a Global Bounding Box Free (Global-BBox-Free) Instance Generation. The first part relieves the model of its reliance on bounding boxes by generating the mask of the target instance to be manipulated without using the target instance's bounding box. This enables our method to be the first to support fully functional image manipulation using scene graphs, including adding, removing, replacing and repositioning instances. The second part is designed to synthesize the target instance directly from the generated mask and then paste it back into the inpainted original image using the generated mask, which preserves the unchanged part to the largest extent and precisely controls the target instance generation. Extensive experiments on Visual Genome and COCO-Stuff demonstrate that our model significantly surpasses the state-of-the-art both quantitatively and qualitatively.

Abstract:
Object detection based on pre-trained deep neural networks (DNNs) has achieved impressive performance and enabled many applications. However, DNN-based object detectors have been shown to be vulnerable to physical adversarial attacks. Although recent efforts have been made to defend against these attacks, they either rely on strong assumptions or become less effective with pre-trained object detectors. In this paper, we propose adversarial pixel masking (APM), a defense against physical attacks that is designed specifically for pre-trained object detectors. APM does not require any assumptions beyond the "patch-like" nature of a physical attack and can work with pre-trained object detectors of different architectures and weights, making it a practical solution in many applications. We conduct extensive experiments, and the empirical results show that APM can significantly improve model robustness without significantly degrading clean performance.

Abstract:
Despite recent significant progress on generative models, context-rich text-to-image synthesis depicting multiple complex objects is still non-trivial. The main challenges lie in the ambiguous semantics of a complex description and the intricate scene of an image with various objects, different positional relationships and diverse appearances. To address these challenges, we propose R-GAN, which can generate reasonable images according to the given text in a human-like way. Specifically, just as humans first find and settle the essential elements to create a simple sketch, we first capture a monolithic-structural text representation by building a scene graph to find the essential semantic elements. Then, based on this representation, we design a bounding box generator to estimate the layout with the position and size of the target objects, followed by a shape generator that draws a fine-detailed shape for each object. Unlike previous work, which only generates coarse shapes blindly, we introduce a coarse-to-fine shape generator based on a shape knowledge base. Finally, to complete the image synthesis, we propose a multi-modal geometry-aware spatially-adaptive generator conditioned on the monolithic-structural text representation and the geometry-aware map of the shapes. Extensive experiments on the real-world dataset MSCOCO show the superiority of our method in terms of both quantitative and qualitative metrics.

Abstract:
The goal of this work is to learn discriminative visual representations for lip reading without access to manual text annotation. Recent advances in cross-modal self-supervised learning have shown that the corresponding audio can serve as a supervisory signal to learn effective visual representations for lip reading. However, existing methods only exploit the natural synchronization of the video and the corresponding audio. We find that both video and audio are actually composed of speech-related information, identity-related information, and modality information. To make the visual representations (i) more discriminative for lip reading and (ii) indiscriminate with respect to identities and modalities, we propose a novel self-supervised learning framework called Adversarial Dual-Contrast Self-Supervised Learning (ADC-SSL), which goes beyond previous methods by explicitly forcing the visual representations to be disentangled from speech-unrelated information. Experimental results clearly show that the proposed method outperforms state-of-the-art cross-modal self-supervised baselines by a large margin. Besides, ADC-SSL can outperform its supervised counterpart without any fine-tuning.

Abstract:
The accurate and robust prediction of short-term solar power generation is significant for the management of modern smart grids, where solar power has become a major energy source due to its green and economical nature. However, solar yield prediction can be difficult to conduct in the real world, where hardware and network issues can make the sensors unreachable. This missing-data problem is so prevalent that it degrades the performance of deployed prediction models and can even cause model execution to fail. In this paper, we propose a novel temporal multi-modal variational auto-encoder (TMMVAE) model to enhance the robustness of short-term solar power yield prediction with missing data. It can impute the missing values in time-series sensor data and reconstruct them by consolidating multi-modality data, which then facilitates more accurate solar power yield prediction. TMMVAE can be deployed efficiently with an end-to-end framework. The framework is verified on our real-world testbed on campus. The results of extensive experiments show that our proposed framework can significantly improve the imputation accuracy when the inference data is severely corrupted, and can hence dramatically improve the robustness of short-term solar energy yield forecasting.

Abstract:
The visual modality has recently attracted extensive attention in the fields of knowledge graphs and multimedia because a lot of real-world knowledge is multi-modal in nature. However, it is currently unclear to what extent the visual modality can improve the performance of knowledge graph tasks over unimodal models, and treating structural and visual features equally may encode too much irrelevant information from images. In this paper, we probe the utility of the auxiliary visual context from a knowledge graph representation learning perspective by designing a Relation Sensitive Multi-modal Embedding model, RSME for short. RSME can automatically encourage or filter the influence of visual context during representation learning. We also examine the effect of different visual feature encoders. Experimental results validate the superiority of our approach compared to the state-of-the-art methods. On the basis of in-depth analysis, we conclude that under appropriate circumstances models are capable of leveraging the visual input to generate better knowledge graph embeddings and vice versa.

Abstract:
This paper proposes a novel approach for Sketch-Based Image Retrieval (SBIR), for which the key is to bridge the gap between sketches and photos in terms of data representation. Inspired by the channel-wise attention explored in recent years, we present a Domain-Aware Squeeze-and-Excitation (DASE) network, which seamlessly incorporates the prior knowledge of whether a sample is a sketch or a photo into the SE module and makes the SE module capable of emphasizing appropriate channels according to the domain signal. Accordingly, the proposed network can switch its mode to achieve a better domain feature with lower intra-class discrepancy. Moreover, while previous works simply focus on minimizing intra-class distance and maximizing inter-class distance, we introduce a loss function, named Multiplicative Euclidean Margin Softmax (MEMS), which introduces a multiplicative Euclidean margin into the feature space and ensures that the maximum intra-class distance is smaller than the minimum inter-class distance. This facilitates learning a highly discriminative feature space and ensures a more accurate image retrieval result. Extensive experiments are conducted on two widely used SBIR benchmark datasets. Our approach achieves better results on both datasets, surpassing the state-of-the-art methods by a large margin.

Abstract:
In our MM'20 paper, we presented a Kalman filter-based approach for prediction of head motion in 6DoF. The proposed approach was employed in our cloud-based volumetric video streaming system to reduce the interaction latency experienced by the user. In this companion paper, we present the dataset collected for our experiments and our simulation framework that reproduces the obtained experimental results. Our implementation is freely available on GitHub to facilitate further research.
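
As a rough illustration of Kalman filter-based motion prediction, the sketch below tracks a single pose component (for example, one translation axis) with a constant-velocity model and extrapolates it over a look-ahead horizon. The abstract does not state the authors' state model or noise settings, so the constant-velocity assumption, the class name `ConstantVelocityKalman`, the helper `run_filter`, and the variance values are all hypothetical; orientation components would additionally need proper rotation handling.

```python
class ConstantVelocityKalman:
    """1-D Kalman filter with state [position, velocity] (assumed model)."""

    def __init__(self, process_var=1e-3, measurement_var=1e-2):
        self.pos, self.vel = 0.0, 0.0
        self.cov = [[1.0, 0.0], [0.0, 1.0]]  # state covariance P
        self.q = process_var                 # process noise added per step
        self.r = measurement_var             # measurement noise variance

    def predict(self, dt):
        """Advance the state by dt: P <- F P F^T + Q with F = [[1, dt], [0, 1]]."""
        a, b = self.cov[0]
        c, d = self.cov[1]
        self.pos += self.vel * dt
        self.cov = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]

    def update(self, measured_pos):
        """Correct the state with a position measurement (H = [1, 0])."""
        a, b = self.cov[0]
        c, d = self.cov[1]
        s = a + self.r                       # innovation variance
        k_pos, k_vel = a / s, c / s          # Kalman gain
        residual = measured_pos - self.pos
        self.pos += k_pos * residual
        self.vel += k_vel * residual
        self.cov = [
            [(1 - k_pos) * a, (1 - k_pos) * b],
            [c - k_vel * a, d - k_vel * b],
        ]

    def look_ahead(self, horizon):
        """Extrapolate the position `horizon` seconds into the future."""
        return self.pos + self.vel * horizon


def run_filter(measurements, dt):
    """Feed equally spaced position samples through a fresh filter."""
    kf = ConstantVelocityKalman()
    for z in measurements:
        kf.predict(dt)
        kf.update(z)
    return kf
```

In a streaming setting, `look_ahead` would be called with the expected motion-to-photon latency so that content is rendered for where the head is predicted to be rather than where it was last measured.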

Abstract:
To augment TV shows in post-production, we propose a novel solution to uncalibrated small-motion camera tracking in a dynamic scene that simultaneously reconstructs the sparse 3D scene and computes the camera poses and focal lengths of each frame. The critical elements of our approach are a robust image feature tracking strategy for dynamic scenes, followed by automatic local-window frame slicing and local and global bundle adjustment optimization initialized by a homography-based uncalibrated relative rotation solver. The proposed method allows us to add virtual objects (elements) into the reconstructed 3D scene and then composite them back into the original shot so that they match the perspective and appear seamless.

Abstract:
Recently, learning-based lossy image compression has achieved notable breakthroughs owing to its excellent modeling and representation learning capabilities. Compared to traditional image codecs based on block partitioning and transforms, these data-driven approaches with artificial-neural-network (ANN) structures produce significantly different distortion patterns. Efficient objective image quality assessment (IQA) measures play a key role in the quantitative evaluation and optimization of image compression algorithms. In this paper, we construct a large-scale image database for quality assessment of compressed images. In the proposed database, 100 reference images are compressed to different quality levels by 10 codecs, involving both traditional and learning-based codecs. Based on this database, we present a benchmark for existing IQA methods and reveal the challenges of IQA on learning-based compression distortions. Furthermore, we develop an objective quality assessment framework in which a self-attention module is adopted to leverage multi-level features from reference and compressed images. Extensive experiments demonstrate the superiority of our method in terms of prediction accuracy. The subjective and objective study of various compressed images also sheds light on the optimization of image compression methods.

Abstract:
We present a vision-based system for real-time pose tracking of rigid objects. It can not only estimate a single pose in six degrees of freedom (6DoF) but is also suitable for recovering compound movements. The system comprises a monocular camera and a series of 3D-printed platonic solids with square fiducial markers attached to each face, which is easy to set up and extend; extra cameras can be incorporated into the pipeline to meet different requirements. The system realizes object tracking by estimating the pose of the fiducial platonic solid (FPS), which can be fixed onto the surface of the target object. Platonic solids of different sizes and shapes can be combined to adapt to different application scenarios, a strategy that gives our system considerable flexibility and applicability. To track the motion of the fiducial platonic solid accurately, we introduce a robust algorithm that combines the fiducial constraint and the statistical constraint and is able to handle illumination changes, motion blur, and partial occlusion. We evaluate the performance of the proposed approach with qualitative and quantitative experiments. In addition, a couple of mixed reality (MR) applications are developed to demonstrate the effectiveness of the system.

Abstract:
Image retrieval with text feedback is an emerging research topic with the objective of integrating inputs from multiple modalities as queries. In this paper, queries contain a reference image plus text feedback that describes modifications between this image and the desired image. Existing work on this task mainly focuses on designing a new fusion network to compose the image and text. Still, little research pays attention to the modality gap caused by the inconsistent distribution of features from different modalities, which dramatically influences the feature fusion and similarity learning between queries and the desired image. We propose a Distribution-Aligned Text-based Image Retrieval (DATIR) model, which consists of attention mutual information maximization and hierarchical mutual information maximization, to bridge this gap by increasing non-linear statistical dependencies between representations of different modalities. More specifically, attention mutual information maximization narrows the modality gap between different input modalities by maximizing mutual information between the text representation and its semantically consistent representation captured from the reference image and the desired image by the difference transformer. Hierarchical mutual information maximization aligns the distributions of features from the image modality and the fusion modality by estimating mutual information between a single-layer representation in the fusion network and the multi-level representations in the desired image encoder. Extensive experiments on three large-scale benchmark datasets demonstrate that we can bridge the modality gap between different modalities and achieve state-of-the-art retrieval performance.

Abstract:
This paper investigates the 4-D light field (LF) reconstruction from 2-D measurements captured by the coded aperture camera. To tackle such an ill-posed inverse problem, we propose a cycle-consistent reconstruction network (CR-Net). To be specific, based on the intrinsic linear imaging model of the coded aperture, CR-Net reconstructs an LF through progressively eliminating the residuals between the projected measurements from the reconstructed LF and input measurements. Moreover, to address the crucial issue of extracting representative features from high-dimensional LF data efficiently and effectively, we formulate the problem in a probability space and propose to approximate a posterior distribution of a set of carefully-defined LF processing events, including both layer-wise spatial-angular feature extraction and network-level feature aggregation. Through droppath from a densely-connected template network, we derive an adaptively learned spatial-angular fusion strategy, which is sharply contrasted with existing manners that combine spatial and angular features empirically. Extensive experiments on both simulated measurements and measurements by a real coded aperture camera demonstrate the significant advantage of our method over state-of-the-art ones, i.e., our method improves the reconstruction quality by 4.5 dB.
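
The linear imaging model and the idea of progressively eliminating the residual between projected and input measurements can be illustrated with a classical Landweber-style iteration. This is only a sketch under stated assumptions, not CR-Net: the names `project`, `residual_norm`, and `refine` are hypothetical, each angular view is flattened to a list of pixels, and the learned update of the network is replaced by a plain gradient step.

```python
def project(codes, lf):
    """Linear coded-aperture model: y[m][p] = sum_u codes[m][u] * lf[u][p]."""
    num_views, num_pixels = len(lf), len(lf[0])
    return [[sum(row[u] * lf[u][p] for u in range(num_views))
             for p in range(num_pixels)] for row in codes]

def residual_norm(codes, lf, meas):
    """Euclidean norm of (input measurements - projected measurements)."""
    proj = project(codes, lf)
    return sum((meas[m][p] - proj[m][p]) ** 2
               for m in range(len(meas)) for p in range(len(meas[0]))) ** 0.5

def refine(codes, meas, num_views, steps=50):
    """Start from an all-zero light field and repeatedly back-project the residual."""
    num_pixels = len(meas[0])
    lf = [[0.0] * num_pixels for _ in range(num_views)]
    # Step size 1 / ||codes||_F^2 is a conservative choice that cannot diverge.
    step = 1.0 / sum(c * c for row in codes for c in row)
    for _ in range(steps):
        proj = project(codes, lf)
        res = [[meas[m][p] - proj[m][p] for p in range(num_pixels)]
               for m in range(len(meas))]
        for u in range(num_views):
            for p in range(num_pixels):
                lf[u][p] += step * sum(codes[m][u] * res[m][p]
                                       for m in range(len(meas)))
    return lf

# Toy example (assumed values): 3 angular views, 2 pixels, 2 coded measurements.
codes = [[1.0, 0.5, 0.2], [0.3, 1.0, 0.6]]
true_lf = [[0.2, 0.8], [0.5, 0.1], [0.9, 0.4]]
meas = project(codes, true_lf)
estimate = refine(codes, meas, num_views=3)
```

With fewer measurements than views the problem stays ill-posed, so the iteration only guarantees a small measurement residual, not recovery of `true_lf`; that gap is what a learned prior is meant to fill.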

Abstract:
While Transformers have recently yielded impressive results for video classification on large datasets, simpler models without the transformer architecture can be promising for small datasets. In this paper, we propose three major techniques to improve feature quality and another three to alleviate overfitting, in an attempt to make lightweight models achieve higher performance. In particular, we enhance features of Image Flow by combining temporal information, multi-level features of CNNs, and Text embedding. We alleviate overfitting by removing redundant modalities, fine-tuning the dropout rate, and augmenting data. In the 2021 Tencent Advertisement Algorithm Competition, the baseline model achieved a GAP score of 0.8019 offline with our strategies. It is worth mentioning that our design works well with the 10-fold method, which produces our final submitted model with a GAP score of 0.8210 online, ranking 5th among 287 teams. In addition, our solution is among the fastest within the top 10 teams.

Abstract:
Vector Quantized Variational AutoEncoder (VQ-VAE) models realize fast image generation by encoding and quantizing the raw input in a single-level or hierarchical compressed latent space. However, the learned representations are not good at capturing the complex relations that exist in the data, and one usually adopts domain-specific autoregressive models to fit a prior distribution, which results in two stages of learning. In this work, we propose VQMG, a novel and unified framework for multi-hop relational reasoning and explicit representation learning. By introducing Multi-hops Graph Convolution Networks (MGCN), complicated relations in the hierarchical latent space are effectively captured by the Inner Graph, while the fitting of the autoregressive prior is performed coherently by the Outer Graph to improve performance. Experiments on multimedia tasks including point cloud segmentation, stroke-level text detection, and image generation verify the efficiency and applicability of our approach.

Abstract:
Animation production workflows centred around motion capture techniques often require animators to edit the motion for various artistic and technical reasons. This process generally uses a set of keyframes. Unsupervised keyframe selection methods for motion capture sequences are in high demand to reduce laborious annotation. However, most existing methods are optimization-based, which causes flexibility and efficiency issues and eventually constrains the interaction with and control by animators. To address these limitations, we propose a novel graph-based deep reinforcement learning method for efficient unsupervised keyframe selection. First, a reward function is devised in terms of the reconstruction difference, obtained by comparing the original sequence with the interpolated sequence produced from the keyframes. The reward complies with the requirements of the animation pipeline by satisfying: 1) incremental reward to evaluate the interpolated keyframes immediately; 2) order insensitivity for consistent evaluation; and 3) non-diminishing return for comparable rewards between optimal and sub-optimal solutions. Then, by representing each skeleton frame as a graph, a graph-based deep agent is guided to heuristically select keyframes to maximize the reward. During inference it is no longer necessary to estimate the reconstruction difference, so the evaluation time can be reduced significantly. The experimental results on the CMU Mocap dataset demonstrate that our proposed method is able to select keyframes with high efficiency without noticeably compromising quality in comparison with state-of-the-art methods.
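
A reward based on reconstruction difference can be sketched as follows. This is a minimal illustration, not the authors' reward: it assumes linear interpolation between keyframes, frames given as flat lists of joint coordinates, a sequence of at least two frames, and that the first and last frames are always kept. The function names are hypothetical.

```python
def interpolate_from_keyframes(sequence, keyframes):
    """Rebuild every frame by linear interpolation between neighbouring keyframes."""
    last = len(sequence) - 1
    # Using a sorted set makes the result independent of selection order.
    keys = sorted(set(keyframes) | {0, last})
    recon = [None] * len(sequence)
    for a, b in zip(keys, keys[1:]):
        for t in range(a, b + 1):
            w = (t - a) / (b - a)
            recon[t] = [(1 - w) * pa + w * pb
                        for pa, pb in zip(sequence[a], sequence[b])]
    return recon

def reconstruction_error(sequence, keyframes):
    """Mean squared difference between original and interpolated frames."""
    recon = interpolate_from_keyframes(sequence, keyframes)
    total = sum((o - r) ** 2
                for orig, rec in zip(sequence, recon)
                for o, r in zip(orig, rec))
    return total / (len(sequence) * len(sequence[0]))

def reward(sequence, keyframes):
    """Higher is better: negative reconstruction error."""
    return -reconstruction_error(sequence, keyframes)

# Toy one-joint motion with a turning point at frame 2 (assumed data).
motion = [[0.0], [1.0], [2.0], [1.0], [0.0]]
```

Selecting frame 2 as a keyframe lets the interpolation follow the turning point, so `reward(motion, [2])` exceeds `reward(motion, [])`.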

Abstract:
Cross-validation (CV) is a ubiquitous model-agnostic tool for assessing the error of machine learning models. However, it has high complexity because the learner must be trained multiple times, especially in multimedia tasks with huge amounts of data. In this paper, we provide a unified framework to approximate the CV error for various common multimedia tasks, such as supervised, semi-supervised, and pairwise learning, which requires training only once. Moreover, we study the theoretical performance of the proposed approximate CV and provide an explicit finite-sample error bound. Experimental results on several datasets demonstrate that our approximate CV has no statistical discrepancy from the original CV but significantly improves efficiency, which is a great advantage in model selection.
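
The idea of obtaining a CV error from a single training run has a classical special case that is easy to check: for ridge regression, the leave-one-out residual equals the ordinary residual divided by (1 - h_ii), where h_ii is the leverage of sample i. The sketch below uses a one-feature ridge model without intercept to keep the algebra visible. It illustrates the general idea only and is not the approximation proposed in the paper; all function names are hypothetical.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y ~ w * x with penalty lam * w^2."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)

def loo_error_bruteforce(xs, ys, lam):
    """Leave-one-out mean squared error, retraining n times."""
    total = 0.0
    for i in range(len(xs)):
        w = fit_ridge_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], lam)
        total += (ys[i] - w * xs[i]) ** 2
    return total / len(xs)

def loo_error_single_fit(xs, ys, lam):
    """Same quantity from one fit, via the leverage correction e_i / (1 - h_ii)."""
    denom = sum(x * x for x in xs) + lam
    w = fit_ridge_1d(xs, ys, lam)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = x * x / denom  # h_ii for the one-feature ridge model
        total += ((y - w * x) / (1.0 - leverage)) ** 2
    return total / len(xs)

# Assumed toy data; lam > 0 keeps every leverage strictly below 1.
xs = [0.5, 1.0, 1.5, 2.0, 3.0]
ys = [1.1, 1.9, 3.2, 3.9, 6.1]
```

For this model the two functions agree up to floating-point rounding, while the second one fits the model only once.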

Abstract:
Stylized image captioning systems aim to generate a caption that is not only semantically related to a given image but also consistent with a given style description. One of the biggest challenges in this task is the lack of sufficient paired stylized data. Many studies focus on unsupervised approaches without considering the perspective of data augmentation. We begin with the observation that people may recall similar emotions when they are in similar scenes, and often express similar emotions with similar style phrases, which underpins our data augmentation idea. In this paper, we propose a novel Extract-Retrieve-Generate data augmentation framework to extract style phrases from small-scale stylized sentences and graft them onto large-scale factual captions. First, we design the emotional signal extractor to extract style phrases from small-scale stylized sentences. Second, we construct the pluggable multi-modal scene retriever to retrieve scenes, represented as pairs of an image and its stylized caption, that are similar to the query image or caption in the large-scale factual data. Finally, based on the style phrases of similar scenes and the factual description of the current scene, we build the emotion-aware caption generator to generate fluent and diversified stylized captions for the current scene. Extensive experimental results show that our framework can alleviate the data scarcity problem effectively. It also significantly boosts the performance of several existing image captioning models in both supervised and unsupervised settings, outperforming the state-of-the-art stylized image captioning methods in terms of both sentence relevance and stylishness by a substantial margin.

Abstract:
Semantic segmentation has been continuously investigated in the last ten years, and the majority of the established technologies are based on supervised models. In recent years, image-level weakly supervised semantic segmentation (WSSS), including single- and multi-stage processes, has attracted considerable attention due to its data labeling efficiency. In this paper, we propose to embed the affinity learning of multi-stage approaches in a single-stage model. To be specific, we introduce an adaptive affinity loss to thoroughly learn the local pairwise affinity. As such, a deep neural network is used to deliver comprehensive semantic information in the training phase, whilst improving the performance of the final prediction module. On the other hand, considering the existence of errors in the pseudo labels, we propose a novel label reassign loss to mitigate over-fitting. Extensive experiments are conducted on the PASCAL VOC 2012 dataset to evaluate the effectiveness of our proposed approach, which outperforms other standard single-stage methods and achieves performance comparable to several multi-stage methods.

Abstract:
Deep neural networks (DNNs) have demonstrated phenomenal success in image classification applications and are widely adopted in multimedia internet of things (IoT) use cases, such as smart home systems. To compensate for the limited resources on the IoT devices, the computation-intensive image classification tasks are often offloaded to remote cloud services. However, the offloading-based image classification could pose significant security and privacy concerns to the user data and the DNN model, leading to effective adversarial attacks that compromise the classification accuracy. The existing defense methods either impact the original functionality or result in high computation or model re-training overhead. In this paper, we develop a novel defense approach, namely Fake Gradient, to protect the privacy of the data and defend against adversarial attacks based on encryption of the output. Fake Gradient can hide the real output information by generating fake classes and further mislead the adversarial perturbation generation based on fake gradient knowledge, which helps maintain a high classification accuracy on the perturbed data. Our evaluations using ImageNet and 7 popular DNN models indicate that Fake Gradient is effective in protecting the privacy and defending against adversarial attacks targeting image classification applications.

Abstract:
Detecting abnormal activities in real-world surveillance videos is an important yet challenging task, as prior knowledge about video anomalies is usually limited or unavailable. Although many approaches have been developed to resolve this problem, few of them can capture the normal spatio-temporal patterns effectively and efficiently. Moreover, existing works seldom explicitly consider the local consistency at frame level and the global coherence of temporal dynamics in video sequences. To this end, we propose Convolutional Transformer based Dual Discriminator Generative Adversarial Networks (CT-D2GAN) to perform unsupervised video anomaly detection. Specifically, we first present a convolutional transformer to perform future frame prediction. It contains three key components, i.e., a convolutional encoder to capture the spatial information of the input video clips, a temporal self-attention module to encode the temporal dynamics, and a convolutional decoder to integrate spatio-temporal features and predict the future frame. Next, a dual discriminator based adversarial training procedure, which jointly considers an image discriminator that can maintain the local consistency at frame level and a video discriminator that can enforce the global coherence of temporal dynamics, is employed to enhance the future frame prediction. Finally, the prediction error is used to identify abnormal video frames. Thorough empirical studies on three public video anomaly detection datasets, i.e., UCSD Ped2, CUHK Avenue, and Shanghai Tech Campus, demonstrate the effectiveness of the proposed adversarial spatio-temporal modeling framework.
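
The last step, turning prediction error into a per-frame anomaly decision, can be sketched as below. The abstract does not specify the error measure; this sketch assumes a convention common in frame-prediction work, namely PSNR between predicted and observed frames followed by min-max normalization over the video. Frames are flat lists of pixel intensities in [0, 1], and the function names are hypothetical.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between a predicted and an observed frame."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    # A small constant avoids division by zero for perfect predictions.
    return 10.0 * math.log10(max_val ** 2 / (mse + 1e-12))

def anomaly_scores(pred_frames, true_frames):
    """Map per-frame PSNR to [0, 1]; low PSNR (large error) gives a high score."""
    values = [psnr(p, t) for p, t in zip(pred_frames, true_frames)]
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [1.0 - (v - lo) / (hi - lo) for v in values]

# Assumed toy data: the third frame is predicted badly.
observed = [[0.2, 0.4, 0.6], [0.3, 0.5, 0.7], [0.9, 0.1, 0.8]]
predicted = [[0.21, 0.39, 0.6], [0.3, 0.52, 0.69], [0.4, 0.5, 0.3]]
scores = anomaly_scores(predicted, observed)
```

A frame would then be flagged as abnormal when its score exceeds a threshold chosen on validation data.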

Abstract:
In multimodal tasks, the importance of text and image modal information often varies across input cases. To model this difference in importance, we propose a high-performance and highly general Dual-Router Dynamic Framework (DRDF), consisting of Dual-Router, MWF-Layer, experts, and an expert fusion unit. The text router and image router in Dual-Router take text modal information and image modal information respectively, and MWF-Layer is responsible for determining the importance of the modal information. Based on the result of this determination, MWF-Layer generates fused weights for the subsequent expert fusion. Experts can adopt a variety of backbones that match the current multimodal or unimodal task. DRDF features high generality and modularity, and we test 12 backbones such as Visual BERT and their corresponding DRDF instances on the multimodal dataset Hateful Memes and the unimodal datasets CIFAR10, CIFAR100, and TinyImagenet. Our DRDF instances outperform those backbones. We also validate the effectiveness of the components of DRDF through ablation studies, and discuss the reasons and ideas behind the DRDF design.

Abstract:
Fine-grained sketch-based image retrieval (FG-SBIR) is considered an ideal alternative to keyword-based image retrieval and image search by image, because sketches are rich and easily accessible. Previous works follow a paradigm that first extracts a global image feature with a convolutional neural network and then optimizes the model with a triplet loss. These works have made many efforts to narrow the domain gap and extract discriminative features. However, they ignore that the global feature is not good at capturing fine-grained details. In this paper, we emphasize that local features are more discriminative than the global feature in FG-SBIR and explore an effective way to utilize local features. Specifically, we first propose the Local Aligned Network (LA-Net), which solves FG-SBIR by directly aligning mid-level local features. Experiments show that it beats all previous baselines and is easy to implement, and we hope LA-Net will serve as a new strong baseline for FG-SBIR. Next, we propose the Dynamic Local Aligned Network (DLA-Net) to enhance LA-Net. LA-Net does not consider the spatial misalignment caused by the abstraction of sketches. To solve this problem, a dynamic alignment mechanism is introduced into LA-Net. This mechanism makes the sketch interact with the photo and dynamically decide where to align according to each photo. Experiments indicate that DLA-Net successfully addresses the spatial misalignment problem. It gains a significant performance boost over LA-Net and outperforms the state of the art in FG-SBIR. To the best of our knowledge, DLA-Net is the first model that beats humans on all datasets: QMUL FG-SBIR, QMUL Handbag, and Sketchy.

Abstract:
Deepfakes, i.e., synthetic or "fake" media content generated using deep learning, are a double-edged sword. On one hand, they pose new threats and risks in the form of scams, fraud, disinformation, social manipulation, or celebrity porn. On the other hand, deepfakes have just as many meaningful and beneficial applications: they allow us to create and experience things that no longer exist, or that have never existed, enabling numerous exciting applications in entertainment, education, and even privacy.

Abstract:
Product identification has become a very important component of the modern e-commerce shopping system. Consumers can enjoy watching livestreams and buying the products that livestream hosts recommend. However, with hundreds of products presented in a livestreaming video, finding a specific product can be laborious for consumers. Hence, automatic product identification is desired in livestreaming-based e-commerce systems. Compared with image-based visual search systems, the complicated content of livestreaming videos makes identification even more challenging. To promote research on product identification in livestreaming, we present the largest multimodal product retrieval dataset, named "Watch and Buy" (WAB), and launch the multimodal product retrieval challenge. We hope this workshop will help researchers further advance the performance and applicability of livestreaming product identification in real-world systems.

Abstract:
In this paper, we aim to reconstruct a full 3D human shape from a single image. Previous vertex-level and parameter regression approaches reconstruct the 3D human shape based on a pre-defined adjacency matrix that encodes positive relations between nodes. The deep topological relations of the surface of the 3D human body are not carefully exploited. Moreover, the performance of most existing approaches often suffers from a domain gap when handling the more frequent occlusion cases in real-world scenes. In this work, we propose a Deep Mesh Relation Capturing Graph Convolution Network, DC-GNet, with a shape completion task for 3D human shape reconstruction. Firstly, we propose to capture deep relations within mesh vertices, where an adaptive matrix encoding both positive and negative relations is introduced. Secondly, we propose a shape completion task to learn a prior about various kinds of occlusion cases. Our approach encodes mesh structure from more subtle relations between nodes in more distant regions. Furthermore, our shape completion module alleviates the performance degradation in outdoor scenes. Extensive experiments on several benchmarks show that our approach outperforms previous 3D human pose and shape estimation approaches.

Abstract:
Video-based person re-identification (Re-ID) aims to match target pedestrians across a non-overlapping camera system using video tracklets. The key issue in video Re-ID is exploring effective spatio-temporal features. Generally, the spatio-temporal information of a video sequence can be divided into two aspects: the discriminative information in each frame and the shared information over the whole sequence. To make full use of the rich information in video sequences, this paper proposes a Discrete Cosine Transform based Information Enhancement Network (DCT-IEN) to achieve a more comprehensive spatio-temporal representation in the frequency domain. Inspired by the principle that average pooling is a special frequency component of the DCT (the lowest frequency component), DCT-IEN first adopts the discrete cosine transform to convert the extracted feature maps into the frequency domain, thereby retaining more of the information that is embedded in different frequency components. With the help of the DCT frequency spectrum, two branches are adopted to learn the final video representation: the Frequency Selection Module (FSM) and the Lowest Frequency Enhancement Module (LFEM). FSM explores the most discriminative features in each frame by aggregating different frequency components with an attention mechanism. LFEM enhances the shared feature over the whole video sequence by frame feature regularization. By fusing these two kinds of features, DCT-IEN finally achieves a comprehensive video representation. We conduct extensive experiments on two widely used datasets. The experimental results verify our idea and demonstrate the effectiveness of DCT-IEN for video-based Re-ID.
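
The observation that average pooling corresponds to the lowest DCT frequency component can be verified numerically. The sketch below uses an unnormalized 2-D DCT-II written out with the standard library; the helper names are hypothetical, and a single small feature map stands in for one channel of a CNN feature tensor.

```python
import math

def dct2_1d(x):
    """Unnormalized DCT-II of a 1-D sequence."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]

def dct2_2d(feature_map):
    """Separable 2-D DCT-II: transform each row, then each column."""
    rows = [dct2_1d(r) for r in feature_map]
    cols = [dct2_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def average_pool(feature_map):
    """Global average pooling over a single-channel feature map."""
    height, width = len(feature_map), len(feature_map[0])
    return sum(sum(r) for r in feature_map) / (height * width)

# Assumed toy feature map with H = 3, W = 4.
feature_map = [[1.0, 2.0, 0.0, 3.0],
               [4.0, 1.0, 2.0, 2.0],
               [0.0, 5.0, 1.0, 3.0]]
coeffs = dct2_2d(feature_map)
lowest_frequency = coeffs[0][0] / (len(feature_map) * len(feature_map[0]))
```

Here `lowest_frequency` matches `average_pool(feature_map)` because every cosine basis value at frequency (0, 0) equals 1; the remaining coefficients carry the information that average pooling discards.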

Abstract:
Automatic assessment of breast cancer metastases plays an important role in helping pathologists reduce the time-consuming work of histopathological whole-slide image diagnosis. In terms of how knowledge is used, pathologists carefully check the low-magnification level for tumor patterns and the high-magnification level for tumor cell characteristics. In this paper, we propose a novel automatic patient-level tumor segmentation and classification method, which makes full use of the diagnostic knowledge cues from pathologists. For tumor segmentation, a multi-level view DeepLabV3+ (MLV-DeepLabV3+) is designed to explore the distinguishing features of cell characteristics between tumor and normal tissue. Furthermore, the expert segmentation models are selected and integrated by Pareto-front optimization to imitate an expert consultation and obtain a better diagnosis. For whole-slide classification, multi-level magnifications are adaptively checked to focus on the effective features at different magnifications. The experimental results demonstrate that our pathologist knowledge-based automatic assessment of whole-slide images is effective and robust on the public benchmark dataset.

Abstract:
The estimation of 3D human poses from time-synchronized, calibrated multi-view video usually consists of two steps: (1) a 2D detector to locate the 2D coordinate point position of the joint via heatmaps for each frame and (2) a post-processing method such as the recursive pictorial structure model or robust triangulation to obtain 3D coordinate points. However, most existing methods are based on a single frame only. They do not take advantage of the temporal characteristics of the video sequence itself, and must rely on post-processing algorithms. They are also susceptible to human self-occlusion, and the generated sequences suffer from jitter. Therefore, we propose a network model incorporating spatial and temporal features. Using a coarse-to-fine approach, the proposed heatmap temporal network (HTN) generates temporal heatmap information, with an occlusion heatmap filter used to filter low-quality heatmaps before they are sent to the HTN. The heatmap fusion and the triangulation weights are dynamically adjusted, and intermediate supervision is employed to enable better integration of temporal and spatial information. Our network is also end-to-end differentiable. This overcomes the long-standing problem of skeleton jitter being generated and ensures that the sequence is smooth and stable.

Abstract:
Pruning can remove redundant parameters and structures of Deep Neural Networks (DNNs) to reduce inference time and memory overhead. As an important component of neural networks, the feature map (FM) has started to be adopted for network pruning. However, the majority of FM-based pruning methods do not fully investigate the effective knowledge in the FM for pruning. In addition, it is challenging to design a robust pruning criterion with a small number of images and to achieve parallel pruning, due to the variability of FMs. In this paper, we propose Adaptive Knowledge Extraction for Channel Pruning (AKECP), which can compress the network quickly and efficiently. In AKECP, we first investigate the characteristics of FMs and extract effective knowledge with an adaptive scheme. Secondly, we formulate the effective knowledge of FMs to measure the importance of the corresponding network channels. Thirdly, thanks to the effective knowledge extraction, AKECP can efficiently and simultaneously prune all the layers with extremely few images or even a single image. Experimental results show that our method can compress various networks on different datasets without introducing additional constraints, and that it advances the state of the art. Notably, for ResNet-110 on CIFAR-10, AKECP achieves a 59.9% reduction in parameters and a 59.8% reduction in FLOPs with negligible accuracy loss. For ResNet-50 on ImageNet, AKECP saves 40.5% of the memory footprint and reduces FLOPs by 44.1% with only a 0.32% drop in Top-1 accuracy.

Abstract:
In this art project, we create Affective Color Fields: an interactive artifact that takes in a user's narrative of their emotional experiences and dynamically transforms it into Rothkoesque color fields through emotion classification. Inspired by Mark Rothko's abstract depiction of human emotions and Merleau-Ponty's phenomenological inquiry, we wish to establish an intimate relationship between interactive art and the subject by employing user's own interpretation and framing of life events. Through the performative and improvisational art-making process, users can playfully appropriate our artifact for a rich and personal aesthetic experience.

Abstract:
Despite recent progress on semantic segmentation, huge challenges remain in the semantic segmentation of high or ultra-high resolution images. Although the latest collaborative global-local semantic segmentation methods such as GLNet [4] and PPN [18] have achieved impressive results, they are inefficient and not fit for practical applications. Thus, in this paper, we propose a novel and efficient collaborative global-local framework based on PPN, named Faster-PPN, for the semantic segmentation of high or ultra-high resolution images, which achieves a better trade-off between efficiency and effectiveness while approaching real-time speed. Specifically, we propose Dual Mutual Learning to improve the feature representation of the global and local branches, which conducts knowledge distillation mutually between the two branches. Furthermore, we design the Pixel Proposal Fusion Module to perform a fine-grained selection mechanism, which further reduces the redundant pixels for fusion and thus improves inference speed. The experimental results on three challenging high or ultra-high resolution datasets, DeepGlobe, ISIC, and BACH, demonstrate that Faster-PPN achieves the best performance in accuracy, inference speed, and memory usage compared with state-of-the-art approaches. In particular, our method achieves real-time and near real-time speed with 36 FPS and 17.7 FPS on ISIC and DeepGlobe, respectively.

Abstract:
In intra coding, template matching prediction is an effective method to reduce the non-local redundancy inside image content. However, the prediction indicated by the best template match is not always the best actual prediction. To solve this problem, we propose a method that merges multiple template matching predictions through a convolutional neural network with an attention module. The convolutional neural network aims at exploring different combinations of the candidate template matching predictions, and the attention module focuses on determining the most significant prediction candidate. Besides, the spatial module in the attention mechanism can be utilized to model the relationship between the original pixels in the current block and the reconstructed pixels in adjacent regions (the template). Compared to directional intra prediction and traditional template matching prediction, our method provides a unified framework to generate predictions with high accuracy. The experimental results show that, compared with the averaging strategy, the BD-rate reductions can reach up to 4.7%, 5.5%, and 18.3% on the classic standard sequences (classB-classF), the SIQAD dataset (screen content), and the Urban100 dataset (natural scenes), respectively, while the average bit rate savings are 0.5%, 2.7%, and 1.8%, respectively.

Abstract:
Active learning has recently attracted increasing attention in the task of person re-identification, due to its unique scalability: it not only greatly reduces the annotation cost but also retains satisfactory performance. Although some preliminary active learning methods have been explored for the scalable person re-identification task, they have the following two problems: 1) inefficiency in the selection of image pairs due to the huge search space, and 2) ineffectiveness caused by ignoring the impact of unlabeled data in model training. Considering that, we propose a Multi-grained Active Semi-Supervised learning framework, named MASS, to address the scalable person re-identification problem in practical scenarios. Specifically, we first design a cluster-scatter procedure to alleviate the inefficiency problem, which consists of two components: a cluster step and a scatter step. The cluster step shrinks the search space into individual small clusters using a coarse-grained clustering method, and the subsequent scatter step further mines hard-to-distinguish image pairs from the unlabeled set to purify the learned clusters using a novel centrality-based adaptive purification strategy. Afterward, we introduce a customized purification loss for the purified clustering, which utilizes the complementary information in both labeled and unlabeled data to optimize the model and thereby solve the ineffectiveness problem. The cluster-scatter procedure and the model optimization are performed iteratively to achieve promising performance while greatly reducing the annotation cost. Extensive experimental results demonstrate that MASS can achieve performance competitive with fully supervised methods even with very few annotations.

Abstract:
In this paper, we introduce iART: an open Web platform for art-historical research that facilitates the process of comparative vision. The system integrates various machine learning techniques for keyword- and content-based image retrieval as well as category formation via clustering. An intuitive GUI supports users to define queries and explore results. By using a state-of-the-art cross-modal deep learning approach, it is possible to search for concepts that were not previously detected by trained classification models. Art-historical objects from large, openly licensed collections such as Amsterdam Rijksmuseum and Wikidata are made available to users.

Abstract:
Oriented bounding boxes are widely used for object detection in aerial images. Existing oriented object detection methods typically follow the general object detection paradigm by adding an extra rotation angle to the horizontal bounding boxes. However, the angular periodicity makes angle regression difficult and makes the bounding boxes sensitive to rotation. In this paper, we propose a new anchor-free oriented object detector, Polar Ray Network (PRNet), where object keypoints are represented by polar coordinates without angle regression. Our PRNet learns a set of polar rays from the object center to the boundary at predefined, equally distributed angles. We introduce a dynamic PointConv module to optimize the regression of polar rays by incorporating object corner features. Furthermore, a classification feature guidance module is presented to improve the classification accuracy by incorporating more spatial content from polar rays. Experimental results on two public datasets, i.e., DOTA and HRSC2016, demonstrate that the proposed PRNet significantly outperforms existing anchor-free detectors and is highly competitive with state-of-the-art two-stage anchor-based methods.
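
Representing an object by ray lengths at fixed, equally spaced angles means that only distances need to be regressed, and boundary points follow from a polar-to-Cartesian conversion. The sketch below shows that decoding step only. The function name, the choice of starting at angle 0, and the counter-clockwise ordering are assumptions for illustration, not details taken from PRNet.

```python
import math

def decode_polar_rays(center, ray_lengths):
    """Convert ray lengths at equally spaced angles into boundary points.

    Ray k points in direction 2*pi*k / n, so no angle has to be predicted.
    """
    cx, cy = center
    n = len(ray_lengths)
    points = []
    for k, length in enumerate(ray_lengths):
        theta = 2.0 * math.pi * k / n
        points.append((cx + length * math.cos(theta),
                       cy + length * math.sin(theta)))
    return points

# Assumed example: four rays around a center at (10, 5).
boundary = decode_polar_rays((10.0, 5.0), [2.0, 1.0, 2.0, 1.0])
```

With more rays the decoded points trace the object outline more closely, and an oriented box can be fitted to them afterwards.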

Abstract:
Computational food analysis (CFA) naturally requires multi-modal evidence of a particular food, e.g., images, recipe text, etc. A key to making CFA possible is multi-modal shared representation learning, which aims to create a joint representation of the multiple views (text and image) of the data. In this work we propose a method for food domain cross-modal shared representation learning that preserves the vast semantic richness present in the food data. Our proposed method employs an effective transformer-based multilingual recipe encoder coupled with a traditional image embedding architecture. Here, we propose the use of imperfect multilingual translations to effectively regularize the model while at the same time adding functionality across multiple languages and alphabets. Experimental analysis on the public Recipe1M dataset shows that the representation learned via the proposed method significantly outperforms the current state of the art (SOTA) on retrieval tasks. Furthermore, the representational power of the learned representation is demonstrated through a generative food image synthesis model conditioned on recipe embeddings. Synthesized images can effectively reproduce the visual appearance of paired samples, indicating that the learned representation captures the joint semantics of both the textual recipe and its visual content, thus narrowing the modality gap.

Abstract:
Most Visual Question Answering (VQA) models are faced with language bias when learning to answer a given question, thereby failing to understand multimodal knowledge simultaneously. Based on the fact that VQA samples with different levels of language bias contribute differently for answer prediction, in this paper, we overcome the language prior problem by proposing a novel Language Bias driven Curriculum Learning (LBCL) approach, which employs an easy-to-hard learning strategy with a novel difficulty metric Visual Sensitive Coefficient (VSC). Specifically, in the initial training stage, the VQA model mainly learns the superficial textual correlations between questions and answers (easy concept) from more-biased examples, and then progressively focuses on learning the multimodal reasoning (hard concept) from less-biased examples in the following stages. The curriculum selection of examples on different stages is according to our proposed difficulty metric VSC, which is to evaluate the difficulty driven by the language bias of each VQA sample. Furthermore, to avoid the catastrophic forgetting of the learned concept during the multi-stage learning procedure, we propose to integrate knowledge distillation into the curriculum learning framework. Extensive experiments show that our LBCL can be generally applied to common VQA baseline models, and achieves remarkably better performance on the VQA-CP v1 and v2 datasets, with an overall 20% accuracy boost over baseline models.

Abstract:
The subjective evaluation of music generation techniques has been mostly done with questionnaire-based listening tests while ignoring the perspectives from music composition, arrangement, and soundtrack editing. In this paper, we propose an editing test to evaluate users' editing experience of music generation models in a systematic way. To do this, we design a new music style transfer model combining the non-chronological inference architecture, autoregressive models and the Transformer, which serves as an improvement from the baseline model on the same style transfer task. Then, we compare the performance of the two models with a conventional listening test and the proposed editing test, in which the quality of generated samples is assessed by the amount of effort (e.g., the number of required keyboard and mouse actions) spent by users to polish a music clip. Results on two target styles indicate that the improvement over the baseline model can be reflected by the editing test quantitatively. Also, the editing test provides profound insights which are not accessible from usual listening tests. The major contribution of this paper is the systematic presentation of the editing test and the corresponding insights, while the proposed music style transfer model based on state-of-the-art neural networks represents another contribution.

Abstract:
In this paper, we address the problem of selectively segmenting an actor and its action in a video clip given a sentence description. The main challenge is to match the local semantic features of the video with the heterogeneous textual features. A widely used language processing method in previous works is to leverage bi-LSTM and self-attention, which fixes the attention of the sentence and neglects the individual characteristics of the video, causing the attention of the sentence to mismatch the most discriminative features of the video. The proposed algorithm allows the sentence to learn the most discriminative features of the video, remarkably improving the accuracy of matching and segmentation. Specifically, we propose a cascade cross-modal attention that leverages visual features from two perspectives to attend to language from coarse to fine, generating discriminative vision-aware language features. Moreover, equipping our framework with a contrastive learning method and a specially designed hard negative mining strategy helps the proposed network identify the positive sample among numerous negatives, further improving performance. To demonstrate the effectiveness of our approach, we conduct experiments on two datasets: A2D Sentences and J-HMDB Sentences. Experimental results show that our method significantly improves the performance over recent state-of-the-art methods.

Abstract:
Fine-grained visual recognition tasks typically require training data with reliable acquisition and annotation processes. Acquiring such datasets with precise fine-grained annotations is very expensive and time-consuming. Conversely, a vast amount of web data is relatively easy to obtain with nearly no human effort. Nevertheless, the presence of label noise in web images becomes a huge obstacle for training robust fine-grained recognition models. In this work, we investigate the noisy label problem and propose a method that can specifically distinguish in- and out-of-distribution noisy samples. It can purify the web training data by discarding out-of-distribution noisy images and relabeling in-distribution ones. After purification, we can train the model on a less noisy web training set to achieve better robustness and performance. Extensive experiments on three real-world web datasets for fine-grained visual recognition demonstrate the superiority of our approach.

Abstract:
Deep learning based image classification models have been shown to be vulnerable to adversarial attacks that inject deliberately crafted noise into clean images. To defend against adversarial attacks in a training-free and attack-agnostic manner, this work proposes a novel and effective reconstruction-based defense framework by delving into deep image prior (DIP). Fundamentally different from existing reconstruction-based defenses, the proposed method analyzes and explicitly incorporates the model decision process into the defense. Given an adversarial image, we first map its reconstructed images during DIP optimization to the model decision space, where cross-boundary images can be detected and on-boundary images can be further localized. Then, adversarial noise is purified by perturbing on-boundary images along the reverse direction to the adversarial image. Finally, on-manifold images are stitched to construct an image that can be correctly predicted by the victim classifier. Extensive experiments demonstrate that the proposed method outperforms existing state-of-the-art reconstruction-based methods in defending against both white-box attacks and defense-aware attacks. Moreover, the proposed method maintains high visual quality during adversarial image reconstruction.

Abstract:
The rapid development of real-time interactive applications has brought new challenges to ensuring users' quality of experience (QoE). These applications have deadline requirements and transmit data in blocks. They not only require high throughput and low latency, but also require the transmission order of data blocks to be considered so that the blocks arrive before their deadlines. However, existing congestion control and scheduling algorithms do not fit the characteristics of real-time interactive applications well. Therefore, a high-performance hybrid control algorithm is urgently needed to ensure users' QoE. In response to this problem, this paper proposes Phoenix, which combines a scheduling algorithm based on transmission profit with a responsive congestion control algorithm, and evaluates it in simulations across a variety of scenarios. Experimental results show that Phoenix performs very well in a variety of scenarios, with an average QoE 33.7% higher than BBR+EDF.
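
For context, the EDF (earliest deadline first) half of the BBR+EDF baseline can be sketched as follows. This illustrates the baseline ordering only, not the proposed profit-based scheduler, whose definition the abstract does not give. The `Block` fields, the fixed-bandwidth assumption, and the rule of skipping blocks that would arrive late are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    size_bytes: int
    deadline_s: float  # absolute deadline in seconds (hypothetical field)


def edf_schedule(blocks, bandwidth_bps, now_s=0.0):
    """Send blocks in earliest-deadline-first order over a fixed-rate link.

    Returns (sent_ids, missed_ids). A block that cannot finish before its
    deadline is skipped rather than transmitted late.
    """
    sent, missed = [], []
    t = now_s
    for block in sorted(blocks, key=lambda b: b.deadline_s):
        finish = t + block.size_bytes * 8 / bandwidth_bps
        if finish <= block.deadline_s:
            sent.append(block.block_id)
            t = finish
        else:
            missed.append(block.block_id)
    return sent, missed


blocks = [Block(1, 1000, 0.5), Block(2, 1000, 0.01), Block(3, 500, 0.02)]
sent, missed = edf_schedule(blocks, bandwidth_bps=1_000_000)
```

A profit-based scheduler would replace the sort key with a score that weighs more than the deadline alone.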

Abstract:
To comprehend long duration videos, the deep video understanding (DVU) task is proposed to recognize interactions on scene level and relationships on movie level and answer questions on these two levels. In this paper, we propose a solution to the DVU task which applies joint learning of interaction and relationship prediction and multimodal feature fusion. Our solution handles the DVU task with three joint learning sub-tasks: scene sentiment classification, scene interaction recognition and super-scene video relationship recognition, all of which utilize text features, visual features and audio features, and predict representations in semantic space. Since sentiment, interaction and relationship are related to each other, we train a unified framework with joint learning. Then, we answer questions for video analysis in DVU according to the results of the three sub-tasks. We conduct experiments on the HLVU dataset to evaluate the effectiveness of our method.

Abstract:
It is known that deep neural models are vulnerable to adversarial attacks. Digital attacks can craft imperceptible perturbations but lack the ability to be applied in physical environments. To address this issue, efforts have been made to study patch attacks in the physical world, especially for object detection models. Previous works mostly focus on evading the detection model itself but ignore the impact of human observers. In this paper, we study legitimate adversarial attacks that evade both human eyes and detection models in the physical world. To this end, we delve into the issue of patch rationality and propose several indicators for evaluating the rationality of physical adversarial patches. Besides, we propose a novel framework with a two-stage training strategy to generate our legitimate adversarial patches (LAPs). In both numerical simulations and physical experiments, our LAPs achieve significant attack effects and visual rationality.

Abstract:
Text-based visual question answering (TextVQA) requires analyzing both the visual contents and texts in an image to answer a question, which is more practical than general visual question answering (VQA). Existing efforts tend to regard optical character recognition (OCR) as a pre-processing step and then combine it with a VQA framework. This makes the performance of multimodal reasoning and question answering highly dependent on the accuracy of OCR. In this work, we address this issue from two perspectives. First, we take advantage of multimodal cues to complete the semantic information of texts. A visually enhanced text embedding is proposed to enable understanding of texts without accurately recognizing them. Second, we further leverage rich contextual information to modify the answer texts even if the OCR module does not correctly recognize them. In addition, the visual objects are endowed with semantic representations so that objects lie in the same semantic space as OCR tokens. Equipped with these techniques, the cumulative error propagation caused by poor OCR performance is effectively suppressed. Extensive experiments on the TextVQA and ST-VQA datasets demonstrate that our approach achieves state-of-the-art performance in terms of accuracy and robustness.

Abstract:
3D human mesh recovery from point clouds is essential for various tasks, including AR/VR and human behavior understanding. Previous works in this field either require high-quality 3D human scans or sequential point clouds, which cannot be easily applied to low-quality 3D scans captured by consumer-level depth sensors. In this paper, we make the first attempt to reconstruct reliable 3D human shapes from single-frame partial point clouds. To achieve this, we propose an end-to-end learnable method, named VoteHMR. The core of VoteHMR is a novel occlusion-aware voting network that can first reliably produce visible joint-level features from the input partial point clouds, and then complete the joint-level features through the kinematic tree of the human skeleton. Compared with holistic features used by previous works, the joint-level features can not only effectively encode the human geometry information but also be robust to noisy inputs with self-occlusions and missing areas. By exploiting the rich complementary clues from the joint-level features and global features from the input point clouds, the proposed method encourages reliable and disentangled parameter predictions for statistical 3D human models, such as SMPL. The proposed method achieves state-of-the-art performances on two large-scale datasets, namely SURREAL and DFAUST. Furthermore, VoteHMR also demonstrates superior generalization ability on real-world datasets, such as Berkeley MHAD.

Abstract:
Recently, fake news with text and images have achieved more effective diffusion than text-only fake news, raising a severe issue of multimodal fake news detection. Current studies on this issue have made significant contributions to developing multimodal models, but they are defective in modeling the multimodal content sufficiently. Most of them only preliminarily model the basic semantics of the images as a supplement to the text, which limits their performance on detection. In this paper, we find three valuable text-image correlations in multimodal fake news: entity inconsistency, mutual enhancement, and text complementation. To effectively capture these multimodal clues, we innovatively extract visual entities (such as celebrities and landmarks) to understand the news-related high-level semantics of images, and then model the multimodal entity inconsistency and mutual enhancement with the help of visual entities. Moreover, we extract the embedded text in images as the complementation of the original text. All things considered, we propose a novel entity-enhanced multimodal fusion framework, which simultaneously models three cross-modal correlations to detect diverse multimodal fake news. Extensive experiments demonstrate the superiority of our model compared to the state of the art.

Abstract:
We tackle the problem of learning the underlying disentangled latent factors that are shared between the paired bi-modal data in cross-modal retrieval. Typically the data in both modalities are complex, structured, and high dimensional (e.g., image and text), for which the conventional deep auto-encoding latent variable models such as the Variational Autoencoder (VAE) often suffer from difficulty of accurate decoder training or realistic synthesis. In this paper we propose a novel idea of the implicit decoder, which completely removes the ambient data decoding module from a latent variable model, via implicit encoder inversion that is achieved by Jacobian regularization of the low-dimensional embedding function. Motivated from the recent Identifiable-VAE (IVAE) model, we modify it to incorporate the query modality data as conditioning auxiliary input, which allows us to prove that the true parameters of the model can be identifiable under some regularity conditions. Tested on various datasets where the true factors are fully/partially available, our model is shown to identify the factors accurately, significantly outperforming conventional latent variable models.

Abstract:
Due to privacy concerns, the recommender systems community increasingly favors the One-class Collaborative Filtering (OCCF) framework, which predicts user preferences based only on binary implicit feedback (e.g., click or not-click, rated or unrated). The major challenge in the OCCF problem stems from the inherent noise in implicit interactions. Previous approaches have taken into account the noise in unobserved interactions (i.e., not-click only means a missing value, rather than negative feedback). However, they generally ignore the noise in observed interactions (i.e., click does not necessarily represent positive feedback), which might degrade performance. To address this issue, we propose a novel iterative relabeling framework to jointly mitigate the noise in both observed and unobserved interactions. As the core of the framework, the iterative relabeling module exploits the self-training principle to dynamically generate pseudo labels for user preferences. The downstream module for a recommendation task is then trained with the refreshed labels, in which the noisy patterns are largely alleviated. Finally, extensive experiments on three real-world datasets demonstrate the effectiveness of our proposed methods.

Abstract:
This companion paper supports the replication of the paper "Campus3D: A Photogrammetry Point Cloud Benchmark for Outdoor Scene Hierarchical Understanding", which was presented at ACM Multimedia 2020. The supported paper's main purpose was to provide a photogrammetry point cloud-based dataset with hierarchical multilabels to facilitate the area of 3D deep learning. Based on the provided dataset and source code, in this work we build a complete package to reimplement the proposed methods and experiments (i.e., the hierarchical learning framework and the benchmarks of the hierarchical semantic segmentation task). Specifically, this paper contains the technical details of the package, including file structure, dataset preparation, package installation, and the execution of the experiments. We also present the replicated experiment results and indicate our contributions to the original implementation.

Abstract:
Unsupervised learning of global features for 3D shape analysis is an important research challenge because it avoids manual effort for supervised information collection. In this paper, we propose a view-based deep learning model called Hierarchical View Predictor (HVP) to learn 3D shape features from unordered views in an unsupervised manner. To mine highly discriminative information from unordered views, HVP performs a novel hierarchical view prediction over a view pair, and aggregates the knowledge learned from the predictions in all view pairs into a global feature. In a view pair, we pose hierarchical view prediction as the task of hierarchically predicting a set of image patches in a current view from its complementary set of patches, and in addition, completing the current view and its opposite from any one of the two sets of patches. Hierarchical prediction, in patches to patches, patches to view and view to view, facilitates HVP to effectively learn the structure of 3D shapes from the correlation between patches in the same view and the correlation between a pair of complementary views. In addition, the employed implicit aggregation over all view pairs enables HVP to learn global features from unordered views. Our results show that HVP can outperform state-of-the-art methods under large-scale 3D shape benchmarks in shape classification and retrieval.

Abstract:
3D reconstruction of stereo endoscope images, as an enabling technique for varied surgical systems, e.g., medical droids, navigation, etc., suffers from severe overfitting problems due to scarce labels. Semi-supervised learning based on a Teacher-Student Network (TSN) is a potential solution, which utilizes a supervised teacher model trained on available labeled data to teach a student model on all images by assigning them pseudo labels. However, TSN often faces a dilemma: if given only a few labeled endoscope images, the teacher model will be poorly trained and produce noisy pseudo labels, degrading the student model significantly. To solve this, we propose an improved TSN for robust 3D reconstruction of stereo endoscope images. Specifically, two novel modules are introduced: 1) a semi-supervised teacher model based on adversarial learning to produce mostly correct pseudo labels by enforcing consistency in predictions for both labeled and unlabeled data, and 2) a confidence network to further filter out noisy pseudo labels by estimating a confidence for each prediction of the teacher model. By doing so, the student model is able to distill knowledge from more accurate and less noisy pseudo labels, thus achieving improved performance. Experimental results on two public datasets show that our improved TSN outperforms the state of the art, reducing the averaged disparity error by at least 13.5%.
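
The confidence-based filtering of pseudo labels can be illustrated with a minimal sketch. It is not the authors' network or loss: the function name, the L1 error, the flat per-pixel lists, and the 0.5 threshold are all assumptions made for illustration.

```python
def masked_l1_loss(student_disp, pseudo_disp, confidence, threshold=0.5):
    """Mean absolute error over pixels whose teacher confidence passes a threshold.

    All three inputs are flat sequences of equal length; pixels with
    confidence below `threshold` (a hypothetical value) are ignored.
    """
    if not (len(student_disp) == len(pseudo_disp) == len(confidence)):
        raise ValueError("inputs must have the same length")
    kept = [
        abs(s - p)
        for s, p, c in zip(student_disp, pseudo_disp, confidence)
        if c >= threshold
    ]
    # With no confident pixels there is nothing to learn from; return zero loss.
    return sum(kept) / len(kept) if kept else 0.0


loss = masked_l1_loss([1.0, 2.0, 3.0], [1.5, 2.0, 10.0], [0.9, 0.8, 0.1])
```

In this toy call, the third pixel carries a large pseudo-label error but low confidence, so it does not contribute to the student's loss.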

Abstract:
This paper presents the method that underlies our submission to the Pre-training for Video Understanding Challenge Track II. We follow the basic pipeline of temporal segment networks [20] and further improve its performance in several aspects. Specifically, we use the latest transformer-based architectures, e.g., Swin Transformer, DeiT, and CLIP-ViT, to enhance the representation power. We analyze different pre-training proxy tasks on the official pre-training datasets and other open-source video datasets. With these techniques, we derive an ensemble of deep models that attains a high classification accuracy (Top-1 accuracy 62.28%) on the testing set and secures first place in Track II of this challenge.

Abstract:
Facial micro-expressions (FMEs) are involuntary facial movements that occur spontaneously when a person experiences an emotion but tries to suppress or repress the facial expression, usually in high-risk situations. As a result, FMEs are very short in duration, an important feature that distinguishes them from ordinary facial expressions. FMEs are considered to be one of the most valuable cues for complex human emotion understanding and lie detection. Since 2014, the computational analysis and automation of FMEs have been an emerging area of face research. The workshop will explore various dimensions of the human mind through emotion understanding and FME analysis, as well as extended research based on multimodal approaches.

Abstract:
Due to its great success in object detection and instance segmentation, Mask R-CNN has attracted great attention and is widely adopted as a strong baseline for arbitrary-shaped scene text detection and spotting. However, two issues remain to be settled. The first is the dense text case, which is easily neglected but quite practical. Multiple instances may exist in one proposal, which makes it difficult for the mask head to distinguish different instances and degrades the performance. In this work, we argue that the performance degradation results from the learning confusion issue in the mask head. We propose to use an MLP decoder instead of the "deconv-conv" decoder in the mask head, which alleviates the issue and improves robustness significantly. We also propose instance-aware mask learning, in which the mask head learns to predict the shape of the whole instance rather than classify each pixel as text or non-text. With instance-aware mask learning, the mask branch can learn separated and compact masks. The second issue is that, due to large variations in scale and aspect ratio, RPN needs complicated anchor settings, making it hard to maintain and transfer across different datasets. To settle this issue, we propose an adaptive label assignment in which all instances, especially those with extreme aspect ratios, are guaranteed to be associated with enough anchors. Equipped with these components, the proposed method, named MAYOR, achieves state-of-the-art performance on five benchmarks: DAST1500, MSRA-TD500, ICDAR2015, CTW1500, and Total-Text.

Abstract:
This companion paper supports the experimental replication of the paper "Norm-in-Norm Loss with Faster Convergence and Better Performance for Image Quality Assessment" presented at ACM Multimedia 2020. We provide the software package for replicating the implementation of the "Norm-in-Norm" loss and the corresponding "LinearityIQA" model used in the original paper. This paper contains the guidelines to reproduce all the experimental results of the original paper.

Abstract:
Versatile Video Coding (VVC) is the most recent international video coding standard jointly developed by ITU-T and ISO/IEC, and was finalized in July 2020. VVC allows for significant bit-rate reductions of around 50% for the same subjective video quality compared to its predecessor, High Efficiency Video Coding (HEVC). One year after finalization, VVC support in devices and chipsets is still under development, which is in line with the typical development cycles of new video coding standards. This paper presents open-source software packages that allow building a complete VVC end-to-end toolchain just one year after its finalization. This includes the Fraunhofer HHI VVenC library for fast and efficient VVC encoding as well as HHI's VVdeC library for live decoding. An experimental integration of VVC in the GPAC software tools and the FFmpeg media framework allows packaging VVC bitstreams, e.g., encoded with VVenC, in the MP4 file format and using DASH for content creation and streaming. The integration of VVdeC allows playback on the receiver. Given these packages, step-by-step tutorials are provided for two possible application scenarios: VVC file encoding plus playback, and adaptive streaming with DASH.

Abstract:
In this paper, we utilize facial action unit (AU) detection to construct an end-to-end deep learning framework for the macro- and micro-expression spotting task in long video sequences. The proposed framework focuses on individual components of facial muscle movement rather than processing the whole image, which eliminates the influence of image changes caused by noise, such as body or head movement. Compared with existing models that deploy deep learning methods with classical Convolutional Neural Network (CNN) models, the proposed framework utilizes Gated Recurrent Unit (GRU), Long Short-term Memory (LSTM), or our proposed Concat-CNN models to learn the characteristic correlation between AUs of distinctive frames. The Concat-CNN uses three convolutional kernels of different sizes to observe features of different durations and emphasizes both local and global mutation features by changing the dimensionality (max-pooling size) of the output space. Our proposal achieves state-of-the-art performance in terms of overall F1-scores: 0.2019 on CAS(ME)2-cropped, 0.2736 on SAMM Long Video, and 0.2118 on CAS(ME)2, which not only outperforms the baseline but also ranks 3rd in the FME Challenge 2021 for the combined datasets of CAS(ME)2-cropped and SAMM-LV.

Abstract:
Multi-sensory data has exhibited a clear advantage in expressing richer and more complex feelings in the Emotion Recognition in Conversation (ERC) task. Yet, current methods for multimodal dynamics that aggregate modalities or employ additional modality-specific and modality-shared networks are still inadequate in balancing the sufficiency of multimodal processing against scalability to incrementally added multi-sensory data types. This creates a bottleneck for improving ERC performance. To this end, we present MetaDrop, a differentiable and end-to-end approach for the ERC task that learns module-wise decisions across modalities and conversation flows simultaneously, which supports adaptive information sharing patterns and dynamic fusion paths. Our framework mitigates the problem of modelling complex multimodal relations while scaling well with the number of modalities. Experiments on two popular multimodal ERC datasets show that MetaDrop achieves new state-of-the-art results.

Abstract:
Nowadays, almost all online orders are placed through devices with screens, such as mobile phones, tablets, and computers. With the rapid development of the Internet of Things (IoT) and smart appliances, more and more screenless smart devices, e.g., smart speakers and smart refrigerators, appear in our daily lives. They open up new means of interaction and may provide an excellent opportunity to reach new customers and increase sales. However, not all items are suitable for screenless shopping, since the appearance of some items plays an important role in consumer decision making. Typical examples include clothes, dolls, bags, and shoes. In this paper, we aim to infer the significance of every item's appearance in consumer decision making and identify the group of items that are suitable for screenless shopping. Specifically, we formulate the problem as a classification task that predicts whether an item's appearance has a significant impact on people's purchase behavior. To solve this problem, we extract multi-modal features from three different views and collect a set of necessary labels via crowdsourcing. We then propose an iterative semi-supervised learning framework with a carefully designed multi-modal enhancement module. Experimental results verify the effectiveness of the proposed method.

Abstract:
Blind natural video quality assessment (BVQA), also known as no-reference video quality assessment, is a highly active research topic. In our recent contribution titled "Blind Natural Video Quality Prediction via Statistical Temporal Features and Deep Spatial Features" published in ACM Multimedia 2020, we proposed a two-level video quality model employing statistical temporal features and spatial features extracted by a deep convolutional neural network (CNN) for this purpose. At the time of publishing, the proposed model (CNN-TLVQM) achieved state-of-the-art results in BVQA. In this paper, we describe the process of reproducing the published results by using CNN-TLVQM on two publicly available natural video quality datasets.

Abstract:
With the wide application of colored point clouds (CPCs) in many fields, much attention has been paid to the distortions of CPCs caused by compression and reconstruction. How to effectively evaluate the visual quality of a CPC has become an urgent issue. In this paper, a Point cloud projection and Multi-scale feature fusion network based Blind Visual Quality Assessment method (denoted as PM-BVQA) is proposed for CPCs. A CPC in 3D space is first projected into a 2D color projection map and a geometric projection map, and then a multi-scale feature fusion network is designed to blindly evaluate the visual quality of the CPC. The proposed PM-BVQA method includes three modules: a joint color-geometric feature extractor, a two-stage multi-scale feature fusion module, and a spatial pooling module. Considering the multi-channel characteristics of the human visual system (HVS), unimodal features of different scales are obtained by the joint color-geometric feature extractor from the color and geometric projection maps. The unimodal color and geometric features are fused to capture the cross-modal complementary information between these two types of information. By integrating cross-modal fused features at different scales, the complementary relationships between different channels of the HVS are simulated. The spatial pooling module takes into account the attention mechanism of the HVS and performs a weighted summation of local regional quality to obtain the final global quality score of the CPC. A subjective CPC database with coding distortion is used to verify the effectiveness of the proposed method, and the experimental results show that the proposed blind quality assessment method is more consistent with subjective visual perception than existing quality assessment methods.
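
The weighted summation performed by the spatial pooling module can be illustrated with a minimal sketch. It is not the network described above: the function name is hypothetical, and normalizing by the sum of the weights is an assumption, since the abstract does not say how the attention weights are produced or scaled.

```python
def pooled_quality(local_scores, attention_weights):
    """Combine per-region quality scores into one global score.

    Computes sum(w_i * q_i) / sum(w_i); dividing by the weight total is an
    assumption so that uniform weights reduce to a plain average.
    """
    if len(local_scores) != len(attention_weights):
        raise ValueError("scores and weights must have the same length")
    total = sum(attention_weights)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    weighted = sum(w * q for w, q in zip(attention_weights, local_scores))
    return weighted / total


global_score = pooled_quality([2.0, 4.0], [1.0, 3.0])
```

Regions with larger attention weights pull the global score toward their local quality, so the second region dominates in the toy call above.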

Abstract:
A creative image-and-text generative AI system mimics humans' extraordinary abilities to provide users with diverse and comprehensive caption suggestions, as well as rich image creations. In this work, we demonstrate such an AI creation system that produces both diverse captions and rich images. When users imagine an image and associate it with multiple captions, our system paints a rich image that reflects all captions faithfully. Likewise, when users upload an image, our system describes it with multiple diverse captions. We propose a unified multi-modal framework to achieve this goal. Specifically, our framework jointly models image-and-text representations with a Transformer network, which supports rich image creation by accepting multiple captions as input. We consider the relations among input captions to encourage diversity in training and adopt a non-autoregressive decoding strategy to enable real-time inference. Based on these, our system supports the generation of both diverse captions and rich images. Our code is available online.

Abstract:
The Multi-access Edge Computing (MEC) paradigm offers cloud-computing support to rich media applications, including Dynamic Adaptive Streaming over HTTP (DASH)-based ones, at the edge of the network, close to mobile users. MEC servers, typically deployed at base stations (BSs), help reduce latency and improve the quality of experience (QoE) of video streaming. Unfortunately, communications involving mobile users require handovers between BSs, and these affect both transmission efficiency, because of the relative position of the MEC servers, and transit cost. At the same time, the serving MEC for a mobile user need not necessarily change when a handover occurs. This paper introduces QoE Ready to Respond (QoE-R2R), a QoE-aware MEC selection scheme for DASH-based mobile adaptive video streaming that optimizes video transmission in a MEC-supported network environment. Simulation-based testing shows that the proposed QoE-R2R scheme outperforms some traditional alternative solutions. Compared to hit rate and delay-based schemes, QoE-R2R reduces transmission time by 27.6% and improves QoE by 6.2%.

Abstract:
In this paper we propose a novel data augmentation approach for visual content domains that have scarce training datasets, compositing synthetic 3D objects within real scenes. We show the performance of the proposed system in the context of object detection in thermal videos, a domain where i) training datasets are very limited compared to visible spectrum datasets and ii) creating full realistic synthetic scenes is extremely cumbersome and expensive due to the difficulty in modeling the thermal properties of the materials of the scene. We compare different augmentation strategies, including state of the art approaches obtained through RL techniques, the injection of simulated data and the employment of a generative model, and study how to best combine our proposed augmentation with these other techniques. Experimental results demonstrate the effectiveness of our approach, and our single-modality detector achieves state-of-the-art results on the FLIR ADAS dataset.

Abstract:
The artwork addresses the topic of life space. Instead of physical space, it considers the mental space of human beings. People usually focus on themselves to solve various life tasks, and the scale of their mental space influences how they perceive the world. The artwork tries to make people aware of the connection between mental space and life space. The Sand Scope introduces a microcosm of the world, so that the scale of mental space can be compared with the scale of the microcosm through the relative scales of the microcosm and the whole world. From this new perspective, the Sand Scope reminds people to escape from the routine of their daily lives and rethink the meaning of life. The multimedia input comprises gray image analysis to form stamps with portraits of current audiences and past participants, color image subtraction to compose the texture of the mountain drawing on the painting from clothing information, and a timed buffer to capture and replay ambient sounds continuously after a delay, along with color images blended during that period. In the interactive installation, an improvisational painting in the form of a Chinese brush painting with stamps from connoisseurs was exhibited. The generation of stamps from the audiences on the painting also indicates that they are parts of the microcosm. In its physical aspect, the microcosm is constructed from elements of the inhabitants who live in the real world; in its mental aspect, the awareness of the meaning of life implies harmony between nature and humanity in the Zen painting.