Paperid:1
Authors:Yongheng Zhao, Tolga Birdal, Jan Eric Lenssen, Emanuele Menegatti, Leonidas Guibas, Federico Tombari
Title: Quaternion Equivariant Capsule Networks for 3D Point Clouds
Abstract:
We present a 3D capsule module for processing point clouds that is equivariant to 3D rotations and translations, as well as invariant to permutations of the input points. The operator receives a sparse set of local reference frames, computed from an input point cloud and establishes end-to-end transformation equivariance through a novel dynamic routing procedure on quaternions. Further, we theoretically connect dynamic routing between capsules to the well-known Weiszfeld algorithm, a scheme for solving iterative re-weighted least squares (IRLS) problems with provable convergence properties. It is shown that such group dynamic routing can be interpreted as robust IRLS rotation averaging on capsule votes, where information is routed based on the final inlier scores. Based on our operator, we build a capsule network that disentangles geometry from pose, paving the way for more informative descriptors and a structured latent space. Our architecture allows joint object classification and orientation estimation without explicit supervision of rotations. We validate our algorithm empirically on common benchmark datasets.
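
The connection to the Weiszfeld algorithm mentioned above can be made concrete with a generic, illustrative sketch of IRLS averaging of unit quaternions under a chordal residual. This is a standalone toy example of robust rotation averaging, not the authors' capsule routing procedure; the function name and weighting choices are assumptions.

    import numpy as np

    def irls_quaternion_average(quats, n_iters=10, eps=1e-8):
        """Robust (L1-like) average of unit quaternions via iterative reweighting.
        A generic Weiszfeld-style sketch, not the paper's routing procedure."""
        q = quats / np.linalg.norm(quats, axis=1, keepdims=True)
        # Fix the sign ambiguity: align all quaternions to the first one.
        q[np.sum(q * q[0], axis=1) < 0] *= -1
        mean = q.mean(axis=0)
        mean /= np.linalg.norm(mean)
        for _ in range(n_iters):
            # Residual = chordal distance between each vote and the current mean.
            res = np.linalg.norm(q - mean, axis=1)
            w = 1.0 / np.maximum(res, eps)       # IRLS weights (inlier scores)
            mean = (w[:, None] * q).sum(axis=0)
            mean /= np.linalg.norm(mean)
        return mean, w / w.sum()

    votes = np.random.randn(16, 4)
    avg, inlier_scores = irls_quaternion_average(votes)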



Paperid:2
Authors:Yizhak Ben-Shabat, Stephen Gould
Title: DeepFit: 3D Surface Fitting via Neural Network Weighted Least Squares
Abstract:
We propose a surface fitting method for unstructured 3D point clouds. This method, called DeepFit, incorporates a neural network to learn point-wise weights for weighted least squares polynomial surface fitting. The learned weights act as a soft selection for the neighborhood of surface points, thus avoiding the scale selection required by previous methods. To train the network we propose a novel surface consistency loss that improves point weight estimation. The method enables extracting normal vectors and other geometrical properties, such as principal curvatures, the latter of which were not provided as ground truth during training. We achieve state-of-the-art results on a benchmark normal and curvature estimation dataset, demonstrate robustness to noise, outliers and density variations, and show its application to noise removal.
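
To illustrate the weighted least squares fitting step that DeepFit builds on, here is a minimal sketch that fits a second-order polynomial (jet) to a local neighborhood given per-point weights. In DeepFit these weights come from the network, whereas here they are supplied by hand, and alignment of the local frame with the z axis is an assumption.

    import numpy as np

    def weighted_jet_fit(pts, weights):
        """Weighted least squares fit of z = a0 + a1*x + a2*y + a3*x^2 + a4*x*y + a5*y^2
        to a local neighborhood (assumed roughly aligned with the z axis).
        The per-point weights stand in for the ones DeepFit learns."""
        x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
        A = np.stack([np.ones_like(x), x, y, x * x, x * y, y * y], axis=1)
        W = np.sqrt(weights)[:, None]
        coef, *_ = np.linalg.lstsq(W * A, W[:, 0] * z, rcond=None)
        # Normal at the origin of the local frame from the first-order terms.
        n = np.array([-coef[1], -coef[2], 1.0])
        return coef, n / np.linalg.norm(n)

    pts = np.random.randn(64, 3) * [1.0, 1.0, 0.05]
    coef, normal = weighted_jet_fit(pts, np.ones(len(pts)))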



Paperid:3
Authors:Zhichao Lu, Kalyanmoy Deb, Erik Goodman, Wolfgang Banzhaf, Vishnu Naresh Boddeti
Title: NSGANetV2: Evolutionary Multi-Objective Surrogate-Assisted Neural Architecture Search
Abstract:
In this paper, we propose an efficient NAS algorithm for generating task-specific models that are competitive under multiple competing objectives. It comprises two surrogates, one at the architecture level to improve sample efficiency and one at the weights level, through a supernet, to improve gradient descent training efficiency. On standard benchmark datasets (C10, C100, ImageNet), the resulting models, dubbed NSGANetV2, either match or outperform models from existing approaches, with the search being orders of magnitude more sample efficient. Furthermore, we demonstrate the effectiveness and versatility of the proposed method on six diverse non-standard datasets, e.g., STL-10, Flowers102, Oxford Pets and FGVC Aircraft. In all cases, NSGANetV2s improve the state of the art (under the mobile setting), suggesting that NAS can be a viable alternative to conventional transfer learning approaches in handling diverse scenarios such as small-scale or fine-grained datasets. Code is available at https://github.com/mikelzc1990/nsganetv2



Paperid:4
Authors:Chenyun Wu, Mikayla Timm, Subhransu Maji
Title: Describing Textures using Natural Language
Abstract:
Textures in natural images can be characterized by color, shape, periodicity of elements within them, and other attributes that can be described using natural language. In this paper, we study the problem of describing visual attributes of texture on a novel dataset containing rich descriptions of textures, and conduct a systematic study of current generative and discriminative models for grounding language to images on this dataset. We find that while these models capture some properties of texture, they fail to capture several compositional properties, such as the colors of dots. We provide critical analysis of existing models by generating synthetic but realistic textures with different descriptions. Our dataset also allows us to train interpretable models and generate language-based explanations of what discriminative features are learned by deep networks for fine-grained categorization where texture plays a key role. We present visualizations of several fine-grained domains and show that texture attributes learned on our dataset offer improvements over expert-designed attributes on the Caltech-UCSD Birds dataset.



Paperid:5
Authors:Rizard Renanda Adhi Pramono, Yie Tarng Chen, Wen Hsien Fang
Title: Empowering Relational Network by Self-Attention Augmented Conditional Random Fields for Group Activity Recognition
Abstract:
This paper presents a novel relational network for group activity recognition. The core of our network is to augment the conditional random fields (CRF), amenable to learning inter-dependency of correlated observations, with the newly devised temporal and spatial self-attention to learn the temporal evolution and spatial relational contexts of every actor in videos. Such a combination utilizes the global receptive fields of self-attention to construct a spatio-temporal graph topology to address the temporal dependency and non-local relationships of the actors. The network first uses the temporal self-attention along with the spatial self-attention, which considers multiple cliques with different scales of locality to account for the diversity of the actors' relationships in group activities, to model the pairwise energy of the CRF. Afterward, to accommodate the distinct characteristics of each video, a new mean-field inference algorithm with dynamic halting is also introduced. Finally, a bidirectional universal transformer encoder (UTE), which combines both the forward and backward temporal context information, is used to aggregate the relational contexts and scene information for group activity recognition. Simulations show that the proposed approach surpasses the state-of-the-art methods on the widespread Volleyball and Collective Activity datasets.



Paperid:6
Authors:Shi Chen, Ming Jiang, Jinhui Yang, Qi Zhao
Title: AiR: Attention with Reasoning Capability
Abstract:
While attention has been an increasingly popular component in deep neural networks to both interpret and boost the performance of models, little work has examined how attention progresses to accomplish a task and whether it is reasonable. In this work, we propose an Attention with Reasoning capability (AiR) framework that uses attention to understand and improve the process leading to task outcomes. We first define an evaluation metric based on a sequence of atomic reasoning operations, enabling quantitative measurement of attention that considers the reasoning process. We then collect human eye-tracking and answer correctness data, and analyze various machine and human attentions on their reasoning capability and how they impact task performance. Furthermore, we propose a supervision method to jointly and progressively optimize attention, reasoning, and task performance so that models learn to look at regions of interest by following a reasoning process. We demonstrate the effectiveness of the proposed framework in analyzing and modeling attention with better reasoning capability and task performance. The code and data are available at https://github.com/szzexpoi/AiR



Paperid:7
Authors:Gu Wang, Fabian Manhardt, Jianzhun Shao, Xiangyang Ji, Nassir Navab, Federico Tombari
Title: Self6D: Self-Supervised Monocular 6D Object Pose Estimation
Abstract:
6D object pose estimation is a fundamental problem in computer vision. Convolutional Neural Networks (CNNs) have recently proven to be capable of predicting reliable 6D pose estimates even from monocular images. Nonetheless, CNNs are known to be extremely data-driven, and acquiring adequate annotations is oftentimes very time-consuming and labor-intensive. To overcome this shortcoming, we propose the idea of monocular 6D pose estimation by means of self-supervised learning, removing the need for real annotations. After training our proposed network fully supervised with synthetic RGB data, we leverage recent advances in neural rendering to further self-supervise the model on unannotated real RGB-D data, seeking a visually and geometrically optimal alignment. Extensive evaluations demonstrate that our proposed self-supervision is able to significantly enhance the model's original performance, outperforming all other methods relying on synthetic data or employing elaborate techniques from the domain adaptation realm.



Paperid:8
Authors:Mingqing Xiao, Shuxin Zheng, Chang Liu, Yaolong Wang, Di He, Guolin Ke, Jiang Bian, Zhouchen Lin, Tie-Yan Liu
Title: Invertible Image Rescaling
Abstract:
High-resolution digital images are usually downscaled to fit various display screens or to save the cost of storage and bandwidth, while post-upscaling is adopted to recover the original resolutions or the details in the zoomed-in images. However, typical image downscaling is a non-injective mapping due to the loss of high-frequency information, which leads to the ill-posed problem of the inverse upscaling procedure and poses great challenges for recovering details from the downscaled low-resolution images. Simply upscaling with image super-resolution methods results in unsatisfactory recovery performance. In this work, we propose to solve this problem by modeling the downscaling and upscaling processes from a new perspective, i.e. an invertible bijective transformation, which can largely mitigate the ill-posed nature of image upscaling. We develop an Invertible Rescaling Net (IRN) with a deliberately designed framework and objectives to produce visually-pleasing low-resolution images and meanwhile capture the distribution of the lost information using a latent variable following a specified distribution in the downscaling process. In this way, upscaling is made tractable by inversely passing a randomly-drawn latent variable with the low-resolution image through the network. Experimental results demonstrate the significant improvement of our model over existing methods in terms of both quantitative and qualitative evaluations of image upscaling reconstruction from downscaled images. Code is available at https://github.com/pkuxmq/Invertible-Image-Rescaling.



Paperid:9
Authors:Yingda Xia, Yi Zhang, Fengze Liu, Wei Shen, Alan L. Yuille
Title: Synthesize then Compare: Detecting Failures and Anomalies for Semantic Segmentation
Abstract:
The ability to detect failures and anomalies is a fundamental requirement for building reliable systems for computer vision applications, especially safety-critical applications of semantic segmentation, such as autonomous driving and medical image analysis. In this paper, we systematically study failure and anomaly detection for semantic segmentation and propose a unified framework, consisting of two modules, to address these two related problems. The first module is an image synthesis module, which generates a synthesized image from a segmentation layout map, and the second is a comparison module, which computes the difference between the synthesized image and the input image. We validate our framework on three challenging datasets and improve the state of the art by large margins, i.e., 6% AUPR-Error on Cityscapes, 7% Pearson correlation on pancreatic tumor segmentation in MSD and 20% AUPR on StreetHazards anomaly segmentation.



Paperid:10
Authors:Nelson Nauata, Kai-Hung Chang, Chin-Yi Cheng, Greg Mori, Yasutaka Furukawa
Title: House-GAN: Relational Generative Adversarial Networks for Graph-constrained House Layout Generation
Abstract:
This paper proposes a novel graph-constrained generative adversarial network, whose generator and discriminator are built upon relational architecture. The main idea is to encode the constraint into the graph structure of its relational networks. We have demonstrated the proposed architecture for a new house layout generation problem, whose task is to take an architectural constraint as a graph (i.e., the number and types of rooms with their spatial adjacency) and produce a set of axis-aligned bounding boxes of rooms. We measure the quality of generated house layouts with three metrics: realism, diversity, and compatibility with the input graph constraint. Our qualitative and quantitative evaluations over 117,000 real floorplan images demonstrate the effectiveness of the proposed approach. We will publicly share all our code and data.



Paperid:11
Authors:Zhengqi Li, Wenqi Xian, Abe Davis, Noah Snavely
Title: Crowdsampling the Plenoptic Function
Abstract:
Many popular tourist landmarks are captured in a multitude of online, public photos. These photos represent a sparse and unstructured sampling of the plenoptic function for a particular scene. In this paper, we present a new approach to novel view synthesis under time-varying illumination from such data. Our approach builds on the recent multi-plane image (MPI) format for representing local light fields under fixed viewing conditions. We introduce a new DeepMPI representation, motivated by observations on the sparsity structure of the plenoptic function, that allows for real-time synthesis of photorealistic views that are continuous in both space and across changes in lighting. Our method can synthesize the same compelling parallax and view-dependent effects as previous MPI methods, while simultaneously interpolating along changes in reflectance and illumination with time. We show how to learn a model of these effects in an unsupervised way from an unstructured collection of photos without temporal registration, demonstrating significant improvements over recent work in neural rendering. More information can be found at crowdsampling.io.



Paperid:12
Authors:Hanyue Tu, Chunyu Wang, Wenjun Zeng
Title: VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in the Wild
Abstract:
We present VoxelPose to estimate 3D poses of multiple people from multiple camera views. In contrast to previous efforts, which require establishing cross-view correspondence based on noisy and incomplete 2D pose estimates, VoxelPose directly operates in the 3D space and therefore avoids making incorrect decisions in each camera view. To achieve this goal, features in all camera views are aggregated in the 3D voxel space and fed into a Cuboid Proposal Network (CPN) to localize all people. Then we propose a Pose Regression Network (PRN) to estimate a detailed 3D pose for each proposal. The approach is robust to occlusion, which occurs frequently in practice. Without bells and whistles, it outperforms the previous methods on several public datasets.



Paperid:13
Authors:Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
Title: End-to-End Object Detection with Transformers
Abstract:
We present a new method that views object detection as a direct set prediction problem. Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components like a non-maximum suppression procedure or anchor generation that explicitly encode our prior knowledge about the task. The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture. Given a fixed small set of learned object queries, DETR reasons about the relations of the objects and the global image context to directly output the final set of predictions in parallel. The new model is conceptually simple and does not require a specialized library, unlike many other modern detectors. DETR demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster R-CNN baseline on the challenging COCO object detection dataset. Moreover, we show that DETR can be easily generalized to produce a competitive panoptic segmentation prediction in a unified manner.
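
The set-based loss relies on a one-to-one bipartite matching between predictions and ground-truth objects. The sketch below shows a simplified Hungarian matching step using only a class-probability term and an L1 box term; DETR's actual cost also includes a generalized IoU term, so treat the cost design and shapes as assumptions.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_predictions(pred_probs, pred_boxes, gt_labels, gt_boxes, box_weight=5.0):
        """One-to-one assignment of predictions to ground-truth objects.
        Simplified cost: negative class probability plus weighted L1 box distance."""
        cls_cost = -pred_probs[:, gt_labels]                        # (num_queries, num_gt)
        box_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)
        cost = cls_cost + box_weight * box_cost
        pred_idx, gt_idx = linear_sum_assignment(cost)              # Hungarian algorithm
        return pred_idx, gt_idx

    probs = np.random.dirichlet(np.ones(92), size=100)   # 100 queries, 92 classes
    boxes = np.random.rand(100, 4)
    p, g = match_predictions(probs, boxes, np.array([3, 17]), np.random.rand(2, 4))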



Paperid:14
Authors:Xingkui Wei, Yinda Zhang, Zhuwen Li, Yanwei Fu, Xiangyang Xue
Title: DeepSFM: Structure From Motion Via Deep Bundle Adjustment
Abstract:
Structure from motion (SfM) is an essential computer vision problem which has not been well handled by deep learning. One of the promising trends is to apply an explicit structural constraint, e.g. a 3D cost volume, in the network. However, existing methods usually assume accurate camera poses either from GT or other methods, which is unrealistic in practice. In this work, we design a physically-driven architecture, namely DeepSFM, inspired by traditional Bundle Adjustment (BA), which consists of two cost volume based architectures for depth and pose estimation respectively, iteratively running to improve both. The explicit constraints on both depth (structure) and pose (motion), when combined with the learning components, bring the merit from both traditional BA and emerging deep learning technology. Extensive experiments on various datasets show that our model achieves the state-of-the-art performance on both depth and pose estimation with superior robustness against a smaller number of inputs and noise in the initialization.



Paperid:15
Authors:Yifan Xu, Tianqi Fan, Yi Yuan, Gurprit Singh
Title: Ladybird: Quasi-Monte Carlo Sampling for Deep Implicit Field Based 3D Reconstruction with Symmetry
Abstract:
Deep implicit field regression methods are effective for 3D reconstruction from single-view images. However, the impact of different sampling patterns on the reconstruction quality is not well understood. In this work, we first study the effect of point set discrepancy on the network training. Based on the Farthest Point Sampling algorithm, we propose a sampling scheme that theoretically encourages better generalization performance, and results in fast convergence for SGD-based optimization algorithms. Secondly, based on the reflective symmetry of an object, we propose a feature fusion method that alleviates issues due to self-occlusions, which make it difficult to utilize local image features. Our proposed system Ladybird is able to create high quality 3D object reconstructions from a single input image. We evaluate Ladybird on a large scale 3D dataset (ShapeNet) demonstrating highly competitive results in terms of Chamfer distance, Earth Mover's distance and Intersection Over Union (IoU).



Paperid:16
Authors:Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Huan Huang, Shilei Wen, Errui Ding, Liusheng Huang
Title: Segment as Points for Efficient Online Multi-Object Tracking and Segmentation
Abstract:
Current multi-object tracking and segmentation (MOTS) methods follow the tracking-by-detection paradigm and adopt convolutions for feature extraction. However, as affected by the inherent receptive field, convolution based feature extraction inevitably mixes up the foreground features and the background features, resulting in ambiguities in the subsequent instance association. In this paper, we propose a highly effective method for learning instance embeddings based on segments by converting the compact image representation to an un-ordered 2D point cloud representation. Our method generates a new tracking-by-points paradigm where discriminative instance embeddings are learned from randomly selected points rather than images. Furthermore, multiple informative data modalities are converted into point-wise representations to enrich point-wise features. The resulting online MOTS framework, named PointTrack, surpasses all the state-of-the-art methods including 3D tracking methods by large margins (5.4% higher MOTSA and 18 times faster than MOTSFusion) at near real-time speed (22 FPS). Evaluations across three datasets demonstrate both the effectiveness and efficiency of our method. Moreover, based on the observation that current MOTS datasets lack crowded scenes, we build a more challenging MOTS dataset named APOLLO MOTS with higher instance density. Both APOLLO MOTS and our codes are publicly available at https://github.com/detectRecog/PointTrack.



Paperid:17
Authors:Zhi Tian, Chunhua Shen, Hao Chen
Title: Conditional Convolutions for Instance Segmentation
Abstract:
We propose a simple yet effective instance segmentation framework, termed CondInst (conditional convolutions for instance segmentation). Top-performing instance segmentation methods such as Mask R-CNN rely on ROI operations (typically ROIPool or ROIAlign) to obtain the final instance masks. In contrast, we propose to solve instance segmentation from a new perspective. Instead of using instance-wise ROIs as inputs to a network of fixed weights, we employ dynamic instance-aware networks, conditioned on instances. CondInst enjoys two advantages: 1) Instance segmentation is solved by a fully convolutional network, eliminating the need for ROI cropping and feature alignment. 2) Due to the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference. We demonstrate a simpler instance segmentation method that can achieve improved performance in both accuracy and inference speed. On the COCO dataset, we outperform a few recent methods, including well-tuned Mask R-CNN baselines, without needing longer training schedules. Code is available at https://git.io/AdelaiDet
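
To illustrate the idea of a dynamically-generated, instance-conditioned mask head, here is a minimal PyTorch sketch that unpacks a per-instance parameter vector into three small conv layers and applies them to a shared feature map. The channel sizes and layer layout are illustrative assumptions, not CondInst's exact head.

    import torch
    import torch.nn.functional as F

    def dynamic_mask_head(features, params, channels=8):
        """Apply a tiny instance-specific head (three 1x1 conv layers with
        `channels` channels) to a shared mask feature map. `params` is the flat
        filter/bias vector predicted for one instance; sizes are illustrative."""
        c_in = features.shape[1]
        sizes = [(channels, c_in, 1, 1), (channels,),
                 (channels, channels, 1, 1), (channels,),
                 (1, channels, 1, 1), (1,)]
        chunks, offset = [], 0
        for s in sizes:
            n = int(torch.tensor(s).prod())
            chunks.append(params[offset:offset + n].reshape(s))
            offset += n
        x = features
        for i in range(0, 6, 2):
            x = F.conv2d(x, chunks[i], bias=chunks[i + 1])
            if i < 4:
                x = F.relu(x)
        return x  # (1, 1, H, W) instance mask logits

    feat = torch.randn(1, 16, 56, 56)
    n_params = 8 * 16 + 8 + 8 * 8 + 8 + 1 * 8 + 1
    mask = dynamic_mask_head(feat, torch.randn(n_params))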



Paperid:18
Authors:Taojiannan Yang, Sijie Zhu, Chen Chen, Shen Yan, Mi Zhang, Andrew Willis
Title: MutualNet: Adaptive ConvNet via Mutual Learning from Network Width and Resolution
Abstract:
We propose the width-resolution mutual learning method (MutualNet) to train a network that is executable under dynamic resource constraints to achieve adaptive accuracy-efficiency trade-offs at runtime. Our method trains a cohort of sub-networks with different widths using different input resolutions to mutually learn multi-scale representations for each sub-network. It achieves consistently better ImageNet top-1 accuracy than the state-of-the-art adaptive network US-Net under different computation constraints, and outperforms the best compound scaled MobileNet in EfficientNet by 1.5%. The superiority of our method is also validated on COCO object detection and instance segmentation as well as transfer learning. Surprisingly, the training strategy of MutualNet can also boost the performance of a single network, which substantially outperforms the powerful AutoAugmentation in both efficiency (GPU search hours: 15000 vs. 0) and accuracy (ImageNet: 77.6% vs. 78.6%). Code is provided in the supplementary material.



Paperid:19
Authors:Menglin Jia, Mengyun Shi, Mikhail Sirotenko, Yin Cui, Claire Cardie, Bharath Hariharan, Hartwig Adam, Serge Belongie
Title: Fashionpedia: Ontology, Segmentation, and an Attribute Localization Dataset
Abstract:
In this work, we focus on the task of instance segmentation with attribute localization. This unifies instance segmentation (detect and segment each object instance) and visual categorization of fine-grained attributes (classify one or multiple attributes). The proposed task requires both localizing an object and describing its properties. To illustrate the various aspects of this task, we focus on the domain of fashion and introduce Fashionpedia as a step toward mapping out the visual aspects of the fashion world. Fashionpedia consists of two parts: (1) an ontology built by fashion experts containing 27 main apparel categories, 19 apparel parts, and 294 fine-grained attributes and their relationships and (2) a dataset consisting of everyday and celebrity event fashion images annotated with segmentation masks and their associated fine-grained attributes, built upon the backbone of the Fashionpedia ontology. In order to solve this challenging task, we propose a novel Attribute-Mask R-CNN model to jointly perform instance segmentation and localized attribute recognition, and provide a novel evaluation metric for the task. Fashionpedia is available at https://fashionpedia.github.io/home/.



Paperid:20
Authors:Marcel Geppert, Viktor Larsson, Pablo Speciale, Johannes L. Schönberger, Marc Pollefeys
Title: Privacy Preserving Structure-from-Motion
Abstract:
Over the last years, visual localization and mapping solutions have been adopted by an increasing number of mixed reality and robotics systems. The recent trend towards cloud-based localization and mapping systems has raised significant privacy concerns. These are mainly grounded by the fact that these services require users to upload visual data to their servers, which can reveal potentially confidential information, even if only derived image features are uploaded. Recent research addresses some of these concerns for the task of image-based localization by concealing the geometry of the query images and database maps. The core idea of the approach is to lift 2D/3D feature points to random lines, while still providing sufficient constraints for camera pose estimation. In this paper, we further build upon this idea and propose solutions to the different core algorithms of an incremental Structure-from-Motion pipeline based on random line features. With this work, we make another fundamental step towards enabling privacy preserving cloud-based mapping solutions. Various experiments on challenging real-world datasets demonstrate the practicality of our approach achieving comparable results to standard Structure-from-Motion systems.



Paperid:21
Authors:David Bau, Steven Liu, Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba
Title: Rewriting a Deep Generative Model
Abstract:
A deep generative model such as a GAN learns to model a rich set of semantic and physical rules about the target distribution, but up to now, it has been obscure how such rules are encoded in the network, or how a rule could be changed. In this paper, we introduce a new problem setting: manipulation of specific rules encoded by a deep generative model. To address the problem, we propose a formulation in which the desired rule is changed by manipulating a layer of a deep network as a linear associative memory. We derive an algorithm for modifying one entry of the associative memory, and we demonstrate that several interesting structural rules can be located and modified within the layers of state-of-the-art generative models. We present a user interface to enable users to interactively change the rules of a generative model to achieve desired effects, and we show several proof-of-concept applications. Finally, results on multiple datasets demonstrate the advantage of our method against standard fine-tuning methods and edit transfer algorithms.



Paperid:22
Authors:Jiuniu Wang, Wenjia Xu, Qingzhong Wang, Antoni B. Chan
Title: Compare and Reweight: Distinctive Image Captioning Using Similar Images Sets
Abstract:
A wide range of image captioning models has been developed, achieving significant improvement based on popular metrics, such as BLEU, CIDEr, and SPICE. However, although the generated captions can accurately describe the image, they are generic for similar images and lack distinctiveness, i.e., cannot properly describe the uniqueness of each image. In this paper, we aim to improve the distinctiveness of image captions through training with sets of similar images. First, we propose a distinctiveness metric --- between-set CIDEr (CIDErBtw) to evaluate the distinctiveness of a caption with respect to those of similar images. Our metric shows that the human annotations of each image are not equivalent based on distinctiveness. Thus we propose several new training strategies to encourage the distinctiveness of the generated caption for each image, which are based on using CIDErBtw in a weighted loss function or as a reinforcement learning reward. Finally, extensive experiments are conducted, showing that our proposed approach significantly improves both distinctiveness (as measured by CIDErBtw and retrieval metrics) and accuracy (e.g., as measured by CIDEr) for a wide variety of image captioning baselines. These results are further confirmed through a user study.



Paperid:23
Authors:Zhe Cao, Hang Gao, Karttikeya Mangalam, Qi-Zhi Cai, Minh Vo, Jitendra Malik
Title: Long-term Human Motion Prediction with Scene Context
Abstract:
Human movement is goal-directed and influenced by the spatial layout of the objects in the scene. To plan future human motion, it is crucial to perceive the environment -- imagine how hard it is to navigate a new room with lights off. Existing works on predicting human motion do not pay attention to the scene context and thus struggle in long-term prediction. In this work, we propose a novel three-stage framework that exploits scene context to tackle this task. Given a single scene image and 2D pose histories, our method first samples multiple human motion goals, then plans 3D human paths towards each goal, and finally predicts 3D human pose sequences following each path. For stable training and rigorous evaluation, we contribute a diverse synthetic dataset with clean annotations. In both synthetic and real datasets, our method shows consistent quantitative and qualitative improvements over existing methods.



Paperid:24
Authors:Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng
Title: NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
Abstract:
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location $(x,y,z)$ and viewing direction $(\theta,\phi)$) and whose output is the volume density and view-dependent emitted radiance at that spatial location. We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis. View synthesis results are best viewed as videos, so we urge readers to view our supplementary video for convincing comparisons. We will make our code and data available upon publication.
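
The classic volume rendering step mentioned above can be sketched in a few lines: given per-sample densities, colors, and inter-sample distances along one ray, alpha-composite them into a pixel color. This is a generic NumPy sketch of the standard quadrature, not the full NeRF pipeline; the sample count and spacing are arbitrary.

    import numpy as np

    def composite_ray(densities, colors, deltas):
        """Classic volume rendering: alpha-composite per-sample densities and
        RGB colors along a ray (the quadrature used by NeRF-style methods)."""
        alphas = 1.0 - np.exp(-densities * deltas)          # opacity of each segment
        trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1] + 1e-10]))
        weights = alphas * trans                            # contribution per sample
        rgb = (weights[:, None] * colors).sum(axis=0)
        return rgb, weights

    n = 64
    rgb, w = composite_ray(np.random.rand(n), np.random.rand(n, 3), np.full(n, 0.02))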



Paperid:25
Authors:Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, Leonidas Guibas
Title: ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes
Abstract:
In this work we study the problem of using referential language to identify common objects in real-world 3D scenes. We focus on a challenging setup where the referred object belongs to a fine-grained object class and the underlying scene contains multiple object instances of that class. Due to the scarcity and unsuitability of existent 3D-oriented linguistic resources for this task, we first develop two large-scale and complementary visio-linguistic datasets: i) Sr3D, which contains 83.5K template-based utterances leveraging spatial relations with other fine-grained object classes to localize a referred object in a given scene, and ii) Nr3D, which contains 41.5K natural, free-form utterances collected by deploying a 2-player object reference game in 3D scenes. Using utterances from either dataset, human listeners can recognize the referred object with high accuracy (>86% and 92%, respectively). By tapping into this data, we develop novel neural listeners that can comprehend object-centric natural language and identify the referred object directly in a 3D scene. Our key technical contribution is designing an approach for combining linguistic and geometric information (in the form of 3D point clouds) and creating multi-modal (3D) neural listeners. We also show that architectures which promote object-to-object communication via graph neural networks outperform less context-aware alternatives, and that language-assisted 3D object identification outperforms language-agnostic object classifiers.



Paperid:26
Authors:Benjamin Attal, Selena Ling, Aaron Gokaslan, Christian Richardt, James Tompkin
Title: MatryODShka: Real-time 6DoF Video View Synthesis using Multi-Sphere Images
Abstract:
We introduce a method to convert stereo 360 (omnidirectional stereo) imagery into a layered, multi-sphere image representation for six degree-of-freedom (6DoF) rendering. Stereo 360 imagery can be captured from multi-camera systems for virtual reality (VR) rendering, but lacks motion parallax and correct-in-all-directions disparity cues. Together, these can quickly lead to VR sickness when viewing content. One solution is to try and generate a format suitable for 6DoF rendering, such as by estimating depth. However, this raises questions as to how to handle disoccluded regions in dynamic scenes. Our approach is to simultaneously learn depth and blending weights via a multi-sphere image representation, which can be rendered with correct 6DoF disparity and motion parallax in VR. This significantly improves comfort for the viewer, and can be inferred and rendered in real time on modern GPU hardware. Together, these move towards making VR video a more comfortable immersive medium.



Paperid:27
Authors:Giorgos Tolias, Tomas Jenicek, Ondřej Chum
Title: Learning and Aggregating Deep Local Descriptors for Instance-level Recognition
Abstract:
We propose an efficient method to learn deep local descriptors for instance-level recognition. The training only requires examples of positive and negative image pairs and is performed as metric learning of sum-pooled global image descriptors. At inference, the local descriptors are provided by the activations of internal components of the network. We demonstrate why such an approach learns local descriptors that work well for image similarity estimation with classical efficient match kernel methods. The experimental validation studies the trade-off between performance and memory requirements of the state-of-the-art image search approach based on match kernels. Compared to existing local descriptors, the proposed ones perform better in two instance-level recognition tasks and keep memory requirements lower. We experimentally show that global descriptors are not effective enough at large scale and that local descriptors are essential. We achieve state-of-the-art performance, in some cases even with a backbone network as small as ResNet18.



Paperid:28
Authors:George Terzakis, Manolis Lourakis
Title: A Consistently Fast and Globally Optimal Solution to the Perspective-n-Point Problem
Abstract:
An approach for estimating the pose of a camera given a set of 3D points and their corresponding 2D image projections is presented. It formulates the problem as a non-linear quadratic program and identifies regions in the parameter space that contain unique minima with guarantees that at least one of them will be the global minimum. Each regional minimum is computed with a sequential quadratic programming scheme. These premises result in an algorithm that always determines the global minimum of the perspective-n-point problem for any number of input correspondences, regardless of possible coplanar arrangements of the imaged 3D points. For its implementation, the algorithm merely requires ordinary operations available in any standard off-the-shelf linear algebra library. Comparative evaluation demonstrates that the algorithm achieves state-of-the-art results at a consistently low computational cost.



Paperid:29
Authors:Guangming Wu, Yinqiang Zheng, Zhiling Guo, Zekun Cai, Xiaodan Shi, Xin Ding, Yifei Huang, Yimin Guo, Ryosuke Shibasaki
Title: Learn to Recover Visible Color for Video Surveillance in a Day
Abstract:
In silicon sensors, the interference between visible and near-infrared (NIR) signals is a crucial problem. For all-day video surveillance, commercial camera systems usually adopt an auxiliary NIR cut filter and NIR LED illumination to selectively block or enhance the NIR signal according to the surrounding light conditions. This switching between the daytime and the nighttime mode inevitably involves mechanical parts, and thus requires frequent maintenance. Furthermore, images captured in the nighttime mode lack chrominance, which might hinder human interpretation and subsequent high-level computer vision algorithms. In this paper, we present a deep learning based approach that directly generates human-friendly, visible color for video surveillance in a day. To enable training, we capture well-aligned video pairs through a customized optical device and contribute a large-scale dataset, video surveillance in a day (VSIAD). We propose a novel multi-task deep network with state synchronization modules to better utilize texture and chrominance information. Our trained model generates high-quality visible color images and achieves state-of-the-art performance on multiple metrics as well as subjective judgment.



Paperid:30
Authors:Heming Zhu, Yu Cao, Hang Jin, Weikai Chen, Dong Du, Zhangye Wang, Shuguang Cui, Xiaoguang Han
Title: Deep Fashion3D: A Dataset and Benchmark for 3D Garment Reconstruction from Single Images
Abstract:
High-fidelity clothing reconstruction is the key to achieving photorealism in a wide range of applications including human digitization, virtual try-on, etc. Recent advances in learning-based approaches have accomplished unprecedented accuracy in recovering unclothed human shape and pose from single images, thanks to the availability of powerful statistical models, e.g. SMPL [38], learned from a large number of body scans. In contrast, modeling and recovering clothed humans and 3D garments remains notoriously difficult, mostly due to the lack of large-scale clothing models available for the research community. We propose to fill this gap by introducing Deep Fashion3D, the largest collection to date of 3D garment models, with the goal of establishing a novel benchmark and dataset for the evaluation of image-based garment reconstruction systems. Deep Fashion3D contains 2078 models reconstructed from real garments, which covers 10 different categories and 563 garment instances. It provides rich annotations including 3D feature lines, 3D body pose and the corresponding multi-view real images. In addition, each garment is randomly posed to enhance the variety of real clothing deformations. To demonstrate the advantage of Deep Fashion3D, we propose a novel baseline approach for single-view garment reconstruction, which leverages the merits of both mesh and implicit representations. A novel adaptable template is proposed to enable the learning of all types of clothing in a single network. Extensive experiments have been conducted on the proposed dataset to verify its significance and usefulness. We will make Deep Fashion3D publicly available upon publication.



Paperid:31
Authors:Zhenda Xie, Zheng Zhang, Xizhou Zhu, Gao Huang, Stephen Lin
Title: Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation
Abstract:
In the feature maps of CNNs, there commonly exists considerable spatial redundancy that leads to much repetitive processing. Towards reducing this superfluous computation, we propose to compute features only at sparsely sampled locations, which are probabilistically chosen according to activation responses, and then densely reconstruct the feature map with an efficient interpolation procedure. With this sampling-interpolation scheme, our network avoids expending computation on spatial locations that can be effectively interpolated, while being robust to activation prediction errors through broadly distributed sampling. A technical challenge of this sampling-based approach is that the binary decision variables for representing discrete sampling locations are non-differentiable, making them incompatible with backpropagation. To circumvent this issue, we make use of a reparameterization trick based on the Gumbel-Softmax distribution, with which backpropagation can iterate these variables towards binary values. The presented network is experimentally shown to save substantial computation while maintaining accuracy over a variety of computer vision tasks.
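
A minimal sketch of the Gumbel-Softmax reparameterization used to make binary sampling decisions differentiable, based on PyTorch's built-in straight-through estimator. Treating each spatial location as a two-way keep/skip choice is an illustrative assumption, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def gumbel_softmax_mask(logits, tau=1.0, hard=True):
        """Relaxed binary sampling of per-location 'compute here?' decisions.
        Uses torch's built-in gumbel_softmax with a straight-through estimator;
        the two channels stand for skip/keep at each spatial location."""
        # logits: (H*W, 2) unnormalized scores for [skip, keep]
        y = F.gumbel_softmax(logits, tau=tau, hard=hard)
        return y[:, 1]  # differentiable (straight-through) binary keep mask

    logits = torch.randn(32 * 32, 2, requires_grad=True)
    mask = gumbel_softmax_mask(logits)
    mask.sum().backward()   # gradients flow back to the logits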



Paperid:32
Authors:Han Qiu, Yuchen Ma, Zeming Li, Songtao Liu, Jian Sun
Title: BorderDet: Border Feature for Dense Object Detection
Abstract:
Dense object detectors rely on the sliding-window paradigm that predicts the object over a regular grid of the image. Meanwhile, the feature maps at the grid points are adopted to generate the bounding box predictions. The point feature is convenient to use but may lack the explicit border information needed for accurate localization. In this paper, we propose a simple and efficient operator called BorderAlign to extract "border features" from the extreme points of the border to enhance the point feature. Based on BorderAlign, we design a novel detection architecture called BorderDet, which explicitly exploits the border information for stronger classification and more accurate localization. With a ResNet-50 backbone, our method improves the single-stage detector FCOS by 2.8 AP (38.6 vs. 41.4). With a ResNeXt-101-DCN backbone, our BorderDet obtains 50.3 AP, outperforming the existing state-of-the-art approaches.



Paperid:33
Authors:Genki Osada, Budrul Ahsan, Revoti Prasad Bora, Takashi Nishide
Title: Regularization with Latent Space Virtual Adversarial Training
Abstract:
Virtual Adversarial Training (VAT) has shown impressive results among recently developed regularization methods called consistency regularization. VAT utilizes adversarial samples, generated by injecting perturbation in the input space, for training and thereby enhances the generalization ability of a classifier. However, such adversarial samples can be generated only within a very small area around the input data point, which limits the adversarial effectiveness of such samples. To address this problem we propose LVAT (Latent space VAT), which injects perturbation in the latent space instead of the input space. LVAT can generate adversarial samples flexibly, resulting in a more adverse effect and thus more effective regularization. The latent space is built by a generative model, and in this paper we examine two types of models: variational auto-encoder and normalizing flow, specifically Glow. We evaluated the performance of our method in both supervised and semi-supervised learning scenarios for an image classification task using the SVHN and CIFAR-10 datasets. In our evaluation, we found that our method outperforms VAT and other state-of-the-art methods.



Paperid:34
Authors:Yinda Zhang, Neal Wadhwa, Sergio Orts-Escolano, Christian Häne, Sean Fanello, Rahul Garg
Title: Du²Net: Learning Depth Estimation from Dual-Cameras and Dual-Pixels
Abstract:
Computational stereo has reached a high level of accuracy, but degrades in the presence of occlusions, repeated textures, and correspondence errors along edges. We present a novel approach based on neural networks for depth estimation that combines stereo from dual cameras with stereo from a dual-pixel sensor, which is increasingly common on consumer cameras. Our network uses a novel architecture to fuse these two sources of information and can overcome the above-mentioned limitations of pure binocular stereo matching. Our method provides a dense depth map with sharp edges, which is crucial for computational photography applications like synthetic shallow depth-of-field or 3D Photos. Additionally, we avoid the inherent ambiguity due to the aperture problem in stereo cameras by designing the stereo baseline to be orthogonal to the dual-pixel baseline. We present experiments and comparisons with state-of-the-art approaches that show that our method offers a substantial improvement over previous works.



Paperid:35
Authors:Jaekyeom Kim, Hyoungseok Kim, Gunhee Kim
Title: Model-Agnostic Boundary-Adversarial Sampling for Test-Time Generalization in Few-Shot Learning
Abstract:
Few-shot learning is an important research problem that tackles one of the greatest challenges of machine learning: learning a new task from a limited amount of labeled data. We propose a model-agnostic method that improves the test-time performance of any few-shot learning models with no additional training, and thus is free from the training-test domain gap. Based on only the few support samples in a meta-test task, our method generates the samples adversarial to the base few-shot classifier's boundaries and fine-tunes its embedding function in the direction that increases the classification margins of the adversarial samples. Consequently, the embedding space becomes denser around the labeled samples which makes the classifier robust to query samples. Experimenting on miniImageNet, CIFAR-FS, and FC100, we demonstrate that our method brings significant performance improvement to three different base methods with various properties, and achieves the state-of-the-art performance in a number of few-shot learning tasks.



Paperid:36
Authors:Jiawang Bai, Bin Chen, Yiming Li, Dongxian Wu, Weiwei Guo, Shu-Tao Xia, En-Hui Yang
Title: Targeted Attack for Deep Hashing based Retrieval
Abstract:
The deep hashing based retrieval method is widely adopted in large-scale image and video retrieval. However, there is little investigation on its security. In this paper, we propose a novel method, dubbed deep hashing targeted attack (DHTA), to study the targeted attack on such retrieval. Specifically, we first formulate the targeted attack as a point-to-set optimization, which minimizes the average distance between the hash code of an adversarial example and those of a set of objects with the target label. Then we design a novel component-voting scheme to obtain an anchor code as the representative of the set of hash codes of objects with the target label, whose optimality guarantee is also theoretically derived. To balance the performance and perceptibility, we propose to minimize the Hamming distance between the hash code of the adversarial example and the anchor code under the $\ell^\infty$ restriction on the perturbation. Extensive experiments verify that DHTA is effective in attacking both deep hashing based image retrieval and video retrieval.
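
The component-voting scheme for the anchor code and the Hamming distance on binary codes can be sketched as follows, assuming {-1, +1} hash codes; variable names and the tie-breaking rule are illustrative assumptions.

    import numpy as np

    def anchor_code(target_codes):
        """Component-wise voting over the {-1, +1} hash codes of objects sharing
        the target label: the sign of the per-bit sum."""
        votes = np.sign(target_codes.sum(axis=0))
        votes[votes == 0] = 1          # break ties arbitrarily
        return votes

    def hamming_distance(a, b):
        # For {-1, +1} codes of length K: d_H = (K - a.b) / 2
        return 0.5 * (len(a) - np.dot(a, b))

    codes = np.sign(np.random.randn(20, 48))   # 20 objects, 48-bit codes
    anchor = anchor_code(codes)
    d = hamming_distance(anchor, codes[0])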



Paperid:37
Authors:Hongwei Yong, Jianqiang Huang, Xiansheng Hua, Lei Zhang
Title: Gradient Centralization: A New Optimization Technique for Deep Neural Networks
Abstract:
Optimization techniques are of great importance to effectively and efficiently train a deep neural network (DNN). It has been shown that using the first and second order statistics (e.g., mean and variance) to perform Z-score standardization on network activations or weight vectors, such as batch normalization (BN) and weight standardization (WS), can improve the training performance. Different from those previous methods that mostly operate on activations or weights, we present a new optimization technique, namely gradient centralization (GC), which operates directly on gradients by centralizing the gradient vectors to have zero mean. GC can be viewed as a projected gradient descent method with a constrained loss function. We show that GC can regularize both the weight space and output feature space so that it can boost the generalization performance of DNNs. Moreover, GC improves the Lipschitzness of the loss function and its gradient so that the training process becomes more efficient and stable. GC is very simple to implement and can be easily embedded into existing gradient based DNN optimizers with only one line of code. It can also be directly used to fine-tune the pre-trained DNNs. Our experiments on various applications, including general image classification, fine-grained image classification, detection and segmentation, demonstrate that GC can consistently improve the performance of DNN learning.
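
Since GC amounts to removing the mean of each gradient vector, a minimal sketch of the operation on a weight gradient looks like the following; it is applied manually before the optimizer step here, while embedding it inside an optimizer (as the abstract suggests) is left out. The per-filter mean convention is an assumption for illustration.

    import torch

    def centralize_gradient(grad):
        """Gradient centralization sketch: subtract the mean over all dimensions
        except the output dimension, so each filter's gradient has zero mean."""
        if grad.dim() > 1:
            grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
        return grad

    # Example: apply to a conv weight gradient before the optimizer step.
    w = torch.randn(64, 32, 3, 3, requires_grad=True)
    loss = (w ** 2).sum()
    loss.backward()
    w.grad = centralize_gradient(w.grad)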



Paperid:38
Authors:Jirong Zhang, Chuan Wang, Shuaicheng Liu, Lanpeng Jia, Nianjin Ye, Jue Wang, Ji Zhou, Jian Sun
Title: Content-Aware Unsupervised Deep Homography Estimation
Abstract:
Homography estimation is a basic image alignment method in many applications. It is usually done by extracting and matching sparse feature points, which are error-prone in low-light and low-texture images. On the other hand, previous deep homography approaches use either synthetic images for supervised learning or aerial images for unsupervised learning, both ignoring the importance of handling depth disparities and moving objects in real-world applications. To overcome these problems, in this work we propose an unsupervised deep homography method with a new architecture design. In the spirit of the RANSAC procedure in traditional methods, we specifically learn an outlier mask to select only reliable regions for homography estimation. We calculate loss with respect to our learned deep features instead of directly comparing image content as was done previously. To achieve unsupervised training, we also formulate a novel triplet loss customized for our network. We verify our method by conducting comprehensive comparisons on a new dataset that covers a wide range of scenes with varying degrees of difficulties for the task. Experimental results reveal that our method outperforms the state-of-the-art including deep solutions and feature-based solutions.



Paperid:39
Authors:Mihai Dusmanu, Johannes L. Schönberger, Marc Pollefeys
Title: Multi-View Optimization of Local Feature Geometry
Abstract:
In this work, we address the problem of refining the geometry of local image features from multiple views without known scene or camera geometry. Current approaches to local feature detection are inherently limited in their keypoint localization accuracy because they only operate on a single view. This limitation has a negative impact on downstream tasks such as Structure-from-Motion, where inaccurate keypoints lead to large errors in triangulation and camera localization. Our proposed method naturally complements the traditional feature extraction and matching paradigm. We first estimate local geometric transformations between tentative matches and then optimize the keypoint locations over multiple views jointly according to a non-linear least squares formulation. Throughout a variety of experiments, we show that our method consistently improves the triangulation and camera localization performance for both hand-crafted and learned local features.



Paperid:40
Authors:Jingjing Shen, Thomas J. Cashman, Qi Ye, Tim Hutton, Toby Sharp, Federica Bogo, Andrew Fitzgibbon, Jamie Shotton
Title: The Phong Surface: Efficient 3D Model Fitting using Lifted Optimization
Abstract:
Realtime perceptual and interaction capabilities in mixed reality require a range of 3D tracking problems to be solved at low latency on resource-constrained hardware such as head-mounted devices. Indeed, for devices such as HoloLens 2 where the CPU and GPU are left available for applications, multiple tracking subsystems are required to run on a continuous, real-time basis while sharing a single Digital Signal Processor. To solve model-fitting problems for HoloLens 2 hand tracking, where the computational budget is approximately 100 times smaller than an iPhone 7, we introduce a new surface model: the 'Phong surface'. Using ideas from computer graphics, the Phong surface describes the same 3D shape as a triangulated mesh model, but with continuous surface normals which enable the use of lifting-based optimization, providing significant efficiency gains over ICP-based methods. We show that Phong surfaces retain the convergence benefits of smoother surface models, while triangle meshes do not.
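
The core idea borrowed from Phong shading, continuous normals obtained by interpolating per-vertex normals over a triangle, can be sketched as follows. This is a toy illustration of the concept only, not the paper's surface model or its lifted optimizer; the example geometry is made up.

    import numpy as np

    def phong_point_and_normal(verts, vert_normals, bary):
        """Interpolate a surface point and a continuous normal from triangle
        vertices and per-vertex normals using barycentric coordinates."""
        p = bary @ verts            # (3,) point on the triangle
        n = bary @ vert_normals     # smoothly varying interpolated normal
        return p, n / np.linalg.norm(n)

    tri = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0]])
    normals = np.array([[0.0, 0, 1], [0.1, 0, 1], [0, 0.1, 1]])
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    p, n = phong_point_and_normal(tri, normals, np.array([0.2, 0.3, 0.5]))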



Paperid:41
Authors:Miao Liu, Siyu Tang, Yin Li, James M. Rehg
Title: Forecasting Human-Object Interaction: Joint Prediction of Motor Attention and Actions in First Person Video
Abstract:
We address the challenging task of anticipating human-object interaction in first person videos. Most existing methods either ignore how the camera wearer interacts with objects, or simply consider body motion as a separate modality. In contrast, we observe that intentional hand movement reveals critical information about the future activity. Motivated by this observation, we adopt intentional hand movement as a feature representation, and propose a novel deep network that jointly models and predicts the egocentric hand motion, interaction hotspots and future action. Specifically, we consider the future hand motion as the motor attention, and model this attention using probabilistic variables in our deep model. The predicted motor attention is further used to select the discriminative spatial-temporal visual features for predicting actions and interaction hotspots. We present extensive experiments demonstrating the benefit of the proposed joint model. Importantly, our model produces new state-of-the-art results for action anticipation on both the EGTEA Gaze+ and the EPIC-Kitchens datasets. Our project page is available at https://aptx4869lm.github.io/ForecastingHOI/



Paperid:42
Authors:Jamie Watson, Oisin Mac Aodha, Daniyar Turmukhambetov, Gabriel J. Brostow, Michael Firman
Title: Learning Stereo from Single Images
Abstract:
Supervised deep networks are among the best methods for finding correspondences in stereo image pairs. Like all supervised approaches, these networks require ground truth data during training. However, collecting large quantities of accurate dense correspondence data is very challenging. We propose that it is unnecessary to have such a high reliance on ground truth depths or even corresponding stereo pairs. Inspired by recent progress in monocular depth estimation, we generate plausible disparity maps from single images. In turn, we use those flawed disparity maps in a carefully designed pipeline to generate stereo training pairs. Training in this manner makes it possible to convert any collection of single RGB images into stereo training data. This results in a significant reduction in human effort, with no need to collect real depths or to hand-design synthetic data. We can consequently train a stereo matching network from scratch on datasets like COCO, which were previously hard to exploit for stereo. Through extensive experiments we show that our approach outperforms stereo networks trained with standard synthetic datasets, when evaluated on KITTI, ETH3D, and Middlebury.



Paperid:43
Authors:Jinlu Liu, Liang Song, Yongqiang Qin
Title: Prototype Rectification for Few-Shot Learning
Abstract:
Few-shot learning requires recognizing novel classes with scarce labeled data. Prototypical networks are useful in existing research; however, training on the narrow distribution of scarce data usually tends to produce biased prototypes. In this paper, we figure out two key influencing factors of the process: the intra-class bias and the cross-class bias. We then propose a simple yet effective approach for prototype rectification in the transductive setting. The approach utilizes label propagation to diminish the intra-class bias and feature shifting to diminish the cross-class bias. We also conduct theoretical analysis to derive its rationality as well as a lower bound on the performance. Effectiveness is shown on three few-shot benchmarks. Notably, our approach achieves state-of-the-art performance on both miniImageNet (70.31% on 1-shot and 81.89% on 5-shot) and tieredImageNet (78.74% on 1-shot and 86.92% on 5-shot).
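
A rough sketch of the cross-class bias reduction by feature shifting, followed by cosine similarity to class prototypes. The label-propagation step that handles the intra-class bias is omitted, and the exact normalization and shapes are assumptions for illustration.

    import numpy as np

    def rectified_prototypes(support, support_labels, query, n_way):
        """Shift query features by the mean difference between the support and
        query sets (cross-class bias reduction), then score queries against
        class prototypes with cosine similarity."""
        shift = support.mean(axis=0) - query.mean(axis=0)
        query = query + shift
        protos = np.stack([support[support_labels == c].mean(axis=0)
                           for c in range(n_way)])
        protos /= np.linalg.norm(protos, axis=1, keepdims=True)
        q = query / np.linalg.norm(query, axis=1, keepdims=True)
        return q @ protos.T          # (num_query, n_way) similarity scores

    sims = rectified_prototypes(np.random.randn(25, 64),
                                np.repeat(np.arange(5), 5),
                                np.random.randn(75, 64), n_way=5)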



Paperid:44
Authors:Qianqian Wang, Xiaowei Zhou, Bharath Hariharan, Noah Snavely
Title: Learning Feature Descriptors using Camera Pose Supervision
Abstract:
Recent research on learned visual descriptors has shown promising improvements in correspondence estimation, a key component of many 3D vision tasks. However, existing descriptor learning frameworks typically require ground-truth correspondences between feature points for training, which are challenging to acquire at scale. In this paper we propose a novel weakly-supervised framework that can learn feature descriptors solely from relative camera poses between images. To do so, we devise both a new loss function that exploits the epipolar constraint given by camera poses, and a new model architecture that makes the whole pipeline differentiable and efficient. Because we no longer need pixel-level ground-truth correspondences, our framework opens up the possibility of training on much larger and more diverse datasets for better and unbiased descriptors. We call the resulting descriptors CAmera Pose Supervised, or CAPS, descriptors. Though trained with weak supervision, CAPS descriptors outperform even prior fully-supervised descriptors and achieve state-of-the-art performance on a variety of geometric tasks.
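
The epipolar constraint supplied by relative camera poses can serve as a weak supervision signal. Below is a generic sketch of the symmetric point-to-epipolar-line distance given a fundamental matrix; how CAPS turns this into a differentiable descriptor loss is not reproduced here, and the inputs are made-up homogeneous pixel coordinates.

    import numpy as np

    def epipolar_distance(F, x1, x2):
        """Distance from a matched point x2 to the epipolar line F @ x1, averaged
        with the symmetric term in the other image. Points are homogeneous."""
        l2 = F @ x1                 # epipolar line in image 2
        l1 = F.T @ x2               # epipolar line in image 1
        d2 = abs(x2 @ l2) / np.linalg.norm(l2[:2])
        d1 = abs(x1 @ l1) / np.linalg.norm(l1[:2])
        return 0.5 * (d1 + d2)

    F = np.random.randn(3, 3)
    d = epipolar_distance(F, np.array([100.0, 50.0, 1.0]), np.array([110.0, 55.0, 1.0]))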



Paperid:45
Authors:Xiangtai Li, Ansheng You, Zhen Zhu, Houlong Zhao, Maoke Yang, Kuiyuan Yang, Shaohua Tan, Yunhai Tong
Title: Semantic Flow for Fast and Accurate Scene Parsing
Abstract:
In this paper, we focus on designing an effective method for fast and accurate scene parsing. A common practice to improve performance is to attain high-resolution feature maps with strong semantic representation. The two widely used strategies, atrous convolutions and feature pyramid fusion, are either computation intensive or ineffective. Inspired by optical flow for motion alignment between adjacent video frames, we propose a Flow Alignment Module (FAM) to learn Semantic Flow between feature maps of adjacent levels, and broadcast high-level features to high-resolution features effectively and efficiently. Furthermore, integrating our module into a common feature pyramid structure exhibits superior performance over other real-time methods, even on light-weight backbone networks such as ResNet-18. Extensive experiments are conducted on several challenging datasets, including Cityscapes, PASCAL Context, ADE20K and CamVid. In particular, our network is the first to achieve 80.4% mIoU on Cityscapes with a frame rate of 26 FPS. The code is available at https://github.com/donnyyou/torchcv."



Paperid:46
Authors:Jogendra Nath Kundu, Mugalodi Rakesh, Varun Jampani, Rahul Mysore Venkatesh, R. Venkatesh Babu
Title: Appearance Consensus Driven Self-Supervised Human Mesh Recovery
Abstract:
We present a self-supervised human mesh recovery framework to infer human pose and shape from monocular images in the absence of any paired supervision. Recent advances have shifted interest towards directly regressing the parameters of a parametric human model by supervising them on large-scale image datasets with 2D landmark annotations. This limits the generalizability of such approaches to samples from unlabeled wild environments. Acknowledging this, we propose a novel appearance-consensus-driven self-supervised objective. To effectively disentangle the foreground (FG) human, we rely on image pairs depicting the same person (consistent FG) in varied pose and background (BG), obtained from unlabeled wild videos. The proposed FG appearance consistency objective makes use of a novel, differentiable Color-recovery module to obtain vertex colors without involving any trainable appearance extraction network, via efficient realization of color-picking and reflectional symmetry. We achieve state-of-the-art results on the standard model-based 3D pose estimation benchmarks at comparable supervision levels. Furthermore, the resulting colored mesh prediction opens up usage of our framework for a variety of appearance-related tasks beyond pose and shape estimation, thus establishing our superior generalizability."



Paperid:47
Authors:Mark Sheinin, Dinesh N. Reddy, Matthew O’Toole, Srinivasa G. Narasimhan
Title: Diffraction Line Imaging
Abstract:
We present a novel computational imaging principle that combines diffractive optics with line (1D) sensing. When light passes through a diffraction grating, it disperses as a function of wavelength. We exploit this principle to recover 2D and even 3D positions from only line images. We derive a detailed image formation model and a learning-based algorithm for 2D position estimation. We show several extensions of our system to improve the accuracy of the 2D positioning and expand the effective field of view. We demonstrate our approach in two applications: (a) fast passive imaging of sparse light sources like street lamps, headlights at night and LED-based motion capture, and (b) structured light 3D scanning with line illumination and line sensing. Line imaging has several advantages over 2D sensors: high frame rate, high dynamic range, high fill-factor with additional on-chip computation, low-cost beyond the visible spectrum, and high energy efficiency when used with line illumination. Thus, our system is able to achieve high-speed and high-accuracy 2D positioning of light sources and 3D scanning of scenes."



Paperid:48
Authors:Minyoung Huh, Richard Zhang, Jun-Yan Zhu, Sylvain Paris, Aaron Hertzmann
Title: Aligning and Projecting Images to Class-conditional Generative Networks
Abstract:
We present a method for projecting an input image into the space of a class-conditional generative neural network. We propose a method that optimizes for transformations to counteract the model biases of generative neural networks. Specifically, we demonstrate that one can solve for image translation, scale, and global color transformation during the projection optimization to address the object-center bias and color bias of a Generative Adversarial Network. This projection process poses a difficult optimization problem, and purely gradient-based optimizations fail to find good solutions. We describe a hybrid optimization strategy that finds good projections by estimating transformations and class parameters. We show the effectiveness of our method on real images and further demonstrate how the corresponding projections lead to better editability of these images."



Paperid:49
Authors:Xiaoqi Zhao, Youwei Pang, Lihe Zhang, Huchuan Lu, Lei Zhang
Title: Suppress and Balance: A Simple Gated Network for Salient Object Detection
Abstract:
Most salient object detection approaches use U-Net or feature pyramid networks (FPN) as their basic structures. These methods ignore two key problems when the encoder exchanges information with the decoder: one is the lack of interference control between them, and the other is the failure to consider the disparate contributions of different encoder blocks. In this work, we propose a simple gated network (GateNet) to solve both issues at once. With the help of multilevel gate units, the valuable context information from the encoder can be optimally transmitted to the decoder. We design a novel gated dual branch structure to build cooperation among different levels of features and improve the discriminability of the whole network. Through the dual branch design, more details of the saliency map can be further restored. In addition, we adopt atrous spatial pyramid pooling based on the proposed “Fold” operation (Fold-ASPP) to accurately localize salient objects of various scales. Extensive experiments on five challenging datasets demonstrate that the proposed model performs favorably against most state-of-the-art methods under different evaluation metrics."



Paperid:50
Authors:Chen Wang, Wenshan Wang, Yuheng Qiu, Yafei Hu, Sebastian Scherer
Title: Visual Memorability for Robotic Interestingness via Unsupervised Online Learning
Abstract:
In this paper, we explore the problem of interesting scene prediction for mobile robots. This area is currently underexplored but is crucial for many practical applications such as autonomous exploration and decision making. Inspired by industrial demands, we first propose a novel translation-invariant visual memory for recalling and identifying interesting scenes, then design a three-stage architecture of long-term, short-term, and online learning. This enables our system to learn human-like experience, environmental knowledge, and online adaption, respectively. Our approach achieves much higher accuracy than the state-of-the-art algorithms on challenging robotic interestingness datasets."



Paperid:51
Authors:Jun Fang, Ali Shafiee, Hamzah Abdel-Aziz, David Thorsley, Georgios Georgiadis, Joseph H. Hassoun
Title: Post-Training Piecewise Linear Quantization for Deep Neural Networks
Abstract:
Quantization plays an important role in the energy-efficient deployment of Deep Neural Networks (DNNs) on resource-limited devices. Post-training quantization is highly desirable since it does not require retraining or access to the full training dataset. The well-established uniform scheme for post-training quantization achieves satisfactory results by converting DNNs from full-precision to 8-bit fixed-point integers. However, it suffers from significant performance degradation when quantizing to lower bit-widths. In this paper, we propose a PieceWise Linear Quantization (PWLQ) scheme to enable accurate approximation for tensor values that have bell-shaped distributions with long tails. Our approach breaks the entire quantization range into non-overlapping regions for each tensor, with each region being assigned an equal number of quantization levels. Optimal breakpoints that divide the entire range are found by minimizing the quantization error. Compared to state-of-the-art post-training quantization methods, experimental results show that our proposed method achieves superior performance on image classification, semantic segmentation, and object detection with minor overhead. "
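A toy version of the breakpoint search reads as follows: the symmetric weight range is split into a center region and two tails, each region gets its own uniform grid, and the breakpoint is chosen by brute-force minimization of the squared quantization error. This is a simplified sketch under our own assumptions (e.g. the per-region level budget), not the paper's exact formulation.

```python
import numpy as np

def quantize_uniform(x, lo, hi, n_levels):
    """Uniform affine quantization of x onto n_levels levels over [lo, hi]."""
    scale = (hi - lo) / (n_levels - 1)
    q = np.clip(np.round((x - lo) / scale), 0, n_levels - 1)
    return q * scale + lo

def pwlq_sketch(weights, bits=4, n_candidates=50):
    """Illustrative piecewise linear quantization: a center region [-p, p]
    and two tails, each with its own uniform grid; the breakpoint p is chosen
    to minimize the mean squared quantization error."""
    m = np.abs(weights).max()
    levels = 2 ** (bits - 1)  # illustrative per-region level budget
    best_p, best_err, best_q = None, np.inf, None
    for p in np.linspace(0.05 * m, 0.95 * m, n_candidates):
        center = np.clip(weights, -p, p)           # bell-shaped body
        tails = weights - center                   # long-tail residual
        q = quantize_uniform(center, -p, p, 2 * levels) + \
            np.sign(tails) * quantize_uniform(np.abs(tails), 0.0, m - p, levels)
        err = np.mean((q - weights) ** 2)
        if err < best_err:
            best_p, best_err, best_q = p, err, q
    return best_q, best_p
```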



Paperid:52
Authors:Yang Zou, Xiaodong Yang, Zhiding Yu, B.V.K. Vijaya Kumar, Jan Kautz
Title: Joint Disentangling and Adaptation for Cross-Domain Person Re-Identification
Abstract:
Although significant progress has been witnessed in supervised person re-identification (re-id), it remains challenging to generalize re-id models to new domains due to the huge domain gaps. Recently, there has been a growing interest in using unsupervised domain adaptation to address this scalability issue. Existing methods typically conduct adaptation on a representation space that contains both id-related and id-unrelated factors, thus inevitably undermining the adaptation efficacy of id-related features. In this paper, we seek to improve adaptation by purifying the representation space to be adapted. To this end, we propose a joint learning framework that disentangles id-related/unrelated features and enforces adaptation to work on the id-related feature space exclusively. Our model involves a disentangling module that encodes cross-domain images into a shared appearance space and two separate structure spaces, and an adaptation module that performs adversarial alignment and self-training on the shared appearance space. The two modules are co-designed to be mutually beneficial. Extensive experiments demonstrate that the proposed joint learning framework outperforms the state-of-the-art methods by clear margins."



Paperid:53
Authors:Lijie Fan, Tianhong Li, Yuan Yuan, Dina Katabi
Title: In-Home Daily-Life Captioning Using Radio Signals
Abstract:
This paper aims to caption daily life --i.e., to create a textual description of people's activities and interactions with objects in their homes. Addressing this problem requires novel methods beyond traditional video captioning, as most people would have privacy concerns about deploying cameras throughout their homes. We introduce RF-Diary, a new model for captioning daily life by analyzing the privacy-preserving radio signal in the home with the home's floormap. RF-Diary can further observe and caption people's life through walls and occlusions and in dark settings. In designing RF-Diary, we exploit the ability of radio signals to capture people's 3D dynamics, and use the floormap to help the model learn people's interactions with objects. We also use a multi-modal feature alignment training scheme that leverages existing video-based captioning datasets to improve the performance of our radio-based captioning model. Extensive experimental results demonstrate that RF-Diary generates accurate captions under visible conditions. It also sustains its good performance in dark or occluded settings, where video-based captioning approaches fail to generate meaningful captions."



Paperid:54
Authors:Zeyi Huang, Haohan Wang, Eric P. Xing, Dong Huang
Title: Self-Challenging Improves Cross-Domain Generalization
Abstract:
Convolutional Neural Networks (CNNs) conduct image classification by activating dominant features that correlate with labels. When the training and testing data follow similar distributions, their dominant features are similar, leading to decent test performance. Performance nonetheless degrades when the testing distribution differs, which is the central challenge in cross-domain image classification. We introduce a simple training heuristic, Representation Self-Challenging (RSC), that significantly improves the generalization of CNNs to out-of-domain data. RSC iteratively challenges (discards) the dominant features activated on the training data, and forces the network to activate remaining features that correlate with labels. This process appears to activate feature representations applicable to out-of-domain data without prior knowledge of the new domain and without learning extra network parameters. We present the theoretical properties and conditions of RSC for improving cross-domain generalization. The experiments endorse the simple, effective, and architecture-agnostic nature of our RSC method."
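A minimal PyTorch-style rendering of the heuristic, assuming spatially pooled features and a feature-wise (rather than spatial-wise) challenge, is given below; the dropped percentage and the helper name are illustrative, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def rsc_loss(features, labels, classifier, drop_pct=0.33):
    """Sketch of Representation Self-Challenging: mute the feature dimensions
    most responsible for the correct prediction and train on what remains.
    features: (B, D) penultimate activations (still attached to the backbone graph);
    classifier: an nn.Linear head; labels: (B,) class indices."""
    logits = classifier(features)
    score = logits.gather(1, labels[:, None]).sum()
    # Gradient of the ground-truth logits w.r.t. the features measures how
    # strongly each dimension drives the correct class.
    grads, = torch.autograd.grad(score, features, retain_graph=True)
    thresh = torch.quantile(grads, 1.0 - drop_pct, dim=1, keepdim=True)
    mask = (grads < thresh).float()          # zero out the dominant dimensions
    return F.cross_entropy(classifier(features * mask), labels)
```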



Paperid:55
Authors:Qing Li, Siyuan Huang, Yining Hong, Song-Chun Zhu
Title: A Competence-aware Curriculum for Visual Concepts Learning via Question Answering
Abstract:
Humans can progressively learn visual concepts from easy to hard questions. To mimic this efficient learning ability, we propose a competence-aware curriculum for visual concept learning in a question-answering manner. Specifically, we design a neural-symbolic concept learner for learning the visual concepts and a multi-dimensional Item Response Theory (mIRT) model for guiding the learning process with an adaptive curriculum. The mIRT effectively estimates the concept difficulty and the model competence at each learning step from accumulated model responses. The estimated concept difficulty and model competence are further utilized to select the most profitable training samples. Experimental results on CLEVR show that with a competence-aware curriculum, the proposed method achieves state-of-the-art performances with superior data efficiency and convergence speed. Specifically, the proposed model only uses 40% of training data and converges three times faster compared with other state-of-the-art methods. "



Paperid:56
Authors:Chengzhi Mao, Amogh Gupta, Vikram Nitin, Baishakhi Ray, Shuran Song , Junfeng Yang, Carl Vondrick
Title: Multitask Learning Strengthens Adversarial Robustness
Abstract:
Although deep networks achieve strong accuracy on a range of computer vision benchmarks, they remain vulnerable to adversarial attacks, where imperceptible input perturbations fool the network. We present both theoretical and empirical analyses that connect the adversarial robustness of a model to the number of tasks that it is trained on. Experiments on two datasets show that attack difficulty increases as the number of target tasks increase. Moreover, our results suggest that when models are trained on multiple tasks at once, they become more robust to adversarial attacks on individual tasks. While adversarial defense remains an open challenge, our results suggest that deep networks are vulnerable partly because they are trained on too few tasks."



Paperid:57
Authors:Zhihang Yuan, Bingzhe Wu, Guangyu Sun, Zheng Liang, Shiwan Zhao, Weichen Bi
Title: S2DNAS: Transforming Static CNN Model for Dynamic Inference via Neural Architecture Search
Abstract:
Recently, dynamic inference has emerged as a promising way to reduce the computational cost of deep convolutional neural networks (CNNs). In contrast to static methods (e.g., weight pruning), dynamic inference adaptively adjusts the inference process according to each input sample, which can considerably reduce the computational cost on "easy" samples while maintaining the overall model performance. In this paper, we introduce a general framework, S2DNAS, which can transform various static CNN models to support dynamic inference via neural architecture search. To this end, based on a given CNN model, we first generate a CNN architecture space in which each architecture is a multi-stage CNN generated from the given model using some predefined transformations. Then, we propose a reinforcement learning based approach to automatically search for the optimal CNN architecture in the generated space. At last, with the searched multi-stage network, we can perform dynamic inference by adaptively choosing a stage to evaluate for each sample. Unlike previous works that introduce irregular computations or complex controllers in the inference or re-design a CNN model from scratch, our method can generalize to most of the popular CNN architectures and the searched dynamic network can be directly deployed using existing deep learning frameworks in various hardware devices."



Paperid:58
Authors:Zhihao Hu, Zhenghao Chen, Dong Xu, Guo Lu, Wanli Ouyang, Shuhang Gu
Title: Improving Deep Video Compression by Resolution-adaptive Flow Coding
Abstract:
In learning-based video compression approaches, compressing pixel-level optical flow maps by developing new motion vector (MV) encoders is an essential issue. In this work, we propose a new framework called Resolution-adaptive Flow Coding (RaFC) to effectively compress the flow maps globally and locally, in which we use multi-resolution representations instead of single-resolution representations for both the input flow maps and the output motion features of the MV encoder. To handle complex or simple motion patterns globally, our frame-level scheme RaFC-frame automatically decides the optimal flow map resolution for each video frame. To cope with different types of motion patterns locally, our block-level scheme RaFC-block can also select the optimal resolution for each local block of motion features. In addition, the rate-distortion criterion is applied to both RaFC-frame and RaFC-block to select the optimal motion coding mode for effective flow coding. Comprehensive experiments on four benchmark datasets, HEVC, VTL, UVG and MCL-JCV, clearly demonstrate the effectiveness of our overall RaFC framework combining RaFC-frame and RaFC-block for video compression."



Paperid:59
Authors:Junting Dong, Qing Shuai, Yuanqing Zhang, Xian Liu, Xiaowei Zhou, Hujun Bao
Title: Motion Capture from Internet Videos
Abstract:
Recent advances in image-based human pose estimation make it possible to capture 3D human motion from a single RGB video. However, the inherent depth ambiguity and self-occlusion in a single view prohibit the recovery of as high-quality motion as multi-view reconstruction. While multi-view videos are not common, the videos of a celebrity performing a specific action are usually abundant on the Internet. Even if these videos were recorded at different time instances, they would encode the same motion characteristics of the person. Therefore, we propose to capture human motion by jointly analyzing these Internet videos instead of using single videos separately. However, this new task poses many new challenges that cannot be addressed by existing methods, as the videos are unsynchronized, the camera viewpoints are unknown, the background scenes are different, and the human motions are not exactly the same among videos. To address these challenges, we propose a novel optimization-based framework and experimentally demonstrate its ability to recover much more precise and detailed motion from multiple videos, compared against monocular motion capture methods. "



Paperid:60
Authors:Xinqian Gu, Hong Chang, Bingpeng Ma, Hongkai Zhang, Xilin Chen
Title: Appearance-Preserving 3D Convolution for Video-based Person Re-identification
Abstract:
Due to the imperfect person detection results and posture changes, temporal appearance misalignment is unavoidable in video-based person re-identification (ReID). In this case, 3D convolution may destroy the appearance representation of person video clips, thus it is harmful to ReID. To address this problem, we propose Appearance-Preserving 3D Convolution (AP3D), which is composed of two components: an Appearance-Preserving Module (APM) and a 3D convolution kernel. With APM aligning the adjacent feature maps in pixel level, the following 3D convolution can model temporal information on the premise of maintaining the appearance representation quality. It is easy to combine AP3D with existing 3D ConvNets by simply replacing the original 3D convolution kernels with AP3Ds. Extensive experiments demonstrate the effectiveness of AP3D for video-based ReID and the results on three widely used datasets surpass the state-of-the-arts. Code is available at: https://github.com/guxinqian/AP3D."



Paperid:61
Authors:Dylan Campbell, Liu Liu, Stephen Gould
Title: Solving the Blind Perspective-n-Point Problem End-To-End With Robust Differentiable Geometric Optimization
Abstract:
Blind Perspective-n-Point (PnP) is the problem of estimating the position and orientation of a camera relative to a scene, given 2D image points and 3D scene points, without prior knowledge of the 2D-3D correspondences. Solving for pose and correspondences simultaneously is extremely challenging since the search space is very large. Fortunately it is a coupled problem: the pose can be found easily given the correspondences and vice versa. Existing approaches assume that noisy correspondences are provided, that a good pose prior is available, or that the problem size is small. We instead propose the first fully end-to-end trainable network for solving the blind PnP problem efficiently and globally, that is, without the need for pose priors. We make use of recent results in differentiating optimization problems to incorporate geometric model fitting into an end-to-end learning framework, including Sinkhorn, RANSAC and PnP algorithms. Our proposed approach significantly outperforms other methods on synthetic and real data."
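Among the differentiable components listed, the Sinkhorn step that turns a 2D-3D score matrix into a (near) doubly-stochastic correspondence matrix is the easiest to sketch; the log-domain version below is a generic textbook implementation with our own choice of iteration count and temperature, not the authors' exact module.

```python
import torch

def sinkhorn(scores, n_iters=20, eps=0.1):
    """Alternately normalize rows and columns of exp(scores / eps) in the
    log domain, yielding a soft (near doubly-stochastic) assignment matrix.
    scores: (N, M) pairwise 2D-3D matching scores."""
    log_P = scores / eps
    for _ in range(n_iters):
        log_P = log_P - torch.logsumexp(log_P, dim=1, keepdim=True)  # rows sum to 1
        log_P = log_P - torch.logsumexp(log_P, dim=0, keepdim=True)  # cols sum to 1
    return log_P.exp()
```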



Paperid:62
Authors:Xingang Pan, Xiaohang Zhan, Bo Dai, Dahua Lin, Chen Change Loy, Ping Luo
Title: Exploiting Deep Generative Prior for Versatile Image Restoration and Manipulation
Abstract:
Learning a good image prior is a long-term goal for image restoration and manipulation. While existing methods like deep image prior (DIP) capture low-level image statistics, there are still gaps toward an image prior that captures rich image semantics including color, spatial coherence, textures, and high-level concepts. This work presents an effective way to exploit the image prior captured by a generative adversarial network (GAN) trained on large-scale natural images. As shown in Fig.1, the deep generative prior (DGP) provides compelling results to restore missing semantics, e.g., color, patch, resolution, of various degraded images. It also enables diverse image manipulation including random jittering, image morphing, and category transfer. Such highly flexible effects are made possible by relaxing the assumption of existing GAN-inversion methods, which tend to fix the generator. Notably, we allow the generator to be fine-tuned on-the-fly in a progressive manner, regularized by the feature distance obtained from the discriminator of the GAN. We show that these easy-to-implement and practical changes help keep the reconstruction within the manifold of natural images, and thus lead to more precise and faithful reconstructions of real images. Code is at https://github.com/XingangPan/deep-generative-prior."



Paperid:63
Authors:Mantang Guo, Junhui Hou, Jing Jin, Jie Chen, Lap-Pui Chau
Title: Deep Spatial-angular Regularization for Compressive Light Field Reconstruction over Coded Apertures
Abstract:
Coded aperture is a promising approach for capturing the 4-D light field (LF), in which the 4-D data are compressively modulated into 2-D coded measurements that are further decoded by reconstruction algorithms. The bottleneck lies in the reconstruction algorithms, resulting in rather limited reconstruction quality. To tackle this challenge, we propose a novel learning-based framework for the reconstruction of high-quality LFs from acquisitions via learned coded apertures. The proposed method incorporates the measurement observation into the deep learning framework elegantly to avoid relying entirely on data-driven priors for LF reconstruction. Specifically, we first formulate the compressive LF reconstruction as an inverse problem with an implicit regularization term. Then, we construct the regularization term with an efficient deep spatial-angular convolutional sub-network to comprehensively explore the signal distribution free from the limited representation ability and inefficiency of deterministic mathematical modeling. Experimental results show that the reconstructed LFs not only achieve much higher PSNR/SSIM but also preserve the LF parallax structure better, compared with state-of-the-art methods on both real and synthetic LF benchmarks. In addition, experiments show that our method is efficient and robust to noise, which is an essential advantage for a real camera system. The code is publicly available at https://github.com/angmt2008/LFCA."



Paperid:64
Authors:Xuesong Niu, Zitong Yu, Hu Han, Xiaobai Li, Shiguang Shan, Guoying Zhao
Title: Video-based Remote Physiological Measurement via Cross-verified Feature Disentangling
Abstract:
Remote physiological measurements, e.g., remote photoplethysmography (rPPG) based heart rate (HR), heart rate variability (HRV) and respiration frequency (RF) measurement, are playing increasingly important roles in application scenarios where contact measurement is inconvenient or impossible. Since the amplitude of the physiological signals is usually very small, they can be easily affected by head movements, lighting conditions, and sensor diversities. To address these challenges, we propose a cross-verified feature disentangling strategy to disentangle the physiological features from non-physiological representations such as head movements and lighting conditions, and then use the distilled physiological features for robust multi-task physiological measurement. We first transform the input face videos into a multi-scale spatial-temporal map (MSTmap), which can suppress the irrelevant background and noise features while retaining most of the temporal characteristics of the periodic physiological signals. Then we take pairwise MSTmaps as inputs to an autoencoder architecture with two encoders (one for physiological signals and the other for non-physiological information) and use a cross-verified scheme to obtain physiological features disentangled from the non-physiological features. The disentangled features are finally used for the joint prediction of multiple physiological signals like average HR values and rPPG signals. Comprehensive experiments on different large-scale public datasets for multiple physiological measurement tasks, as well as cross-database testing, demonstrate the robustness of our approach."



Paperid:65
Authors:Bharat Lal Bhatnagar, Cristian Sminchisescu, Christian Theobalt, Gerard Pons-Moll
Title: Combining Implicit Function Learning and Parametric Models for 3D Human Reconstruction
Abstract:
Implicit functions represented as deep learning approximations are powerful for reconstructing 3D surfaces. However, they can only produce static surfaces that are not controllable, which provides limited ability to modify the resulting model by editing its pose or shape parameters. Nevertheless, such features are essential in building flexible models for both computer graphics and computer vision. In this work, we present methodology that combines detail-rich implicit functions and parametric representations in order to reconstruct 3D models of people that remain controllable and accurate even in the presence of clothing. Given sparse 3D point clouds sampled on the surface of a dressed person, we use an Implicit Part Network (IP-Net) to jointly predict the outer 3D surface of the dressed person, the inner body surface, and the semantic correspondences to a parametric body model. We subsequently use the correspondences to fit the body model to our inner surface and then non-rigidly deform it (under a parametric body + displacement model) to the outer surface in order to capture garment, face and hair detail. In quantitative and qualitative experiments with both full body data and hand scans we show that the proposed methodology generalizes, and is effective even given incomplete point clouds collected from single-view depth images. We will release our models and code."



Paperid:66
Authors:Tsai-Shien Chen, Chih-Ting Liu, Chih-Wei Wu, Shao-Yi Chien
Title: Orientation-aware Vehicle Re-identification with Semantics-guided Part Attention Network
Abstract:
Vehicle re-identification (re-ID) focuses on matching images of the same vehicle across different cameras. It is fundamentally challenging because differences between vehicles are sometimes subtle. While several studies incorporate spatial-attention mechanisms to help vehicle re-ID, they often require expensive keypoint labels or suffer from noisy attention if not trained with expensive labels. In this work, we propose a dedicated Semantics-guided Part Attention Network (SPAN) to robustly predict part attention masks for different views of vehicles given only image-level semantic labels during training. With the help of part attention masks, we can extract discriminative features in each part separately. Then we introduce Co-occurrence Part-attentive Distance Metric (CPDM) which places greater emphasis on co-occurrence vehicle parts when evaluating the feature distance of two images. Extensive experiments validate the effectiveness of the proposed method and show that our framework outperforms the state-of-the-art approaches."



Paperid:67
Authors:Guolei Sun, Wenguan Wang, Jifeng Dai, Luc Van Gool
Title: Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation
Abstract:
This paper studies the problem of learning semantic segmentation from image-level supervision only. Current popular solutions leverage object localization maps from classifiers as supervision signals, and struggle to make the localization maps capture more complete object content. Rather than previous efforts that primarily focus on intra-image information, we address the value of cross-image semantic relations for comprehensive object pattern mining. To achieve this, two neural co-attentions are incorporated into the classifier to complementarily capture cross-image semantic similarities and differences. In particular, given a pair of training images, one co-attention enforces the classifier to recognize the common semantics from co-attentive objects, while the other one, called contrastive co-attention, drives the classifier to identify the unshared semantics from the remaining, uncommon objects. This helps the classifier discover more object patterns and better ground semantics in image regions. In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference, hence eventually benefiting semantic segmentation learning. More essentially, our algorithm provides a unified framework that handles well different WSSS settings, i.e., learning WSSS with (1) precise image-level supervision only, (2) extra simple single-label data, and (3) extra noisy web data. It sets new state-of-the-arts on all these settings, demonstrating well its efficacy and generalizability."



Paperid:68
Authors:Stefan Popov, Pablo Bauszat, Vittorio Ferrari
Title: CoReNet: Coherent 3D Scene Reconstruction from a Single RGB Image
Abstract:
Advances in deep learning techniques have allowed recent work to reconstruct the shape of a single object given only one RGB image as input. Building on common encoder-decoder architectures for this task, we propose three extensions: (1) ray-traced skip connections that propagate local 2D information to the output 3D volume in a physically correct manner; (2) a hybrid 3D volume representation that enables building translation equivariant models, while at the same time encoding fine object details without an excessive memory footprint; (3) a reconstruction loss tailored to capture overall object geometry. Furthermore, we adapt our model to address the harder task of reconstructing multiple objects from a single image. We reconstruct all objects jointly in one pass, producing a coherent reconstruction, where all objects live in a single consistent 3D coordinate frame relative to the camera and they do not intersect in 3D space. We also handle occlusions and resolve them by hallucinating the missing object parts in the 3D volume. We validate the impact of our contributions experimentally both on synthetic data from ShapeNet as well as real images from Pix3D. Our method improves over the state-of-the-art single-object methods on both datasets. Finally, we evaluate performance quantitatively on multiple object reconstruction with synthetic scenes assembled from ShapeNet objects."



Paperid:69
Authors:Lei Huang, Jie Qin, Li Liu, Fan Zhu, Ling Shao
Title: Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs
Abstract:
Conditioning analysis uncovers the landscape of an optimization objective by exploring the spectrum of its curvature matrix. This has been well explored theoretically for linear models. We extend this analysis to deep neural networks (DNNs) in order to investigate their learning dynamics. To this end, we propose layer-wise conditioning analysis, which explores the optimization landscape with respect to each layer independently. Such an analysis is theoretically supported under mild assumptions that approximately hold in practice. Based on our analysis, we show that batch normalization (BN) can stabilize the training, but sometimes results in the false impression of a local minimum, which has detrimental effects on learning. Besides, we experimentally observe that BN can improve the layer-wise conditioning of the optimization problem. Finally, we find that the last linear layer of a very deep residual network displays ill-conditioned behavior. We solve this problem by adding only one BN layer before the last linear layer, which achieves improved performance over the original and pre-activation residual networks."



Paperid:70
Authors:Zachary Teed, Jia Deng
Title: RAFT: Recurrent All-Pairs Field Transforms for Optical Flow
Abstract:
We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for estimating optical flow. RAFT extracts per-pixel features, builds multi-scale 4D correlation volumes for all pairs of pixels, and iteratively updates a flow field through a recurrent unit that performs lookups on the correlation volumes. RAFT achieves state-of-the-art performance on both KITTI and Sintel, with strong cross-dataset generalization and high efficiency in inference time, training speed, and parameter count. "
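The all-pairs correlation volume at the heart of this design can be written in a few lines; the sketch below is our own minimal rendering (not the released code) of the 4D volume that the recurrent update operator then indexes with local lookups around the current flow estimate.

```python
import torch

def correlation_volume(fmap1, fmap2):
    """All-pairs correlation between two feature maps of shape (B, D, H, W),
    returned as a (B, H, W, H, W) volume for RAFT-style iterative lookups."""
    B, D, H, W = fmap1.shape
    f1 = fmap1.flatten(2)                                  # (B, D, H*W)
    f2 = fmap2.flatten(2)                                  # (B, D, H*W)
    corr = torch.einsum('bdi,bdj->bij', f1, f2) / D ** 0.5  # normalized dot products
    return corr.view(B, H, W, H, W)
```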



Paperid:71
Authors:Feihu Zhang, Xiaojuan Qi, Ruigang Yang, Victor Prisacariu, Benjamin Wah, Philip Torr
Title: Domain-invariant Stereo Matching Networks
Abstract:
State-of-the-art stereo matching networks have difficulties in generalizing to new unseen environments due to significant domain differences, such as color, illumination, contrast, and texture. In this paper, we aim at designing a domain-invariant stereo matching network (DSMNet) that generalizes well to unseen scenes. To achieve this goal, we propose i) a novel ``""domain normalization"" approach that regularizes the distribution of learned representations to allow them to be invariant to domain differences, and ii) a trainable non-local graph-based filter for extracting robust structural and geometric representations that can further enhance domain-invariant generalizations. When trained on synthetic data and generalized to real test sets, our model performs significantly better than all state-of-the-art models. It even outperforms some deep learning models (e.g. DispNet, Content-CNN and MC-CNN) fine-tuned with test-domain data."



Paperid:72
Authors:Gyeongsik Moon, Takaaki Shiratori, Kyoung Mu Lee
Title: DeepHandMesh: A Weakly-supervised Deep Encoder-Decoder Framework for High-fidelity Hand Mesh Modeling
Abstract:
Human hands play a central role in interacting with other people and objects. For realistic replication of such hand motions, high-fidelity hand meshes have to be reconstructed. In this study, we firstly propose DeepHandMesh, a weakly-supervised deep encoder-decoder framework for high-fidelity hand mesh modeling. We design our system to be trained in an end-to-end and weakly-supervised manner; therefore, it does not require groundtruth meshes. Instead, it relies on weaker supervisions such as 3D joint coordinates and multi-view depth maps, which are easier to get than groundtruth meshes and do not dependent on the mesh topology. Although the proposed DeepHandMesh is trained in a weakly-supervised way, it provides significantly more realistic hand mesh than previous fully-supervised hand models. Our newly introduced penetration avoidance loss further improves results by replicating physical interaction between hand parts. Finally, we demonstrate that our system can also be applied successfully to the 3D hand mesh estimation from general images. Our hand model, dataset, and codes are publicly available ootnote{\url{https://mks0601.github.io/DeepHandMesh/}}."



Paperid:73
Authors:Guo Lu, Chunlei Cai, Xiaoyun Zhang, Li Chen, Wanli Ouyang, Dong Xu , Zhiyong Gao
Title: Content Adaptive and Error Propagation Aware Deep Video Compression
Abstract:
Recently, learning based video compression methods attract increasing attention. However, previous works suffer from error propagation, which stems from the accumulation of reconstructed error in inter predictive coding. Meanwhile, previous learning based video codecs are also not adaptive to different video contents. To address these two problems, we propose a content adaptive and error propagation aware video compression system. Specifically, our method employs a joint training strategy by considering the compression performance of multiple consecutive frames instead of a single frame. Based on the learned long-term temporal information, our approach effectively alleviates error propagation in reconstructed frames. More importantly, instead of using the hand-crafted coding modes in the traditional compression systems, we design an online encoder updating scheme for the learned video compression. The proposed approach updates the parameters for encoder according to the rate-distortion criterion but keeps the decoder unchanged in the inference stage. Therefore, the encoder is adaptive to different video contents and achieves better compression efficiency by reducing the domain gap between the training and testing datasets. Our method is simple yet effective and outperforms the state-of-the-art learning based video codec on benchmark datasets without increasing the model size or decreasing the decoding speed."



Paperid:74
Authors:Mengtian Li, Yu-Xiong Wang, Deva Ramanan
Title: Towards Streaming Perception
Abstract:
Embodied perception refers to the ability of an autonomous agent to perceive its environment so that it can (re)act. The responsiveness of the agent is largely governed by the latency of its processing pipeline. While past work has studied the algorithmic trade-off between latency and accuracy, there has not been a clear metric to compare different methods along the Pareto optimal latency-accuracy curve. We point out a discrepancy between standard offline evaluation and real-time applications: by the time an algorithm finishes processing a particular image frame, the surrounding world has changed. To these ends, we present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception, which we refer to as "streaming accuracy". The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant, forcing the stack to consider the amount of streaming data that should be ignored while computation is occurring. More broadly, building upon this metric, we introduce a meta-benchmark that systematically converts any image understanding task into a streaming perception task. We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations. Our proposed solutions and their empirical analysis demonstrate a number of surprising conclusions: (1) there exists an optimal "sweet spot" that maximizes streaming accuracy along the Pareto optimal latency-accuracy curve, (2) asynchronous tracking and future forecasting naturally emerge as internal representations that enable streaming image understanding, and (3) dynamic scheduling can be used to overcome temporal aliasing, yielding the paradoxical result that latency is sometimes minimized by sitting idle and "doing nothing"."
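The pairing rule behind such streaming evaluation can be pictured as scoring every ground-truth instant against whatever prediction had already finished by that time. The snippet below is our own simplified illustration of that core idea, with hypothetical variable names, not the benchmark's official toolkit.

```python
import bisect

def streaming_pairs(pred_finish_times, preds, gt_times):
    """For each ground-truth timestamp, return the latest prediction whose
    processing *finished* before that instant (None if nothing is ready yet).
    pred_finish_times must be sorted in increasing order."""
    paired = []
    for t in gt_times:
        i = bisect.bisect_right(pred_finish_times, t) - 1  # last finished prediction
        paired.append(preds[i] if i >= 0 else None)
    return paired
```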



Paperid:75
Authors:Rakshith Shetty, Mario Fritz, Bernt Schiele
Title: Towards Automated Testing and Robustification by Semantic Adversarial Data Generation
Abstract:
Widespread application of computer vision systems to real-world tasks is currently hindered by their unexpected behavior on unseen examples. This occurs due to the limitations of empirical testing on finite test sets and the lack of systematic methods to identify the breaking points of a trained model. In this work we propose semantic adversarial editing, a method to synthesize plausible but difficult data points on which our target model breaks down. We achieve this with a differentiable object synthesizer which can change an object's appearance while retaining its pose. Constrained adversarial optimization of object appearance through this synthesizer produces rare/difficult versions of an object which fool the target object detector. Experiments show that our approach effectively synthesizes difficult test data, dropping the performance of the YoloV3 detector by more than 20 mAP points by changing the appearance of a single object and discovering failure modes of the model. The generated semantic adversarial data can also be used to robustify the detector through data augmentation, consistently improving its performance in both standard and out-of-dataset-distribution test sets, across three different datasets."



Paperid:76
Authors:AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo
Title: Adversarial Generative Grammars for Human Activity Prediction
Abstract:
In this paper we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies, providing a capability to forecast multiple, distinct future activities. Our adversarial grammar is designed so that it can learn stochastic production rules from the data distribution, jointly with its latent non-terminal representations. Being able to select multiple production rules during inference leads to different predicted outcomes, thus efficiently modeling many plausible futures. The adversarial generative grammar is evaluated on the Charades, MultiTHUMOS, Human3.6M, and 50 Salads datasets and on two activity prediction tasks: future 3D human pose prediction and future activity prediction. The proposed adversarial grammar outperforms the state-of-the-art approaches, being able to predict much more accurately and further into the future than prior work. Code will be open sourced."



Paperid:77
Authors:Ameya Prabhu, Philip H. S. Torr, Puneet K. Dokania
Title: GDumb: A Simple Approach that Questions Our Progress in Continual Learning
Abstract:
We discuss a general formulation for the Continual Learning (CL) problem for classification---a learning task where a stream provides samples to a learner and the goal of the learner, depending on the samples it receives, is to continually upgrade its knowledge about the old classes and learn new ones. Our formulation takes inspiration from the open-set recognition problem where test scenarios do not necessarily belong to the training distribution. We also discuss various quirks and assumptions encoded in recently proposed approaches for CL. We argue that some oversimplify the problem to an extent that leaves it with very little practical importance, and makes it extremely easy to perform well on. To validate this, we propose GDumb that (1) greedily stores samples in memory as they come and; (2) at test time, trains a model from scratch using samples only in the memory. We show that even though GDumb is not specifically designed for CL problems, it obtains state-of-the-art accuracies (often with large margins) in almost all the experiments when compared to a multitude of recently proposed algorithms. Surprisingly, it outperforms approaches in CL formulations for which they were specifically designed. This, we believe, raises concerns regarding our progress in CL for classification. Overall, we hope our formulation, characterizations and discussions will help in designing realistically useful CL algorithms, and GDumb will serve as a strong contender for the same."
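A compact sketch of the greedy, class-balanced memory is shown below (a learner would then be trained from scratch on `dataset()` at test time). The class and method names are ours, and the tie-breaking details are simplified relative to the released implementation.

```python
import random
from collections import defaultdict

class GreedyBalancedMemory:
    """Greedily maintain a class-balanced memory of at most `capacity`
    samples as the stream arrives (illustrative sketch of the sampler)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buckets = defaultdict(list)   # class id -> stored samples

    def _evict_from_largest(self):
        largest = max(self.buckets, key=lambda c: len(self.buckets[c]))
        self.buckets[largest].pop(random.randrange(len(self.buckets[largest])))

    def add(self, x, y):
        n_stored = sum(len(b) for b in self.buckets.values())
        if n_stored < self.capacity or y not in self.buckets:
            # Room left, or a brand-new class: make space if necessary, then store.
            if n_stored >= self.capacity:
                self._evict_from_largest()
            self.buckets[y].append(x)
        elif len(self.buckets[y]) < max(len(b) for b in self.buckets.values()):
            # Known but under-represented class: swap in at the expense of the largest class.
            self._evict_from_largest()
            self.buckets[y].append(x)

    def dataset(self):
        return [(x, y) for y, xs in self.buckets.items() for x in xs]
```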



Paperid:78
Authors:Ming Liang, Bin Yang, Rui Hu, Yun Chen, Renjie Liao, Song Feng, Raquel Urtasun
Title: Learning Lane Graph Representations for Motion Forecasting
Abstract:
We propose a motion forecasting model that exploits a novel structured map representation as well as actor-map interactions. Instead of encoding vectorized maps as raster images, we construct a lane graph from raw map data to explicitly preserve the map structure. To capture the complex topology and long range dependencies of the lane graph, we propose LaneGCN which extends graph convolutions with multiple adjacency matrices and along-lane dilation. To capture the complex interactions between actors and maps, we exploit a fusion network consisting of four types of interactions, actor-to-lane, lane-to-lane, lane-to-actor and actor-to-actor. Powered by LaneGCN and actor-map interactions, our model is able to predict accurate and realistic multi-modal trajectories. Our approach significantly outperforms the state-of-the-art on the large scale Argoverse motion forecasting benchmark."
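A bare-bones graph convolution with multiple adjacency matrices (one per relation type or along-lane dilation) might look like the following; this is our own illustrative sketch, not the LaneGCN implementation.

```python
import torch
import torch.nn as nn

class MultiAdjGraphConv(nn.Module):
    """One linear map per relation/adjacency (e.g. predecessor and successor
    at several along-lane dilations), aggregated by summation."""
    def __init__(self, dim, n_relations):
        super().__init__()
        self.self_fc = nn.Linear(dim, dim)
        self.rel_fc = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_relations)])

    def forward(self, x, adjs):
        # x: (N, dim) lane-node features; adjs: list of (N, N) adjacency matrices,
        # one per relation type / dilation.
        out = self.self_fc(x)
        for A, fc in zip(adjs, self.rel_fc):
            out = out + A @ fc(x)      # propagate along each relation separately
        return torch.relu(out)
```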



Paperid:79
Authors:Rico Jonschkowski, Austin Stone, Jonathan T. Barron, Ariel Gordon, Kurt Konolige, Anelia Angelova
Title: What Matters in Unsupervised Optical Flow
Abstract:
We systematically compare and analyze a set of key components in unsupervised optical flow to identify which photometric loss, occlusion handling, and smoothness regularization is most effective. Alongside this investigation we construct a number of novel improvements to unsupervised flow models, such as cost volume normalization, stopping the gradient at the occlusion mask, encouraging smoothness before upsampling the flow field, and continual self-supervision with image resizing. By combining the results of our investigation with our improved model components, we are able to present a new unsupervised flow technique that significantly outperforms the previous unsupervised state-of-the-art and performs on par with supervised FlowNet2 on the KITTI 2015 dataset, while also being significantly simpler than related approaches."
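For concreteness, a masked Charbonnier-style photometric loss with the gradient stopped at the occlusion mask (one of the ingredients examined) could be sketched as below; the exact penalty and normalization are our assumptions, not the paper's final recipe.

```python
import torch

def photometric_loss(img1, img2_warped, occlusion_mask, eps=1e-6):
    """Photometric loss between a frame and the other frame warped by the
    estimated flow, evaluated only on non-occluded pixels.
    img1, img2_warped: (B, C, H, W); occlusion_mask: (B, 1, H, W), 1 = visible."""
    mask = occlusion_mask.detach()                      # stop-gradient at the mask
    diff = torch.sqrt((img1 - img2_warped) ** 2 + eps)  # robust Charbonnier penalty
    return (diff * mask).sum() / (mask.sum() * img1.shape[1] + eps)
```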



Paperid:80
Authors:Xiaowei Zhang, Christopher May, Daniel Aliaga
Title: Synthesis and Completion of Facades from Satellite Imagery
Abstract:
Automatic satellite-based reconstruction enables large and widespread creation of urban areas. However, satellite imagery is often noisy and incomplete, and is not suitable for reconstructing detailed building facades. We present a machine learning-based inverse procedural modeling method to automatically create synthetic facades from satellite imagery. Our key observation is that building facades exhibit regular, grid-like structures. Hence, we can overcome the low-resolution, noisy, and partial building data obtained from satellite imagery by synthesizing the underlying facade layout. Our method infers regular facade details from satellite-based image-fragments of a building, and applies them to occluded or under-sampled parts of the building, resulting in plausible, crisp facades. Using urban areas from six cities, we compare our approach to several state-of-the-art image completion/in-filling methods and our approach consistently creates better facade images."



Paperid:81
Authors:Manuel López Antequera, Pau Gargallo, Markus Hofinger, Samuel Rota Bulò, Yubin Kuang, Peter Kontschieder
Title: Mapillary Planet-Scale Depth Dataset
Abstract:
Learning-based methods produce remarkable results on single image depth tasks when trained on well-established benchmarks, however, there is a large gap from these benchmarks to real-world performance that is usually obscured by the common practice of fine-tuning on the target dataset. We introduce a new depth dataset that is an order of magnitude larger than previous offerings, but more importantly, contains an unprecedented gamut of locations, camera models and scene types while offering metric depth (not just up-to-scale). Additionally, we investigate the problem of training single image depth networks using images captured with many different cameras, validating an existing approach and proposing a simpler alternative. With our contributions we achieve excellent results on challenging benchmarks before fine-tuning, and set the state of the art on the popular KITTI dataset after fine-tuning."



Paperid:82
Authors:Tsun-Hsuan Wang, Sivabalan Manivasagam, Ming Liang, Bin Yang, Wenyuan Zeng, Raquel Urtasun
Title: V2VNet: Vehicle-to-Vehicle Communication for Joint Perception and Prediction
Abstract:
In this paper, we explore the use of vehicle-to-vehicle (V2V) communication to improve the perception and motion forecasting performance of self-driving vehicles. By intelligently aggregating the information received from multiple nearby vehicles, we can observe the same scene from different viewpoints. This allows us to see through occlusions and detect actors at long range, where the observations are very sparse or non-existent. We also show that our approach of sending compressed deep feature map activations achieves high accuracy while satisfying communication bandwidth requirements."



Paperid:83
Authors:Haoyu Liang, Zhihao Ouyang, Yuyuan Zeng, Hang Su, Zihao He, Shu-Tao Xia, Jun Zhu, Bo Zhang
Title: Training Interpretable Convolutional Neural Networks by Differentiating Class-specific Filters
Abstract:
Convolutional neural networks (CNNs) have been successfully used in a range of tasks. However, CNNs are often viewed as "black boxes" and lack interpretability. One main reason is the filter-class entanglement -- an intricate many-to-many correspondence between filters and classes. Most existing works attempt post-hoc interpretation on a pre-trained model, while neglecting to reduce the entanglement underlying the model. In contrast, we focus on alleviating filter-class entanglement during training. Inspired by cellular differentiation, we propose a novel strategy to train interpretable CNNs by encouraging class-specific filters, among which each filter responds to only one (or few) class. Concretely, we design a learnable sparse Class-Specific Gate (CSG) structure to assign each filter to one (or few) class in a flexible way. The gate allows a filter's activation to pass only when the input samples come from the specific class. Extensive experiments demonstrate the superior performance of our method in generating a sparse and highly class-related representation of the input, which leads to stronger interpretability. Moreover, compared with the standard training strategy, our model displays benefits in applications like object localization and adversarial sample detection. Code: https://github.com/hyliang96/CSGCNN."



Paperid:84
Authors:Bailin Li, Bowen Wu, Jiang Su, Guangrun Wang
Title: EagleEye: Fast Sub-net Evaluation for Efficient Neural Network Pruning
Abstract:
Identifying the computationally redundant part of a trained Deep Neural Network (DNN) is the key question that pruning algorithms target. Many algorithms try to predict the performance of pruned sub-nets by introducing various evaluation methods, but these are either inaccurate or too complicated for general application. In this work, we present a pruning method called EagleEye, in which a simple yet efficient evaluation component based on adaptive batch normalization is applied to unveil a strong correlation between different pruned DNN structures and their final settled accuracy. This strong correlation allows us to quickly identify the pruned candidates with the highest potential accuracy without actually fine-tuning them. This module is also general and can be plugged into existing pruning algorithms to improve them. EagleEye achieves better pruning performance than all of the studied pruning algorithms in our experiments. Concretely, to prune MobileNet V1 and ResNet-50, EagleEye outperforms all compared methods by up to 3.8%. Even in the more challenging experiment of pruning the compact MobileNet V1 model, EagleEye achieves the highest accuracy of 70.9% with an overall 50% of operations (FLOPs) pruned. All accuracy results are Top-1 ImageNet classification accuracy. Source code and models are available to the open-source community at https://github.com/anonymous47823493/EagleEye."
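The adaptive-BN evaluation can be pictured as re-estimating BatchNorm statistics on a handful of training batches, with no weight updates, and then scoring the pruned sub-net; the sketch below is our own minimal rendering with hypothetical loader arguments, not the released evaluation code.

```python
import torch

@torch.no_grad()
def adaptive_bn_eval(subnet, calib_loader, eval_loader, n_calib_batches=50):
    """Re-collect BatchNorm running statistics for a pruned sub-net on a few
    batches (forward passes only), then measure accuracy to rank candidates."""
    for m in subnet.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            m.reset_running_stats()
    subnet.train()                         # train mode so running stats update
    for i, (x, _) in enumerate(calib_loader):
        if i >= n_calib_batches:
            break
        subnet(x)                          # no gradients, no weight updates
    subnet.eval()                          # evaluate with the adapted statistics
    correct = total = 0
    for x, y in eval_loader:
        correct += (subnet(x).argmax(1) == y).sum().item()
        total += y.numel()
    return correct / total
```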



Paperid:85
Authors:Marie-Julie Rakotosaona, Maks Ovsjanikov
Title: Intrinsic Point Cloud Interpolation via Dual Latent Space Navigation
Abstract:
We present a learning-based method for interpolating and manipulating 3D shapes represented as point clouds, that is explicitly designed to preserve intrinsic shape properties. Our approach is based on constructing a dual encoding space that enables shape synthesis and, at the same time, provides links to the intrinsic shape metric, which is typically not available on point cloud data. Our method works in a single pass and avoids expensive optimization, employed by existing techniques. Furthermore, the strong regularization provided by our dual latent space approach also helps to improve shape recovery in challenging settings from noisy point clouds across datasets. Extensive experiments show that our method results in more realistic and smoother interpolations compared to baselines."



Paperid:86
Authors:Oren Katzir, Dani Lischinski, Daniel Cohen-Or
Title: Cross-Domain Cascaded Deep Translation
Abstract:
In recent years we have witnessed tremendous progress in unpaired image-to-image translation, propelled by the emergence of DNNs and adversarial training strategies. However, most existing methods focus on transfer of style and appearance, rather than on shape translation. The latter task is challenging, due to its intricate non-local nature, which calls for additional supervision. We mitigate this by descending the deep layers of a pre-trained network, where the deep features contain more semantics, and applying the translation between these deep features. Our translation is performed in a cascaded, deep-to-shallow, fashion, along the deep feature hierarchy: we first translate between the deepest layers that encode the higher-level semantic content of the image, proceeding to translate the shallower layers, conditioned on the deeper ones. We further demonstrate the effectiveness of using pre-trained deep features in the context of unconditioned image generation. "



Paperid:87
Authors:Tatsuro Koizumi, William A. P. Smith
Title: “Look Ma, no landmarks!” – Unsupervised, Model-based Dense Face Alignment
Abstract:
no landmarks!"" - Unsupervised, model-based dense face alignment","In this paper, we show how to train an image-to-image network to predict dense correspondence between a face image and a 3D morphable model using only the model for supervision. We show that both geometric parameters (shape, pose and camera intrinsics) and photometric parameters (texture and lighting) can be inferred directly from the correspondence map using linear least squares and our novel inverse spherical harmonic lighting model. The least squares residuals provide an unsupervised training signal that allows us to avoid artefacts common in the literature such as shrinking and conservative underfitting. Our approach uses a network that is 10$ imes$ smaller than parameter regression networks, significantly reduces sensitivity to image alignment and allows known camera calibration or multi-image constraints to be incorporated during inference. We achieve results competitive with state-of-the-art but without any auxiliary supervision used by previous methods."



Paperid:88
Authors:Rémi Pautrat, Viktor Larsson, Martin R. Oswald, Marc Pollefeys
Title: Online Invariance Selection for Local Feature Descriptors
Abstract:
To be invariant, or not to be invariant: that is the question formulated in this work about local descriptors. A limitation of current feature descriptors is the trade-off between generalization and discriminative power: more invariance means less informative descriptors. We propose to overcome this limitation with a disentanglement of invariance in local descriptors and with an online selection of the most appropriate invariance given the context. Our framework consists in a joint learning of multiple local descriptors with different levels of invariance and of meta descriptors encoding the regional variations of an image. The similarity of these meta descriptors across images is used to select the right invariance when matching the local descriptors. Our approach, named Local Invariance Selection at Runtime for Descriptors (LISRD), enables descriptors to adapt to adverse changes in images, while remaining discriminative when invariance is not required. We demonstrate that our method can boost the performance of current descriptors and outperforms state-of-the-art descriptors in several matching tasks, when evaluated on challenging datasets with day-night illumination as well as viewpoint changes."



Paperid:89
Authors:Hongyu Liu, Bin Jiang, Yibing Song, Wei Huang, Chao Yang
Title: Rethinking Image Inpainting via a Mutual Encoder-Decoder with Feature Equalizations
Abstract:
Deep encoder-decoder based CNNs have advanced image inpainting methods for hole filling. While existing methods recover structures and textures step-by-step in the hole regions, they typically use two encoder-decoders for separate recovery. The CNN features of each encoder are learned to capture either missing structures or textures without considering them as a whole. The insufficient utilization of these encoder features hampers the performance of recovering both structures and textures. In this paper, we propose a mutual encoder-decoder CNN for joint recovery of both. We use CNN features from the deep and shallow layers of the encoder to represent structures and textures of an input image, respectively. The deep layer features are sent to a structure branch, while the shallow layer features are sent to a texture branch. In each branch, we fill holes in multiple scales of the CNN features. The filled CNN features from both branches are concatenated and then equalized. During feature equalization, we first reweight channel attentions and then propose a bilateral propagation activation function to enable spatial equalization. In this way, the filled CNN features of structure and texture mutually benefit each other to represent image content at all feature levels. We then use the equalized features to supplement decoder features for output image generation through skip connections. Experiments on benchmark datasets show that the proposed method is effective in recovering structures and textures and performs favorably against state-of-the-art approaches."



Paperid:90
Authors:Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh
Title: TextCaps: a Dataset for Image Captioning with Reading Comprehension
Abstract:
Image descriptions can help visually impaired people to quickly understand the image content. While we made significant progress in automatically describing images and optical character recognition, current approaches are unable to include written text in their descriptions, although text is omnipresent in human environments and frequently critical to understand our surroundings. To study how to comprehend text in the context of an image we collect a novel dataset, TextCaps, with 145k captions for 28k images. Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase, requiring spatial, semantic, and visual reasoning between multiple text tokens and visual entities, such as objects. We study baselines and adapt existing approaches to this new task, which we refer to as image captioning with reading comprehension. Our analysis with automatic and human studies shows that our new TextCaps dataset provides many new technical challenges over previous datasets."



Paperid:91
Authors:Karttikeya Mangalam, Harshayu Girase, Shreyas Agarwal, Kuan-Hui Lee, Ehsan Adeli, Jitendra Malik, Adrien Gaidon
Title: It is not the Journey but the Destination: Endpoint Conditioned Trajectory Prediction
Abstract:
Human trajectory forecasting with multiple socially interacting agents is of critical importance for autonomous navigation in human environments, e.g., for self-driving cars and social robots. In this work, we present Predicted Endpoint Conditioned Network (PECNet) for flexible human trajectory prediction. PECNet infers distant trajectory endpoints to assist in long-range multi-modal trajectory prediction. A novel non-local social pooling layer enables PECNet to infer diverse yet socially compliant trajectories. Additionally, we present a simple “truncation-trick” for improving few-shot multi-modal trajectory prediction performance. We show that PECNet improves state-of-the-art performance on the Stanford Drone trajectory prediction benchmark by ~20.9% and on the ETH/UCY benchmark by ~40.8%. Code available at the home page: https://karttikeya.github.io/publication/htf/"



Paperid:92
Authors:Goutam Bhat, Felix Järemo Lawin, Martin Danelljan, Andreas Robinson, Michael Felsberg, Luc Van Gool, Radu Timofte
Title: Learning What to Learn for Video Object Segmentation
Abstract:
Video object segmentation (VOS) is a highly challenging problem, since the target object is only defined by a first-frame reference mask during inference. The problem of how to capture and utilize this limited information to accurately segment the target remains a fundamental research question. We address this by introducing an end-to-end trainable VOS architecture that integrates a differentiable few-shot learner. Our learner is designed to predict a powerful parametric model of the target by minimizing a segmentation error in the first frame. We further go beyond the standard few-shot learning paradigm by learning what our target model should learn in order to maximize segmentation accuracy. We perform extensive experiments on standard benchmarks. Our approach sets a new state-of-the-art on the large-scale YouTube-VOS 2018 dataset by achieving an overall score of 81.5, corresponding to a 2.6% relative improvement over the previous best result. The code and models are available at https://github.com/visionml/pytracking."



Paperid:93
Authors:Garvita Tiwari, Bharat Lal Bhatnagar, Tony Tung, Gerard Pons-Moll
Title: SIZER: A Dataset and Model for Parsing 3D Clothing and Learning Size Sensitive 3D Clothing
Abstract:
While models of 3D clothing learned from real data exist, no method can predict clothing deformation as a function of garment size. In this paper, we introduce SizerNet to predict 3D clothing conditioned on human body shape and garment size parameters, and ParserNet to infer garment meshes and shape under clothing with personal details in a single pass from an input mesh. SizerNet allows us to estimate and visualize the dressing effect of a garment in various sizes, and ParserNet allows us to edit the clothing of an input mesh directly, removing the need for scan segmentation, which is a challenging problem in itself. To learn these models, we introduce the SIZER dataset of clothing size variation, which includes 100 different subjects wearing casual clothing items in various sizes, totaling approximately 2000 scans. This dataset includes the scans, registrations to the SMPL model, scans segmented into clothing parts, and garment category and size labels. Our experiments show better parsing accuracy and size prediction than baseline methods trained on SIZER. The code, model and dataset will be released for research purposes at: https://virtualhumans.mpi-inf.mpg.de/sizer/."



Paperid:94
Authors:Luca Cosmo, Antonio Norelli, Oshri Halimi, Ron Kimmel, Emanuele Rodolà
Title: LIMP: Learning Latent Shape Representations with Metric Preservation Priors
Abstract:
In this paper, we advocate the adoption of metric preservation as a powerful prior for learning latent representations of deformable 3D shapes. Key to our construction is the introduction of a geometric distortion criterion, defined directly on the decoded shapes, translating the preservation of the metric on the decoding to the formation of linear paths in the underlying latent space. Our rationale lies in the observation that training samples alone are often insufficient to endow generative models with high fidelity, motivating the need for large training datasets. In contrast, metric preservation provides a rigorous way to control the amount of geometric distortion incurred in the construction of the latent space, leading in turn to synthetic samples of higher quality. We further demonstrate, for the first time, the adoption of differentiable intrinsic distances in the backpropagation of a geodesic loss. Our geometric priors are particularly relevant in the presence of scarce training data, where learning any meaningful latent structure can be especially challenging. The effectiveness and potential of our generative model is showcased in applications of style transfer and content generation using triangle meshes and point clouds."



Paperid:95
Authors:Runtao Liu, Qian Yu, Stella X. Yu
Title: Unsupervised Sketch to Photo Synthesis
Abstract:
Humans can envision a realistic photo given a free-hand sketch that is not only spatially imprecise and geometrically distorted but also without colors and visual details. We study unsupervised sketch to photo synthesis for the first time, learning from unpaired sketch and photo data where the target photo for a sketch is unknown during training. Existing works only deal with either style difference or spatial deformation alone, synthesizing photos from edge-aligned line drawings or transforming shapes within the same modality, e.g., color images. Our insight is to decompose the unsupervised sketch to photo synthesis task into two stages of translation: First shape translation from sketches to grayscale photos and then content enrichment from grayscale to color photos. We also incorporate a self-supervised denoising objective and an attention module to handle abstraction and style variations that are specific to sketches. Our synthesis is sketch-faithful and photo-realistic, enabling sketch-based image retrieval and automatic sketch generation that captures human visual perception beyond the edge map of a photo."



Paperid:96
Authors:Evgenia Rusak, Lukas Schott, Roland S. Zimmermann, Julian Bitterwolf , Oliver Bringmann, Matthias Bethge, Wieland Brendel
Title: A Simple Way to Make Neural Networks Robust Against Diverse Image Corruptions
Abstract:
The human visual system is remarkably robust against a wide range of naturally occurring variations and corruptions like rain or snow. In contrast, the performance of modern image recognition models strongly degrades when evaluated on previously unseen corruptions. Here, we demonstrate that a simple but properly tuned training with additive Gaussian and Speckle noise generalizes surprisingly well to unseen corruptions, easily reaching the state of the art on the corruption benchmark ImageNet-C (with ResNet50) and on MNIST-C. We build on top of these strong baseline results and show that an adversarial training of the recognition model against locally correlated worst-case noise distributions leads to an additional increase in performance. This regularization can be combined with previously proposed defense methods for further improvement."
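The training-time corruption described above can be pictured with a minimal sketch; the noise levels below are illustrative placeholders rather than the tuned values the authors report, and images are assumed to be float tensors in [0, 1].

```python
# Minimal sketch of additive Gaussian / speckle noise augmentation (illustrative values).
import torch

def gaussian_noise(img, sigma=0.1):
    # Additive noise: x + n, with n ~ N(0, sigma^2).
    return (img + sigma * torch.randn_like(img)).clamp(0.0, 1.0)

def speckle_noise(img, sigma=0.2):
    # Speckle noise is multiplicative: x + x * n, with n ~ N(0, sigma^2).
    return (img + img * sigma * torch.randn_like(img)).clamp(0.0, 1.0)

def augment(img):
    # Randomly apply one of the two corruptions (or none) per sample.
    r = torch.rand(1).item()
    if r < 1 / 3:
        return gaussian_noise(img)
    if r < 2 / 3:
        return speckle_noise(img)
    return img
```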



Paperid:97
Authors:Yida Wang, David Joseph Tan, Nassir Navab, Federico Tombari
Title: SoftPoolNet: Shape Descriptor for Point Cloud Completion and Classification
Abstract:
Point clouds are often the default choice for many applications as they exhibit more flexibility and efficiency than volumetric data. Nevertheless, their unorganized nature - points are stored in an unordered way - makes them less suited to be processed by deep learning pipelines. In this paper, we propose a method for 3D object completion and classification based on point clouds. We introduce a new way of organizing the extracted features based on their activations, which we name soft pooling. For the decoder stage, we propose regional convolutions, a novel operator aimed at maximizing the global activation entropy. Furthermore, inspired by the local refining procedure in Point Completion Network (PCN), we also propose a patch-deforming operation to simulate deconvolutional operations for point clouds. This paper proves that our regional activation can be incorporated in many point cloud architectures like AtlasNet and PCN, leading to better performance for geometric completion. We evaluate our approach on different 3D tasks such as object completion and classification, achieving state-of-the-art accuracy."



Paperid:98
Authors:Peipei Li, Huaibo Huang, Yibo Hu, Xiang Wu, Ran He, Zhenan Sun
Title: Hierarchical Face Aging through Disentangled Latent Characteristics
Abstract:
Current age datasets follow a long-tailed distribution, which makes it difficult to describe the aging mechanism for the imbalanced ages. To alleviate this, we design a novel facial age prior to guide the modeling of the aging mechanism. To explore age effects on facial images, we propose a Disentangled Adversarial Autoencoder (DAAE) to disentangle facial images into three independent factors: age, identity and extraneous information. To avoid the "wash away" of age and identity information in the face aging process, we propose a hierarchical conditional generator that passes the disentangled identity and age embeddings to the high-level and low-level layers with class-conditional BatchNorm. Finally, a disentangled adversarial learning mechanism is introduced to boost image quality for face aging. In this way, when manipulating the age distribution, DAAE can achieve face aging with arbitrary ages. Further, given an input face image, the mean value of the learned age posterior distribution can be treated as an age estimate. These results indicate that DAAE can efficiently and accurately estimate the age distribution in a disentangled manner. DAAE is the first attempt to address facial age analysis tasks, including face aging with arbitrary ages, exemplar-based face aging and age estimation, in a universal framework. Qualitative and quantitative experiments demonstrate the superiority of DAAE on five popular datasets: CACD2000, Morph, UTKFace, FG-NET and AgeDB."
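The class-conditional BatchNorm mentioned above can be sketched as follows. Conditioning on a discrete label through embedding tables is an assumption of this sketch; the actual DAAE layers may instead condition on continuous age/identity embeddings.

```python
# Minimal sketch of class-conditional BatchNorm (conditioning scheme is an assumption).
import torch
import torch.nn as nn

class ClassConditionalBN2d(nn.Module):
    def __init__(self, num_features, num_classes):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)  # normalization without affine params
        self.gamma = nn.Embedding(num_classes, num_features)  # per-class scale
        self.beta = nn.Embedding(num_classes, num_features)   # per-class shift
        nn.init.ones_(self.gamma.weight)
        nn.init.zeros_(self.beta.weight)

    def forward(self, x, y):
        # x: (N, C, H, W) feature map, y: (N,) integer condition labels.
        out = self.bn(x)
        gamma = self.gamma(y).unsqueeze(-1).unsqueeze(-1)
        beta = self.beta(y).unsqueeze(-1).unsqueeze(-1)
        return gamma * out + beta

# x = torch.randn(4, 64, 16, 16); y = torch.randint(0, 10, (4,))
# print(ClassConditionalBN2d(64, 10)(x, y).shape)  # torch.Size([4, 64, 16, 16])
```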



Paperid:99
Authors:Hongjie Zhang, Ang Li, Jie Guo, Yanwen Guo
Title: Hybrid Models for Open Set Recognition
Abstract:
Open set recognition requires a classifier to detect samples not belonging to any of the classes in its training set. Existing methods fit a probability distribution to the training samples on their embedding space and detect outliers according to this distribution. The embedding space is often obtained from a discriminative classifier. However, such discriminative representation focuses only on known classes, which may not be critical for distinguishing the unknown classes. We argue that the representation space should be jointly learned from the inlier classifier and the density estimator (served as an outlier detector). We propose the OpenHybrid framework, which is composed of an encoder to encode the input data into a joint embedding space, a classifier to classify samples to inlier classes, and a flow-based density estimator to detect whether a sample belongs to the unknown category. A typical problem of existing flow-based models is that they may assign a higher likelihood to outliers. However, we empirically observe that such an issue does not occur in our experiments when learning a joint representation for discriminative and generative components. Experiments on standard open set benchmarks also reveal that an end-to-end trained OpenHybrid model significantly outperforms state-of-the-art methods and flow-based baselines."



Paperid:100
Authors:Fan Wang, Huidong Liu, Dimitris Samaras, Chao Chen
Title: TopoGAN: A Topology-Aware Generative Adversarial Network
Abstract:
Existing generative adversarial networks (GANs) focus on generating realistic images based on CNN-derived image features, but fail to preserve the structural properties of real images. This can be fatal in applications where the underlying structure (e.g., neurons, vessels, membranes, and road networks) of the image carries crucial semantic meaning. In this paper, we propose a novel GAN model that learns the topology of real images, i.e., connectedness and loopy-ness. In particular, we introduce a new loss that bridges the gap between synthetic image distribution and real image distribution in the topological feature space. By optimizing this loss, the generator produces images with the same structural topology as real images. We also propose new GAN evaluation metrics that measure the topological realism of the synthetic images. We show in experiments that our method generates synthetic images with realistic topology. We also highlight the increased performance that our method brings to downstream tasks such as segmentation."



Paperid:101
Authors:Fuchen Long, Ting Yao, Zhaofan Qiu, Xinmei Tian, Jiebo Luo, Tao Mei
Title: Learning to Localize Actions from Moments
Abstract:
With the knowledge of action moments (i.e., trimmed video clips that each contain an action instance), humans can routinely localize an action temporally in an untrimmed video. Nevertheless, most practical methods still require all training videos to be labeled with temporal annotations (action category and temporal boundary) and develop models in a fully-supervised manner, despite the expensive labeling effort and inapplicability to new categories. In this paper, we introduce a new transfer learning design to learn action localization for a large set of action categories, using only action moments from the categories of interest and temporal annotations of untrimmed videos from a small set of action classes. Specifically, we present Action Herald Networks (AherNet) that integrate this design into a one-stage action localization framework. Technically, a weight transfer function is uniquely devised to build the transformation between the classification of action moments or foreground video segments and action localization in synthetic contextual moments or untrimmed videos. The context of each moment is learnt through an adversarial mechanism to differentiate the generated features from those of the background in untrimmed videos. Extensive experiments are conducted on learning both across the splits of ActivityNet v1.3 and from THUMOS14 to ActivityNet v1.3. Our AherNet demonstrates superiority even compared to most fully-supervised action localization methods. More remarkably, we train AherNet to localize actions from 600 categories by leveraging action moments in Kinetics-600 and temporal annotations from 200 classes in ActivityNet v1.3."



Paperid:102
Authors:Ziqiang Zheng, Yang Wu, Xinran Han, Jianbo Shi
Title: ForkGAN: Seeing into the Rainy Night
Abstract:
We present ForkGAN, a task-agnostic image translation method that can boost multiple vision tasks in adverse weather conditions. Three tasks are evaluated: image localization/retrieval, semantic image segmentation, and object detection. The key challenge is achieving high-quality image translation without any explicit supervision or task awareness. Our innovation is a fork-shape generator with one encoder and two decoders that disentangles the domain-specific and domain-invariant information. We force the cyclic translation between the weather conditions to go through a common encoding space, and make sure the encoding features reveal no information about the domains. Experimental results show our algorithm produces state-of-the-art image synthesis results and boosts the performance of three vision tasks in adverse weather."



Paperid:103
Authors:Xinwei Sun, Yilun Xu, Peng Cao, Yuqing Kong, Lingjing Hu, Shanghang Zhang, Yizhou Wang
Title: TCGM: An Information-Theoretic Framework for Semi-Supervised Multi-Modality Learning
Abstract:
Fusing data from multiple modalities provides more information to train machine learning systems. However, it is prohibitively expensive and time-consuming to label each modality with a large amount of data, which leads to the crucial problem of semi-supervised multi-modal learning. Existing methods suffer from either ineffective fusion across modalities or a lack of theoretical results under proper assumptions. In this paper, we propose a novel information-theoretic approach, namely Total Correlation Gain Maximization (TCGM), for semi-supervised multi-modal learning, which is endowed with promising properties: (i) it can effectively utilize the information across different modalities of unlabeled data points to facilitate training classifiers of each modality; (ii) it has a theoretical guarantee to identify Bayesian classifiers, i.e., the ground truth posteriors of all modalities. Specifically, by maximizing the TC-induced loss (namely the TC gain) over classifiers of all modalities, these classifiers can cooperatively discover the equivalence class of ground-truth classifiers, and identify the unique ones by leveraging a limited percentage of labeled data. We apply our method and achieve state-of-the-art results on various datasets, including the Newsgroup dataset, emotion recognition (IEMOCAP and MOSI) and medical imaging (Alzheimer’s Disease Neuroimaging Initiative)."



Paperid:104
Authors:Quan Cui, Qing-Yuan Jiang, Xiu-Shen Wei, Wu-Jun Li, Osamu Yoshie
Title: ExchNet: A Unified Hashing Network for Large-Scale Fine-Grained Image Retrieval
Abstract:
Retrieving content-relevant images from a large-scale fine-grained dataset could suffer from intolerably slow query speed and highly redundant storage cost, due to high-dimensional real-valued embeddings which aim to distinguish subtle visual differences of fine-grained objects. In this paper, we study the novel fine-grained hashing topic to generate compact binary codes for fine-grained images, leveraging the search and storage efficiency of hash learning to alleviate the aforementioned problems. Specifically, we propose a unified end-to-end trainable network, termed as ExchNet. Based on attention mechanisms and proposed attention constraints, it can firstly obtain both local and global features to represent object parts and whole fine-grained objects, respectively. Furthermore, to ensure the discriminative ability and semantic meaning's consistency of these part-level features across images, we design a local feature alignment approach by performing a feature exchanging operation. Later, an alternating learning algorithm is employed to optimize the whole ExchNet and then generate the final binary hash codes. Validated by extensive experiments, our ExchNet consistently outperforms state-of-the-art generic hashing methods on five fine-grained datasets, which shows our effectiveness. Moreover, compared with other approximate nearest neighbor methods, ExchNet achieves the best speed-up and storage reduction, revealing its efficiency and practicality."



Paperid:105
Authors:Liming Jiang, Changxu Zhang, Mingyang Huang, Chunxiao Liu, Jianping Shi, Chen Change Loy
Title: TSIT: A Simple and Versatile Framework for Image-to-Image Translation
Abstract:
We introduce a simple and versatile framework for image-to-image translation. We unearth the importance of normalization layers, and provide a carefully designed two-stream generative model with newly proposed feature transformations in a coarse-to-fine fashion. This allows multi-scale semantic structure information and style representation to be effectively captured and fused by the network, permitting our method to scale to various tasks in both unsupervised and supervised settings. No additional constraints (e.g., cycle consistency) are needed, contributing to a very clean and simple method. Multi-modal image synthesis with arbitrary style control is made possible. A systematic study compares the proposed method with several state-of-the-art task-specific baselines, verifying its effectiveness in both perceptual quality and quantitative evaluations."



Paperid:106
Authors:Xiangyu He, Zitao Mo, Ke Cheng, Weixiang Xu, Qinghao Hu, Peisong Wang, Qingshan Liu, Jian Cheng
Title: ProxyBNN: Learning Binarized Neural Networks via Proxy Matrices
Abstract:
Training Binarized Neural Networks (BNNs) is challenging due to the discreteness. In order to efficiently optimize BNNs through backward propagations, real-valued auxiliary variables are commonly used to accumulate gradient updates. Those auxiliary variables are then directly quantized to binary weights in the forward pass, which brings about large quantization errors. In this paper, by introducing an appropriate proxy matrix, we reduce the weights quantization error while circumventing explicit binary regularizations on the full-precision auxiliary variables. Specifically, we regard pre-binarization weights as a linear combination of the basis vectors. The matrix composed of basis vectors is referred to as the proxy matrix, and auxiliary variables serve as the coefficients of this linear combination. We are the first to empirically identify and study the effectiveness of learning both basis and coefficients to construct the pre-binarization weights. This new proxy learning contributes to new leading performances on benchmark datasets."



Paperid:107
Authors:Can Wang, Jiefeng Li, Wentao Liu, Chen Qian, Cewu Lu
Title: HMOR: Hierarchical Multi-Person Ordinal Relations for Monocular Multi-Person 3D Pose Estimation
Abstract:
Remarkable progress has been made in 3D human pose estimation from a monocular RGB camera. However, only a few studies explored 3D multi-person cases. In this paper, we attempt to address the lack of a global perspective of the top-down approaches by introducing a novel form of supervision - Hierarchical Multi-person Ordinal Relations (HMOR). HMOR encodes interaction information as the ordinal relations of depths and angles hierarchically, which captures body-part and joint-level semantics and maintains global consistency at the same time. In our approach, an integrated top-down model is designed to leverage these ordinal relations in the learning process. The integrated model estimates human bounding boxes, human depths, and root-relative 3D poses simultaneously, with a coarse-to-fine architecture to improve the accuracy of depth estimation. The proposed method significantly outperforms state-of-the-art methods on publicly available multi-person 3D pose datasets (9.2 mm improvement on the 3DPW dataset, 12.3 PCK improvement on the MuPoTS-3D dataset, and 20.5 mm improvement on the CMU Panoptic dataset). In addition to its superior performance, our method has lower computational complexity and fewer model parameters. Our code will be made publicly available."



Paperid:108
Authors:Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai
Title: Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve
Abstract:
Object recognition has seen significant progress in the image domain, with focus primarily on 2D perception. We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image by constructing a CAD-based representation of the objects and their poses. We present Mask2CAD, which jointly detects objects in real-world images and, for each detected object, optimizes for the most similar CAD model and its pose. We construct a joint embedding space between the detected regions of an image corresponding to an object and 3D CAD models, enabling retrieval of CAD models for an input RGB image. This produces a clean, lightweight representation of the objects in an image; this CAD-based representation ensures a valid, efficient shape representation for applications such as content creation or interactive scenarios, and makes a step towards understanding the transformation of real-world imagery to a synthetic domain. Experiments on real-world images from Pix3D demonstrate the advantage of our approach in comparison to state of the art. To facilitate future research, we additionally propose a new image-to-3D baseline on ScanNet which features larger shape diversity, real-world occlusions, and challenging image views."



Paperid:109
Authors:Lanlan Liu, Mingzhe Wang, Jia Deng
Title: A Unified Framework of Surrogate Loss by Refactoring and Interpolation
Abstract:
We introduce UniLoss, a unified framework to generate surrogate losses for training deep networks with gradient descent, reducing the amount of manual design of task-specific surrogate losses. Our key observation is that in many cases, evaluating a model with a performance metric on a batch of examples can be refactored into four steps: from input to real-valued scores, from scores to comparisons of pairs of scores, from comparisons to binary variables, and from binary variables to the final performance metric. Using this refactoring we generate differentiable approximations for each non-differentiable step through interpolation. Using UniLoss, we can optimize for different tasks and metrics using one unified framework, achieving performance comparable to that of task-specific losses. We validate the effectiveness of UniLoss on three tasks and four datasets. Code is available at https://github.com/princeton-vl/uniloss."
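To make the interpolation idea concrete, the sketch below relaxes top-1 accuracy: the metric is refactored into score comparisons and the hard indicator is replaced by a sigmoid so the surrogate admits gradients. This is an illustrative relaxation under that refactoring, not the exact interpolation scheme used by UniLoss.

```python
# Illustrative sigmoid relaxation of top-1 accuracy (not UniLoss's exact interpolation).
import torch

def soft_accuracy(scores, targets, temperature=0.1):
    # scores: (N, C) real-valued class scores, targets: (N,) ground-truth labels.
    true_scores = scores.gather(1, targets.unsqueeze(1))            # (N, 1)
    margins = true_scores - scores                                  # (N, C) pairwise comparisons
    soft_greater = torch.sigmoid(margins / temperature)             # relax the indicator 1[margin > 0]
    # An example is "correct" if the true class beats every other class;
    # approximate the conjunction by the minimum over non-target classes.
    mask = torch.ones_like(scores, dtype=torch.bool).scatter_(1, targets.unsqueeze(1), False)
    per_example = soft_greater.masked_fill(~mask, 1.0).min(dim=1).values
    return per_example.mean()   # maximize this (or minimize 1 - soft_accuracy)
```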



Paperid:110
Authors:Sai Bi, Zexiang Xu, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, Ravi Ramamoorthi
Title: Deep Reflectance Volumes: Relightable Reconstructions from Multi-View Photometric Images
Abstract:
We present a deep learning approach to reconstruct scene appearance from unstructured images captured under collocated point lighting. At the heart of Deep Reflectance Volumes is a novel volumetric scene representation consisting of opacity, surface normal and reflectance voxel grids. We present a novel physically-based differentiable volume ray marching framework to render these scene volumes under arbitrary viewpoint and lighting. This allows us to optimize the scene volumes to minimize the error between their rendered images and the captured images. Our method is able to reconstruct real scenes with challenging non-Lambertian reflectance and complex geometry with occlusions and shadowing. Moreover, it accurately generalizes to novel viewpoints and lighting, including non-collocated lighting, rendering photorealistic images that are significantly better than state-of-the-art mesh-based methods. We also show that our learned reflectance volumes are editable, allowing for modifying the materials of the captured scenes."
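The differentiable ray marching the abstract builds on can be sketched with standard opacity (alpha) compositing along a ray. The per-sample colors here stand in for the paper's shading from normals and reflectance under the collocated light, which is a simplification of this sketch.

```python
# Minimal sketch of differentiable volume ray marching with alpha compositing.
import torch

def march_ray(alphas, colors):
    # alphas: (S,) per-sample opacity in [0, 1] along the ray (ordered near to far).
    # colors: (S, 3) per-sample shaded RGB.
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)   # light surviving up to sample i
    weights = transmittance * alphas                             # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=0)           # composited RGB for the pixel

# alphas = torch.rand(64); colors = torch.rand(64, 3)
# print(march_ray(alphas, colors))   # differentiable w.r.t. alphas and colors
```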



Paperid:111
Authors:Tengda Han, Weidi Xie, Andrew Zisserman
Title: Memory-augmented Dense Predictive Coding for Video Representation Learning
Abstract:
The objective of this paper is self-supervised learning from video, in particular of representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task. It is trained with a predictive attention mechanism over the set of compressed memories, such that any future states can always be constructed by a convex combination of the condensed representations, allowing multiple hypotheses to be made efficiently. (ii) We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both. (iii) We thoroughly evaluate the quality of the learnt representation on four different downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification. In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude less training data."



Paperid:112
Authors:Yunlu Chen, Vincent Tao Hu, Efstratios Gavves, Thomas Mensink, Pascal Mettes, Pengwan Yang, Cees G. M. Snoek
Title: PointMixup: Augmentation for Point Clouds
Abstract:
This paper introduces data augmentation for point clouds by interpolation between examples. Data augmentation by interpolation has been shown to be a simple and effective approach in the image domain. Such a mixup is however not directly transferable to point clouds, as we do not have a one-to-one correspondence between the points of two different objects. In this paper, we define data augmentation between point clouds as a shortest path linear interpolation. To that end, we introduce PointMixup, an interpolation method that generates new examples through an optimal assignment of the path function between two point clouds. We prove that our PointMixup finds the shortest path between two point clouds and that the interpolation is assignment invariant and linear. With this definition of interpolation, PointMixup allows us to introduce strong interpolation-based regularizers such as mixup and manifold mixup to the point cloud domain. Experimentally, we show the potential of PointMixup for point cloud classification, especially when examples are scarce, as well as increased robustness to noise and geometric transformations of points. The code for PointMixup and the experimental details are publicly available."
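A rough sketch of interpolation under an optimal point assignment is given below; it uses a generic Hungarian solver on squared Euclidean costs as a stand-in for the paper's optimal-assignment (EMD-style) path, so treat it as an illustration rather than the exact PointMixup procedure.

```python
# Illustrative point-cloud mixup via optimal assignment (not PointMixup's exact algorithm).
import numpy as np
from scipy.optimize import linear_sum_assignment

def point_mixup(pc_a, pc_b, lam):
    # pc_a, pc_b: (N, 3) point clouds with the same number of points; lam in [0, 1].
    cost = ((pc_a[:, None, :] - pc_b[None, :, :]) ** 2).sum(-1)   # (N, N) pairwise squared distances
    rows, cols = linear_sum_assignment(cost)                      # optimal one-to-one matching
    pc_b_aligned = pc_b[cols[np.argsort(rows)]]                   # reorder B to match A's points
    return (1.0 - lam) * pc_a + lam * pc_b_aligned                # linear interpolation along the matching

# mixed = point_mixup(np.random.rand(128, 3), np.random.rand(128, 3), lam=0.4)
```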



Paperid:113
Authors:Kuan Zhu, Haiyun Guo, Zhiwei Liu, Ming Tang, Jinqiao Wang
Title: Identity-Guided Human Semantic Parsing for Person Re-Identification
Abstract:
Existing alignment-based methods have to employ the pre-trained human parsing models to achieve the pixel-level alignment, and cannot identify the personal belongings (e.g., backpacks and reticule) which are crucial to person re-ID. In this paper, we propose the identity-guided human semantic parsing approach (ISP) to locate both the human body parts and personal belongings at pixel-level for aligned person re-ID only with person identity labels. We design the cascaded clustering on feature maps to generate the pseudo-labels of human parts. Specifically, for the pixels of all images of a person, we first group them to foreground or background and then group the foreground pixels to human parts. The cluster assignments are subsequently used as pseudo-labels of human parts to supervise the part estimation and ISP iteratively learns the feature maps and groups them. Finally, local features of both human body parts and personal belongings are obtained according to the self-learned part estimation, and only features of visible parts are utilized for the retrieval. Extensive experiments on three widely used datasets validate the superiority of ISP over lots of state-of-the-art methods."



Paperid:114
Authors:Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge Belongie, Noah Snavely, Bharath Hariharan
Title: Learning Gradient Fields for Shape Generation
Abstract:
In this work, we propose a novel technique to generate shapes from point cloud data. A point cloud can be viewed as samples from a distribution of 3D points whose density is concentrated near the surface of the shape. Point cloud generation thus amounts to moving randomly sampled points to high-density areas. We generate point clouds by performing stochastic gradient ascent on an unnormalized probability density, thereby moving sampled points toward the high-likelihood regions. Our model directly predicts the gradient of the log density field and can be trained with a simple objective adapted from score-based generative models. We show that our method can reach state-of-the-art performance for point cloud auto-encoding and generation, while also allowing for extraction of a high-quality implicit surface."
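The sampling procedure can be sketched as Langevin-style updates that move random points along the predicted gradient of the log density. Here score_net is a placeholder for the paper's trained gradient-field network, and the step sizes are illustrative.

```python
# Minimal sketch of point generation by stochastic gradient ascent on a learned log density.
import torch

@torch.no_grad()
def sample_points(score_net, num_points=2048, steps=100, step_size=1e-2, noise_scale=1e-2):
    x = torch.randn(num_points, 3)                       # start from random 3D points
    for _ in range(steps):
        grad = score_net(x)                              # predicted gradient of the log density, (N, 3)
        x = x + step_size * grad + noise_scale * torch.randn_like(x)
    return x                                             # points concentrate near the shape surface

# score_net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3))
# pc = sample_points(score_net)
```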



Paperid:115
Authors:Kuniaki Saito, Kate Saenko, Ming-Yu Liu
Title: COCO-FUNIT: Few-Shot Unsupervised Image Translation with a Content Conditioned Style Encoder
Abstract:
Unsupervised image-to-image translation intends to learn a mapping of an image in a given domain to an analogous image in a different domain, without explicit supervision of the mapping. Few-shot unsupervised image-to-image translation further attempts to generalize the model to an unseen domain by leveraging example images of the unseen domain provided at inference time. While remarkably successful, existing few-shot image-to-image translation models find it difficult to preserve the structure of the input image while emulating the appearance of the unseen domain, which we refer to as the content loss problem. This is particularly severe when the poses of the objects in the input and example images are very different. To address the issue, we propose a new few-shot image translation model, COCO-FUNIT, which computes the style embedding of the example images conditioned on the input image and a new module called the constant style bias. Through extensive experimental validations with comparison to the state-of-the-art, our model shows effectiveness in addressing the content loss problem. Code and pretrained models are available at https://nvlabs.github.io/COCO-FUNIT/."



Paperid:116
Authors:Kaiwen Duan, Lingxi Xie, Honggang Qi, Song Bai, Qingming Huang, Qi Tian
Title: Corner Proposal Network for Anchor-free, Two-stage Object Detection
Abstract:
The goal of object detection is to determine the class and location of objects in an image. This paper proposes a novel anchor-free, two-stage framework which first extracts a number of object proposals by finding potential corner keypoint combinations and then assigns a class label to each proposal by a standalone classification stage. We demonstrate that these two stages are effective solutions for improving recall and precision, respectively, and they can be integrated into an end-to-end network. Our approach, dubbed Corner Proposal Network (CPN), enjoys the ability to detect objects of various scales and also avoids being confused by a large number of false-positive proposals. On the MS-COCO dataset, CPN achieves an AP of 49.2% which is competitive among state-of-the-art object detection methods. CPN is also computationally efficient, achieving an AP of 41.6%/39.7% at 26.2/43.3 FPS, surpassing most competitors with the same inference speed. Code is available at https://github.com/Duankaiwen/CPNDet"



Paperid:117
Authors:Henghui Ding, Scott Cohen, Brian Price, Xudong Jiang
Title: PhraseClick: Toward Achieving Flexible Interactive Segmentation by Phrase and Click
Abstract:
Existing interactive object segmentation methods mainly take spatial interactions such as bounding boxes or clicks as input. However, these interactions do not contain information about explicit attributes of the target-of-interest and thus cannot quickly specify what the selected object exactly is, especially when there are diverse scales of candidate objects or the target-of-interest contains multiple objects. Therefore, excessive user interactions are often required to reach desirable results. On the other hand, in existing approaches attribute information of objects is often not well utilized in interactive segmentation. We propose to employ phrase expressions as another interaction input to infer the attributes of the target object. In this way, we can 1) leverage spatial clicks to locate the target object and 2) utilize semantic phrases to qualify the attributes of the target object. Specifically, the phrase expressions focus on "what" the target object is and the spatial clicks are in charge of "where" the target object is, which together help to accurately segment the target-of-interest with a smaller number of interactions. Moreover, the proposed approach is flexible in terms of interaction modes and can efficiently handle complex scenarios by leveraging the strengths of each type of input. Our multi-modal phrase+click approach achieves new state-of-the-art performance on interactive segmentation. To the best of our knowledge, this is the first work to leverage both clicks and phrases for interactive segmentation."



Paperid:118
Authors:Yapeng Tian, Dingzeyu Li, Chenliang Xu
Title: Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing
Abstract:
In this paper, we introduce a new problem, named audio-visual video parsing, which aims to parse a video into temporal event segments and label them as either audible, visible, or both. Such a problem is essential for a complete understanding of the scene depicted inside a video. To facilitate exploration, we collect a Look, Listen, and Parse (LLP) dataset to investigate audio-visual video parsing in a weakly-supervised manner. This task can be naturally formulated as a Multimodal Multiple Instance Learning (MMIL) problem. Concretely, we propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously. We develop an attentive MMIL pooling method to adaptively explore useful audio and visual content from different temporal extent and modalities. Furthermore, we discover and mitigate modality bias and noisy label issues with an individual-guided learning mechanism and label smoothing technique, respectively. Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels. Our proposed framework can effectively leverage unimodal and cross-modal temporal contexts and alleviate modality bias and noisy labels problems. "



Paperid:119
Authors:Yuanhao Cai, Zhicheng Wang, Zhengxiong Luo, Binyi Yin, Angang Du, Haoqian Wang, Xiangyu Zhang, Xinyu Zhou, Erjin Zhou, Jian Sun
Title: Learning Delicate Local Representations for Multi-Person Pose Estimation
Abstract:
In this paper, we propose a novel method called Residual Steps Network (RSN). RSN aggregates features with the same spatial size (Intra-level features) efficiently to obtain delicate local representations, which retain rich low-level spatial information and result in precise keypoint localization. Additionally, we observe the output features contribute differently to final performance. To tackle this problem, we propose an efficient attention mechanism - Pose Refine Machine (PRM) to make a trade-off between local and global representations in output features and further refine the keypoint locations. Our approach won the 1st place of COCO Keypoint Challenge 2019 and achieves state-of-the-art results on both COCO and MPII benchmarks, without using extra training data and pretrained model. Our single model achieves 78.6 on COCO test-dev, 93.0 on MPII test dataset. Ensembled models achieve 79.2 on COCO test-dev, 77.1 on COCO test-challenge dataset. The source code is publicly available for further research at https://github.com/caiyuanhao1998/RSN/"



Paperid:120
Authors:Edward Beeching, Jilles Dibangoye, Olivier Simonin, Christian Wolf
Title: Learning to Plan with Uncertain Topological Maps
Abstract:
We train an agent to navigate in 3D environments using a hierarchical strategy including a high-level graph based planner and a local policy. Our main contribution is a data driven learning based approach for planning under uncertainty in topological maps, requiring an estimate of shortest paths in valued graphs with a probabilistic structure. Whereas classical symbolic algorithms achieve optimal results on noise-less topologies, or optimal results in a probabilistic sense on graphs with probabilistic structure, we aim to show that machine learning can overcome missing information in the graph by taking into account rich high-dimensional node features, for instance visual information available at each location of the map. Compared to purely learned neural white box algorithms, we structure our neural model with an inductive bias for dynamic programming based shortest path algorithms, and we show that a particular parameterization of our neural model corresponds to the Bellman-Ford algorithm. By performing an empirical analysis of our method in simulated photo-realistic 3D environments, we demonstrate that the inclusion of visual features in the learned neural planner outperforms classical symbolic solutions for graph based planning. "
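For reference, the classical Bellman-Ford relaxation that the neural planner is related to looks as follows; in the learned model the scalar edge costs are replaced by functions of high-dimensional node features.

```python
# Classical Bellman-Ford shortest-path relaxation on a valued directed graph.
import math

def bellman_ford(num_nodes, edges, source):
    # edges: list of (u, v, cost) tuples for a directed valued graph.
    dist = [math.inf] * num_nodes
    dist[source] = 0.0
    for _ in range(num_nodes - 1):              # at most |V| - 1 rounds of relaxation
        for u, v, cost in edges:
            if dist[u] + cost < dist[v]:
                dist[v] = dist[u] + cost
    return dist

# print(bellman_ford(4, [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0), (2, 3, 1.0)], source=0))
# -> [0.0, 1.0, 3.0, 4.0]
```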



Paperid:121
Authors:Hsin-Ying Lee, Lu Jiang, Irfan Essa, Phuong B Le, Haifeng Gong, Ming-Hsuan Yang, Weilong Yang
Title: Neural Design Network: Graphic Layout Generation with Constraints
Abstract:
Graphic design is essential for visual communication with layouts being fundamental to composing attractive designs. Layout generation differs from pixel-level image synthesis and is unique in terms of the requirement of mutual relations among the desired components. We propose a method for design layout generation that can satisfy user-specified constraints. The proposed neural design network (NDN) consists of three modules. The first module predicts a graph with complete relations from a graph with user-specified relations. The second module generates a layout from the predicted graph. Finally, the third module fine-tunes the predicted layout. Quantitative and qualitative experiments demonstrate that the generated layouts are visually similar to real design layouts. We also construct real designs based on predicted layouts for a better understanding of the visual quality. Finally, we demonstrate a practical application on layout recommendation."



Paperid:122
Authors:Guangyao Chen, Limeng Qiao, Yemin Shi, Peixi Peng, Jia Li, Tiejun Huang, Shiliang Pu, Yonghong Tian
Title: Learning Open Set Network with Discriminative Reciprocal Points
Abstract:
Open set recognition is an emerging research area that aims to simultaneously classify samples from predefined classes and identify the rest as 'unknown'. In this process, one of the key challenges is to reduce the risk of generalizing the inherent characteristics of numerous unknown samples learned from a small amount of known data. In this paper, we propose a new concept, the Reciprocal Point, which is the potential representation of the extra-class space corresponding to each known category. A sample can be classified as known or unknown by its otherness with respect to the reciprocal points. To tackle the open set problem, we offer a novel open space risk regularization term. Based on the bounded space constructed by reciprocal points, the risk of the unknown is reduced through multi-category interaction. We call the resulting learning framework Reciprocal Point Learning (RPL); it can indirectly introduce unknown information into a learner trained only on known classes, so as to learn more compact and discriminative representations. Moreover, we further construct a new large-scale challenging aircraft dataset for open set recognition: Aircraft 300 (Air-300). Extensive experiments on multiple benchmark datasets indicate that our framework is significantly superior to other existing approaches and achieves state-of-the-art performance on standard open set benchmarks."



Paperid:123
Authors:Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, Andreas Geiger
Title: Convolutional Occupancy Networks
Abstract:
Recently, implicit neural representations have gained popularity for learning-based 3D reconstruction. While demonstrating promising results, most implicit approaches are limited to comparably simple geometry of single objects and do not scale to more complicated or large-scale scenes. The key limiting factor of implicit methods is their simple fully-connected network architecture which does not allow for integrating local information in the observations or incorporating inductive biases such as translational equivariance. In this paper, we propose Convolutional Occupancy Networks, a more flexible implicit representation for detailed reconstruction of objects and 3D scenes. By combining convolutional encoders with implicit occupancy decoders, our model incorporates inductive biases, enabling structured reasoning in 3D space. We investigate the effectiveness of the proposed representation by reconstructing complex geometry from noisy point clouds and low-resolution voxel representations. We empirically find that our method enables the fine-grained implicit 3D reconstruction of single objects, scales to large indoor scenes, and generalizes well from synthetic to real data."



Paperid:124
Authors:He Chen, Pengfei Guo, Pengfei Li, Gim Hee Lee, Gregory Chirikjian
Title: Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View Geometry
Abstract:
Epipolar constraints are at the core of feature matching and depth estimation in current multi-person multi-camera 3D human pose estimation methods. Despite the satisfactory performance of this formulation in sparser crowd scenes, its effectiveness is frequently challenged under denser crowd circumstances mainly due to two sources of ambiguity. The first is the mismatch of human joints resulting from the simple cues provided by the Euclidean distances between joints and epipolar lines. The second is the lack of robustness from the naive formulation of the problem as a least squares minimization. In this paper, we depart from the multi-person 3D pose estimation formulation, and instead reformulate it as crowd pose estimation. Our method consists of two key components: a graph model for fast cross-view matching, and a maximum a posteriori (MAP) estimator for the reconstruction of the 3D human poses. We demonstrate the effectiveness and superiority of our proposed method on four benchmark datasets."
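The epipolar cue referred to above is the distance from a joint in one view to the epipolar line induced by its candidate match in another view; a minimal sketch, assuming a known fundamental matrix F between the two cameras, is given below.

```python
# Minimal sketch of the point-to-epipolar-line distance used as a cross-view matching cue.
import numpy as np

def epipolar_distance(x1, x2, F):
    # x1, x2: (2,) pixel coordinates of a joint in view 1 and view 2; F: (3, 3) fundamental matrix.
    p1 = np.array([x1[0], x1[1], 1.0])
    p2 = np.array([x2[0], x2[1], 1.0])
    line = F @ p1                                   # epipolar line of x1 in view 2: a*x + b*y + c = 0
    return abs(p2 @ line) / np.hypot(line[0], line[1])

# Small distances suggest a match; the paper argues this cue alone becomes
# ambiguous in crowded scenes and builds a graph model on top of it instead.
```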



Paperid:125
Authors:Daniel Bolya, Sean Foley, James Hays, Judy Hoffman
Title: TIDE: A General Toolbox for Identifying Object Detection Errors
Abstract:
We introduce TIDE, a framework and associated toolbox for analyzing the sources of error in object detection and instance segmentation algorithms. Importantly, our framework is applicable across datasets and can be applied directly to output prediction files without required knowledge of the underlying prediction system. Thus, our framework can be used as a drop-in replacement for the standard mAP computation while providing a comprehensive analysis of each model's strengths and weaknesses. We segment errors into six types and, crucially, are the first to introduce a technique for measuring the contribution of each error in a way that isolates its effect on overall performance. We show that such a representation is critical for drawing accurate, comprehensive conclusions through in-depth analysis across 4 datasets and 7 recognition models."



Paperid:126
Authors:Saining Xie, Jiatao Gu, Demi Guo, Charles R. Qi, Leonidas Guibas, Or Litany
Title: PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding
Abstract:
Arguably one of the top success stories of deep learning is transfer learning. The finding that pre-training a network on a rich source set (e.g., ImageNet) can help boost performance once fine-tuned on a usually much smaller target set has been instrumental to many applications in language and vision. Yet, very little is known about its usefulness in 3D point cloud understanding. We see this as an opportunity considering the effort required for annotating data in 3D. In this work, we aim at facilitating research on 3D representation learning. Different from previous works, we focus on high-level scene understanding tasks. To this end, we select a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes. Our findings are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets -- demonstrating that the learned representation can generalize across domains. Furthermore, the improvement was similar to supervised pre-training, suggesting that future efforts should favor scaling data collection over more detailed annotation. We hope these findings will encourage more research on unsupervised pretext task design for 3D deep learning."
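A contrastive pre-training objective over matched points can be sketched as an InfoNCE-style loss, as below; the exact loss and point sampling used by PointContrast differ in detail, so this is an illustration only.

```python
# Illustrative point-wise InfoNCE loss over matched points from two views of a scene.
import torch
import torch.nn.functional as F

def point_info_nce(feats_a, feats_b, temperature=0.07):
    # feats_a, feats_b: (N, D) features of N matched points; row i of A corresponds to row i of B.
    feats_a = F.normalize(feats_a, dim=1)
    feats_b = F.normalize(feats_b, dim=1)
    logits = feats_a @ feats_b.t() / temperature     # (N, N) cosine similarities
    targets = torch.arange(feats_a.size(0))          # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# loss = point_info_nce(torch.randn(256, 32), torch.randn(256, 32))
```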



Paperid:127
Authors:Xuefei Ning, Tianchen Zhao, Wenshuo Li, Peng Lei, Yu Wang, Huazhong Yang
Title: DSA: More Efficient Budgeted Pruning via Differentiable Sparsity Allocation
Abstract:
Budgeted pruning is the problem of pruning under resource constraints. In budgeted pruning, how to distribute the resources across layers (i.e., sparsity allocation) is the key problem. Traditional methods solve it by discretely searching for the layer-wise pruning ratios, which lacks efficiency. In this paper, we propose Differentiable Sparsity Allocation (DSA), an efficient end-to-end budgeted pruning flow. Utilizing a novel differentiable pruning process, DSA finds the layer-wise pruning ratios with gradient-based optimization. It allocates sparsity in continuous space, which is more efficient than methods based on discrete evaluation and search. Furthermore, DSA could work in a pruning-from-scratch manner, whereas traditional budgeted pruning methods are applied to pre-trained models. Experimental results on CIFAR-10 and ImageNet show that DSA could achieve superior performance than current iterative budgeted pruning methods, and shorten the time cost of the overall pruning process by at least 1.5x in the meantime."



Paperid:128
Authors:Longhui Wei, An Xiao, Lingxi Xie, Xiaopeng Zhang, Xin Chen, Qi Tian
Title: Circumventing Outliers of AutoAugment with Knowledge Distillation
Abstract:
AutoAugment has been a powerful algorithm that improves the accuracy of many vision tasks, yet it is sensitive to the operator space as well as hyper-parameters, and an improper setting may degenerate network optimization. This paper delves deep into the working mechanism, and reveals that AutoAugment may remove part of discriminative information from the training image and so insisting on the ground-truth label is no longer the best option. To relieve the inaccuracy of supervision, we make use of knowledge distillation that refers to the output of a teacher model to guide network training. Experiments are performed in standard image classification benchmarks, and demonstrate the effectiveness of our approach in suppressing noise of data augmentation and stabilizing training. Upon the cooperation of knowledge distillation and AutoAugment, we claim the new state-of-the-art on ImageNet classification with a top-1 accuracy of 85.7%."
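The distillation signal can be sketched as the standard combination of a hard cross-entropy term and a softened KL term against the teacher; the temperature and mixing weight below are illustrative choices, not the paper's settings.

```python
# Minimal sketch of a knowledge-distillation loss used to soften the supervision.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.5):
    # Hard term: standard cross-entropy against the (possibly noisy) ground-truth label.
    hard = F.cross_entropy(student_logits, targets)
    # Soft term: KL divergence to the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale so gradients are comparable to the hard term
    return (1.0 - alpha) * hard + alpha * soft
```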



Paperid:129
Authors:Hugo Germain, Guillaume Bourmaud, Vincent Lepetit
Title: S2DNet: Learning Image Features for Accurate Sparse-to-Dense Matching
Abstract:
Establishing robust and accurate correspondences is a fundamental backbone to many computer vision algorithms. While recent learning-based feature matching methods have shown promising results in providing robust correspondences under challenging conditions, they are often limited in terms of precision. In this paper, we introduce S2DNet, a novel feature matching pipeline, designed and trained to efficiently establish both robust and accurate correspondences. By leveraging a sparse-to-dense matching paradigm, we cast the correspondence learning problem as a supervised classification task to learn to output highly peaked correspondence maps. We show that S2DNet achieves state-of-the-art results on the HPatches benchmark, as well as on several long-term visual localization datasets. "



Paperid:130
Authors:Peixuan Li, Huaici Zhao, Pengfei Liu, Feidao Cao
Title: RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving
Abstract:
In this work, we propose an efficient and accurate monocular 3D detection framework in a single shot. Most successful 3D detectors take the projection constraint from the 3D bounding box to the 2D box as an important component. Four edges of a 2D box provide only four constraints, and the performance deteriorates dramatically with small errors of the 2D detector. Different from these approaches, our method predicts the nine perspective keypoints of a 3D bounding box in image space, and then utilizes the geometric relationship of 3D and 2D perspectives to recover the dimension, location, and orientation in 3D space. In this method, the properties of the object can be predicted stably even when the estimation of keypoints is very noisy, which enables us to obtain fast detection speed with a small architecture. Training our method uses only the 3D properties of the object without any extra annotations, category-specific 3D shape priors, or depth maps. Our method is the first real-time system (FPS>24) for monocular image 3D detection while achieving state-of-the-art performance on the KITTI benchmark."



Paperid:131
Authors:Xiankai Lu, Wenguan Wang, Martin Danelljan, Tianfei Zhou, Jianbing Shen, Luc Van Gool
Title: Video Object Segmentation with Episodic Graph Memory Networks
Abstract:
How to make a segmentation model efficiently adapt to a specific video as well as online target appearance variations is a fundamental issue in the field of video object segmentation. In this work, a graph memory network is developed to address the novel idea of “learning to update the segmentation model”. Specifically, we exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges. Further, learnable controllers are embedded to ease memory reading and writing, as well as maintain a fixed memory scale. The structured, external memory design enables our model to comprehensively mine and quickly store new knowledge, even with limited visual information, and the differentiable memory controllers slowly learn an abstract method for storing useful representations in the memory and how to later use these representations for prediction, via gradient descent. In addition, the proposed graph memory network yields a neat yet principled framework, which can generalize well to both one-shot and zero-shot video object segmentation tasks. Extensive experiments on four challenging benchmark datasets verify that our graph memory network is able to facilitate the adaptation of the segmentation network for case-by-case video object segmentation."



Paperid:132
Authors:Daquan Zhou, Qibin Hou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan
Title: Rethinking Bottleneck Structure for Efficient Mobile Network Design
Abstract:
The inverted residual block has recently been dominating architecture design for mobile networks. It changes the classic residual bottleneck by introducing two design rules: learning inverted residuals and using linear bottlenecks. In this paper, we rethink the necessity of such design change and find it may bring risks of information loss and gradient confusion. We thus propose to flip the structure and present a novel bottleneck design, called sandglass block, that performs identity mapping and spatial transformation at higher dimensions and thus alleviates information loss and gradient confusion effectively. Extensive experiments demonstrate that, different from the common belief, such bottleneck structure is indeed more beneficial than the inverted ones for mobile networks. In ImageNet classification, by simply replacing the inverted residual block with the sandglass block, without increasing parameters and computations, the classification accuracy can be improved by more than 1.7% over MobileNetV2. On the Pascal VOC 2007 test set, we observe that there is also a 0.9% mAP improvement in object detection. We further verify the effectiveness of the sandglass block by adding it into the search space of the neural architecture search method DARTS. With 25% parameter reduction, the classification accuracy is improved by 0.13% over previous DARTS models. Code will be made publicly available."



Paperid:133
Authors:Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, Jitendra Malik
Title: Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks
Abstract:
When training a neural network for a desired task, one may prefer to adapt a pre-trained network rather than starting from randomly initialized weights. Adaptation can be useful in cases when training data is scarce, when a single learner needs to perform multiple tasks, or when one wishes to encode priors in the network. The most commonly employed approaches for network adaptation are fine-tuning and using the pre-trained network as a fixed feature extractor, among others. In this paper, we propose a straightforward alternative: side-tuning. Side-tuning adapts a pre-trained network by training a lightweight "side" network that is fused with the (unchanged) pre-trained network via summation. This simple method works as well as or better than existing solutions and it resolves some of the basic issues with fine-tuning, fixed features, and other common approaches. In particular, side-tuning is less prone to overfitting, is asymptotically consistent, and does not suffer from catastrophic forgetting in incremental learning. We demonstrate the performance of side-tuning under a diverse set of scenarios, including incremental learning (iCIFAR, iTaskonomy), reinforcement learning, imitation learning (visual navigation in Habitat), NLP question-answering (SQuAD v2), and single-task transfer learning (Taskonomy), with consistently promising results."



Paperid:134
Authors:Zerui Chen, Yan Huang, Hongyuan Yu, Bin Xue, Ke Han, Yiru Guo, Liang Wang
Title: Towards Part-aware Monocular 3D Human Pose Estimation: An Architecture Search Approach
Abstract:
Even though most existing monocular 3D pose estimation approaches achieve very competitive results, they ignore the heterogeneity among human body parts by estimating them with the same network architecture. To accurately estimate 3D poses of different body parts, we attempt to build a part-aware 3D pose estimator by searching a set of network architectures. Consequently, our model automatically learns to select a suitable architecture to estimate each body part. Compared to models built on the commonly used ResNet-50 backbone, it reduces parameters by 62% and achieves better performance. With roughly the same computational complexity as previous models, our approach achieves state-of-the-art results on both the single-person and multi-person 3D pose estimation benchmarks."



Paperid:135
Authors:Angelina Wang, Arvind Narayanan, Olga Russakovsky
Title: REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets
Abstract:
Machine learning models are known to perpetuate and even amplify the biases present in the data. However, these data biases frequently do not become apparent until after the models are deployed. To tackle this issue and to enable the preemptive analysis of large-scale datasets, we present our tool. REVISE (REvealing VIsual biaSEs) is a tool that assists in the investigation of a visual dataset, currently surfacing potential biases along three dimensions: (1) object-based, (2) gender-based, and (3) geography-based. Object-based biases relate to size, context, or diversity of object representation. Gender-based metrics aim to reveal the stereotypical portrayal of people of different genders. Geography-based analyses consider the representation of different geographic locations. REVISE sheds light on the dataset along these dimensions; the responsibility then lies with the user to consider the cultural and historical context, and to determine which of the revealed biases may be problematic. The tool then further assists the user by suggesting actionable steps that may be taken to mitigate the revealed biases. Overall, the key aim of our work is to tackle the machine learning bias problem early in the pipeline. REVISE is available at https://github.com/princetonvisualai/revise-tool."



Paperid:136
Authors:Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, Derek Hoiem
Title: Contrastive Learning for Weakly Supervised Phrase Grounding
Abstract:
Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language model guided word substitutions. Training with our negatives yields a $\sim10\%$ absolute gain in accuracy over randomly-sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of $5.7\%$ to achieve $76.7\%$ accuracy on Flickr30K Entities benchmark. Our code and project material will be available at http://tanmaygupta.info/info-ground."
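The contrastive objective described above can be illustrated with the hedged sketch below: word-region attention pools region features per word, and the attention-weighted compatibility of the matched caption is contrasted against negative captions in an InfoNCE-style loss. Shapes, feature dimensions, and the cosine-based score are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def attention_pooled_score(region_feats, word_feats):
    """Word-region attention, then compatibility between words and attended regions.
    region_feats: (R, D), word_feats: (W, D); returns a scalar score."""
    attn = F.softmax(word_feats @ region_feats.t(), dim=1)        # (W, R)
    attended = attn @ region_feats                                # (W, D)
    return F.cosine_similarity(attended, word_feats, dim=1).mean()

def contrastive_grounding_loss(region_feats, pos_caption, neg_captions):
    """InfoNCE-style bound: matched caption vs. negatives (e.g. word-substituted)."""
    scores = [attention_pooled_score(region_feats, pos_caption)]
    scores += [attention_pooled_score(region_feats, neg) for neg in neg_captions]
    logits = torch.stack(scores).unsqueeze(0)                     # (1, 1 + N)
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))

# Illustrative shapes: 36 regions, 12 caption words, 5 negative captions.
loss = contrastive_grounding_loss(
    torch.randn(36, 256), torch.randn(12, 256), [torch.randn(12, 256) for _ in range(5)])
```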



Paperid:137
Authors:Siyuan Yang, Jun Liu, Shijian Lu, Meng Hwa Er, Alex C. Kot
Title: Collaborative Learning of Gesture Recognition and 3D Hand Pose Estimation with Multi-Order Feature Analysis
Abstract:
Gesture recognition and 3D hand pose estimation are two highly correlated tasks, yet they are often handled separately. In this paper, we present a novel collaborative learning network for joint gesture recognition and 3D hand pose estimation. The proposed network exploits joint-aware features that are crucial for both tasks, with which gesture recognition and 3D hand pose estimation boost each other to learn highly discriminative features and models. In addition, a novel multi-order feature analysis method is introduced which learns posture and multi-order motion information from the intermediate feature maps of videos effectively and efficiently. Due to the exploitation of joint-aware features in common, the proposed technique is capable of learning gesture recognition and 3D hand pose estimation even when only gesture or pose labels are available, and this enables weakly supervised network learning with much reduced data labeling efforts. Extensive experiments show that our proposed method achieves superior gesture recognition and 3D hand pose estimation performance compared with the state-of-the-art. Codes and models will be released upon paper acceptance."



Paperid:138
Authors:Zuxuan Wu, Ser-Nam Lim, Larry S. Davis, Tom Goldstein
Title: Making an Invisibility Cloak: Real World Adversarial Attacks on Object Detectors
Abstract:
We present a systematic study of adversarial attacks on state-of-the-art object detection frameworks. Using standard detection datasets, we train patterns that suppress the objectness scores produced by a range of commonly used detectors, and ensembles of detectors. Through extensive experiments, we benchmark the effectiveness of adversarially trained patches under both white-box and black-box settings, and quantify transferability of attacks between datasets, object classes, and detector models. Finally, we present a detailed study of physical world attacks using printed posters and wearable clothes, and rigorously quantify the performance of such attacks with different metrics."



Paperid:139
Authors:Jianxin Lin, Yingxue Pang, Yingce Xia, Zhibo Chen, Jiebo Luo
Title: TuiGAN: Learning Versatile Image-to-Image Translation with Two Unpaired Images
Abstract:
An unsupervised image-to-image translation (UI2I) task deals with learning a mapping between two domains without paired images. While existing UI2I methods usually require numerous unpaired images from different domains for training, there are many scenarios where training data is quite limited. In this paper, we argue that even if each domain contains a single image, UI2I can still be achieved. To this end, we propose TuiGAN, a generative model that is trained on only two unpaired images and amounts to one-shot unsupervised learning. With TuiGAN, an image is translated in a coarse-to-fine manner where the generated image is gradually refined from global structures to local details. We conduct extensive experiments to verify that our versatile method can outperform strong baselines on a wide variety of UI2I tasks. Moreover, TuiGAN is capable of achieving comparable performance with the state-of-the-art UI2I models trained with sufficient data."



Paperid:140
Authors:Hang Du, Hailin Shi, Yuchi Liu, Jun Wang, Zhen Lei, Dan Zeng, Tao Mei
Title: Semi-Siamese Training for Shallow Face Learning
Abstract:
Most existing public face datasets, such as MS-Celeb-1M and VGGFace2, provide abundant information in both breadth (large number of IDs) and depth (sufficient number of samples) for training. However, in many real-world scenarios of face recognition, the training dataset is limited in depth, i.e., only two face images are available for each ID. We define this situation as Shallow Face Learning, and find it problematic with existing training methods. Unlike deep face data, the shallow face data lacks intra-class diversity. As such, it can lead to collapse of feature dimension and consequently the learned network can easily suffer from degeneration and over-fitting in the collapsed dimension. In this paper, we aim to address the problem by introducing a novel training method named Semi-Siamese Training (SST). A pair of Semi-Siamese networks constitute the forward propagation structure, and the training loss is computed with an updating gallery queue, conducting effective optimization on shallow training data. Our method is developed without extra-dependency, thus can be flexibly integrated with the existing loss functions and network architectures. Extensive experiments on various benchmarks of face recognition show the proposed method significantly improves the training, not only in shallow face learning, but also for conventional deep face data."



Paperid:141
Authors:Haotao Wang, Shupeng Gui, Haichuan Yang, Ji Liu, Zhangyang Wang
Title: GAN Slimming: All-in-One GAN Compression by A Unified Optimization Framework
Abstract:
Generative adversarial networks (GANs) have gained increasing popularity in various computer vision applications, and have recently started to be deployed to resource-constrained mobile devices. Similar to other deep models, state-of-the-art GANs also suffer from high parameter complexities. That has recently motivated the exploration of compressing GANs (usually generators). Compared to the vast literature and prevailing success in compressing deep classifiers, the study of GAN compression remains in its infancy, so far leveraging individual compression techniques instead of more sophisticated combinations. We observe that due to the notorious instability of training GANs, heuristically stacking different compression techniques will result in unsatisfactory results. To this end, we propose the first end-to-end optimization framework combining multiple compression means for GAN compression, dubbed GAN Slimming (GS). GS seamlessly integrates three mainstream compression techniques: model distillation, channel pruning and quantization, together with the GAN minimax objective, into one unified optimization form, that can be efficiently optimized from end to end. Without bells and whistles, GS largely outperforms existing options in compressing image-to-image translation GANs. Specifically, we apply GS to compress CartoonGAN, a state-of-the-art style transfer network, by up to $47\times$, with minimal visual quality degradation. Codes and pre-trained models can be found at https://github.com/TAMU-VITA/GAN-Slimming."



Paperid:142
Authors:Yukun Su, Guosheng Lin, Jinhui Zhu, Qingyao Wu
Title: Human Interaction Learning on 3D Skeleton Point Clouds for Video Violence Recognition
Abstract:
This paper introduces a new method for recognizing violent behavior by learning contextual relationships between related people from human skeleton points. Unlike previous work, we first formulate 3D skeleton point clouds from human skeleton sequences extracted from videos and then perform interaction learning on these 3D skeleton point clouds. A novel Skeleton Points Interaction Learning (SPIL) module is proposed to model the interactions between skeleton points. Specifically, by constructing a specific weight distribution strategy between local regional points, SPIL aims to selectively focus on the most relevant parts of them based on their features and spatial-temporal position information. In order to capture diverse types of relation information, a multi-head mechanism is designed to aggregate different features from independent heads to jointly handle different types of relationships between points. Experimental results show that our model outperforms the existing networks and achieves new state-of-the-art performance on video violence datasets."



Paperid:143
Authors:Jingwei Xin, Nannan Wang, Xinrui Jiang, Jie Li, Heng Huang, Xinbo Gao
Title: Binarized Neural Network for Single Image Super Resolution
Abstract:
Lighter models and faster inference are the focus of current single image super-resolution (SISR) research. However, existing methods are still hard to apply in real-world applications due to their heavy computation requirements. Model quantization is an effective way to significantly reduce model size and computation time. We propose a simple but effective binary neural network (BNN) based SISR model with a novel binarization scheme. Specifically, we design a bit-accumulation mechanism that approximates full-precision values by accumulating the one-bit values of multiple layers. The proposed BNN-based SISR method achieves superior performance with lower computational complexity and fewer model parameters. Extensive experiments show that the proposed model outperforms the state-of-the-art methods (binarization methods such as BNN, DoReFa-Net and ABC-Net) by large margins on 4 benchmark datasets, especially by more than 0.8 dB on average in terms of Peak Signal-to-Noise Ratio (PSNR) on the Set5 dataset."
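The abstract describes a bit-accumulation binarization scheme; the sketch below shows only the generic building block such methods typically start from, a sign binarization with a straight-through estimator for gradients, and does not reproduce the paper's multi-layer accumulation.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator in the backward pass."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)           # one-bit values in {-1, 0, +1}

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass gradients only where the input lies inside the clipping range.
        return grad_output * (x.abs() <= 1.0).float()

x = torch.randn(4, 8, requires_grad=True)
y = BinarizeSTE.apply(x)               # binarized activations/weights
y.sum().backward()                     # gradients flow through the STE
```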



Paperid:144
Authors:Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, Liang-Chieh Chen
Title: Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
Abstract:
Convolution exploits locality for efficiency at the cost of missing long-range context. Self-attention has been adopted to augment CNNs with non-local interactions. Recent works prove it possible to stack self-attention layers to obtain a fully attentional network by restricting the attention to a local region. In this paper, we attempt to remove this constraint by factorizing 2D self-attention into two 1D self-attentions. This reduces computation complexity and allows performing attention within a larger or even global region. In addition, we also propose a position-sensitive self-attention design. Combining both yields our position-sensitive axial-attention layer, a novel building block that one could stack to form axial-attention models for image classification and dense prediction. We demonstrate the effectiveness of our model on four large-scale datasets. In particular, our model outperforms all existing stand-alone self-attention models on ImageNet. Our Axial-DeepLab improves 2.8% PQ over bottom-up state-of-the-art on COCO test-dev. This previous state-of-the-art is attained by our small variant, which is 3.8x more parameter-efficient and 27x more computation-efficient. Axial-DeepLab also achieves state-of-the-art results on Mapillary Vistas and Cityscapes."
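As a hedged illustration of factorizing 2D self-attention into two 1D attentions, the sketch below applies standard multi-head attention first along the height axis and then along the width axis; the paper's position-sensitive terms are omitted and the module sizes are assumptions.

```python
import torch
import torch.nn as nn

class AxialAttention2d(nn.Module):
    """Factorize 2D self-attention into height-axis then width-axis 1D attention."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.h_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.w_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                       # x: (B, C, H, W)
        B, C, H, W = x.shape
        # Attend along the height axis: each column is a sequence of length H.
        h_in = x.permute(0, 3, 2, 1).reshape(B * W, H, C)
        h_out, _ = self.h_attn(h_in, h_in, h_in)
        x = h_out.reshape(B, W, H, C).permute(0, 3, 2, 1)
        # Attend along the width axis: each row is a sequence of length W.
        w_in = x.permute(0, 2, 3, 1).reshape(B * H, W, C)
        w_out, _ = self.w_attn(w_in, w_in, w_in)
        return w_out.reshape(B, H, W, C).permute(0, 3, 1, 2)

y = AxialAttention2d(dim=64)(torch.randn(2, 64, 16, 16))   # -> (2, 64, 16, 16)
```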



Paperid:145
Authors:Zhipeng Fan, Jun Liu, Yao Wang
Title: Adaptive Computationally Efficient Network for Monocular 3D Hand Pose Estimation
Abstract:
3D hand pose estimation is an important task for a wide range of real-world applications. Existing works in this domain mainly focus on designing advanced algorithms to achieve high pose estimation accuracy. However, besides accuracy, the computation efficiency that affects the computation speed and power consumption is also crucial for real-world applications. In this paper, we investigate the problem of reducing the overall computation cost yet maintaining the high accuracy for 3D hand pose estimation from video sequences. A novel model, called Adaptive Computationally Efficient (ACE) network, is proposed, which takes advantage of a Gaussian kernel based Gate Module to dynamically switch the computation between a light model and a heavy network for feature extraction. Our model employs the light model to compute efficient features for most of the frames and invokes the heavy model only when necessary. Combined with the temporal context, the proposed model accurately estimates the 3D hand pose. We evaluate our model on two publicly available datasets, and achieve state-of-the-art performance at only 22% of the computation cost compared to the ordinary temporal models."
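The light/heavy switching idea above can be sketched as follows; the Gaussian-kernel gating rule on raw frame differences and all thresholds are illustrative assumptions rather than the paper's exact Gate Module.

```python
import torch
import torch.nn as nn

class GatedPoseBackbone(nn.Module):
    """Run a cheap backbone by default; fall back to a heavy one when the current
    frame differs too much from the last heavily-processed frame, as measured by a
    Gaussian kernel. An illustrative gating rule, not the paper's exact module."""
    def __init__(self, dim=64, sigma=1.0, threshold=0.5):
        super().__init__()
        self.light = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.ReLU())
        self.heavy = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.sigma, self.threshold = sigma, threshold
        self.last_heavy_frame = None

    def forward(self, frame):                       # frame: (1, 3, H, W)
        if self.last_heavy_frame is None:
            self.last_heavy_frame = frame
            return self.heavy(frame)
        diff = (frame - self.last_heavy_frame).pow(2).mean()
        similarity = torch.exp(-diff / (2 * self.sigma ** 2))   # Gaussian kernel
        if similarity < self.threshold:             # frame changed a lot: heavy path
            self.last_heavy_frame = frame
            return self.heavy(frame)
        return self.light(frame)                    # otherwise the cheap path

net = GatedPoseBackbone()
feats = [net(torch.randn(1, 3, 64, 64)) for _ in range(4)]
```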



Paperid:146
Authors:Jinlong Peng, Changan Wang, Fangbin Wan, Yang Wu, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Yanwei Fu
Title: Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking
Abstract:
Existing Multiple-Object Tracking (MOT) methods either follow the tracking-by-detection paradigm to conduct object detection, feature extraction and data association separately, or have two of the three subtasks integrated to form a partially end-to-end solution. Going beyond these sub-optimal frameworks, we propose a simple online model named Chained-Tracker (CTracker), which naturally integrates all the three subtasks into an end-to-end solution (the first as far as we know). It chains paired bounding box regression results estimated from overlapping nodes, where each node covers two adjacent frames. The paired regression is made attentive by object-attention (brought by a detection module) and identity-attention (ensured by an ID verification module). The two major novelties, chained structure and paired attentive regression, make CTracker simple, fast and effective, setting new MOTA records on MOT16 and MOT17 challenge datasets (67.6 and 66.6, respectively), without relying on any extra training data. The source code of CTracker can be found at: github.com/pjl1995/CTracker."



Paperid:147
Authors:Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, Dahua Lin
Title: Distribution-Balanced Loss for Multi-Label Classification in Long-Tailed Datasets
Abstract:
We present a new loss function called Distribution-Balanced Loss for the multi-label recognition problems that exhibit long-tailed class distributions. Compared to conventional single-label classification problem, multi-label recognition problems are often more challenging due to two significant issues, namely the co-occurrence of labels and the dominance of negative labels (when treated as multiple binary classification problems). The Distribution-Balanced Loss tackles these issues through two key modifications to the standard binary cross-entropy loss: 1) a new way to re-balance the weights that takes into account the impact caused by label co-occurrence, and 2) a negative tolerant regularization to mitigate the over-suppression of negative labels. Experiments on both Pascal VOC and COCO show that the models trained with this new loss function achieve significant performance gains over existing methods. "
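A hedged sketch of the two modifications to binary cross-entropy described above is given below: a re-balancing weight derived from label frequencies (standing in for the paper's co-occurrence-aware weights) and a negative-tolerant term that softens the loss on negative labels. All coefficients are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distribution_balanced_bce(logits, targets, class_freq, lam=2.0, margin=0.1):
    """Sketch of BCE with (1) re-balancing weights from label frequencies and
    (2) a negative-tolerant term that avoids over-suppressing negative labels."""
    # (1) Re-balancing: rarer classes receive larger weights.
    rebalance = (1.0 / class_freq) / (1.0 / class_freq).mean()
    pos_loss = targets * F.softplus(-logits)            # -log sigmoid(z) for positives
    # (2) Negative-tolerant regularization: scaled, shifted logits for negatives so
    # that abundant easy negatives contribute a gentler gradient.
    neg_loss = (1 - targets) * F.softplus(lam * (logits - margin)) / lam
    return (rebalance * (pos_loss + neg_loss)).mean()

logits = torch.randn(4, 20)                    # 4 samples, 20 labels
targets = torch.randint(0, 2, (4, 20)).float()
class_freq = torch.rand(20) + 0.05             # illustrative label frequencies
loss = distribution_balanced_bce(logits, targets, class_freq)
```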



Paperid:148
Authors:Marvin Eisenberger, Daniel Cremers
Title: Hamiltonian Dynamics for Real-World Shape Interpolation
Abstract:
We revisit the classical problem of 3D shape interpolation and propose a novel, physically plausible approach based on Hamiltonian dynamics. While most prior work focuses on synthetic input shapes, our formulation is designed to be applicable to real-world scans with imperfect input correspondences and various types of noise. To that end, we use recent progress on dynamic thin shell simulation and divergence-free shape deformation and combine them to address the inverse problem of finding a plausible intermediate sequence for two input shapes. In comparison to prior work that mainly focuses on small distortion of consecutive frames, we explicitly model volume preservation and momentum conservation, as well as an anisotropic local distortion model. We argue that, in order to get a robust interpolation for imperfect inputs, we need to model the input noise explicitly which results in an alignment based formulation. Finally, we show a qualitative and quantitative improvement over prior work on a broad range of synthetic and scanned data. Besides being more robust to noisy inputs, our method yields exactly volume preserving intermediate shapes, avoids self-intersections and is scalable to high resolution scans."



Paperid:149
Authors:Andrea Burns, Donghyun Kim, Derry Wijaya, Kate Saenko, Bryan A. Plummer
Title: Learning to Scale Multilingual Representations for Vision-Language Tasks
Abstract:
Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few. We use a masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods."



Paperid:150
Authors:Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid
Title: Multi-modal Transformer for Video Retrieval
Abstract:
The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit cross-modal cues present in video. Furthermore, they aggregate per-frame visual features with limited or no temporal information. In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others. The transformer architecture is also leveraged to encode and model the temporal information. On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer. This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets. More details are available at http://thoth.inrialpes.fr/research/MMT."



Paperid:151
Authors:Yanchun Xie, Jimin Xiao, Mingjie Sun, Chao Yao, Kaizhu Huang
Title: Feature Representation Matters: End-to-End Learning for Reference-based Image Super-resolution
Abstract:
In this paper, we are aiming for a general reference-based super-resolution setting: it does not require the low-resolution image and the high-resolution reference image to be well aligned or to have a similar texture. Instead, we only intend to transfer the relevant textures from reference images to the output super-resolution image. To this end, we employ neural texture transfer to swap texture features between the low-resolution image and the high-resolution reference image. We identify the importance of designing super-resolution task-specific features rather than classification-oriented features for neural texture transfer, making the feature extractor more compatible with the image synthesis task. We develop an end-to-end training framework for the reference-based super-resolution task, where the feature encoding network prior to matching and swapping is jointly trained with the image synthesis network. We also discover that learning the high-frequency residual is an effective way for the reference-based super-resolution task. Without bells and whistles, the proposed method E2ENT2 achieves better performance than the state-of-the-art method (i.e., SRNTT with five loss functions) with only two basic loss functions. Extensive experimental results on several datasets demonstrate that the proposed method E2ENT2 can achieve superior performance to existing best models both quantitatively and qualitatively."



Paperid:152
Authors:Zhuo Su, Lan Xu, Zerong Zheng, Tao Yu, Yebin Liu, Lu Fang
Title: RobustFusion: Human Volumetric Capture with Data-driven Visual Cues using a RGBD Camera
Abstract:
High-quality and complete 4D reconstruction of human activities is critical for immersive VR/AR experience, but it suffers from an inherent self-scanning constraint and consequent fragile tracking under the monocular setting. In this paper, inspired by the huge potential of learning-based human modeling, we propose RobustFusion, a robust human performance capture system combined with various data-driven visual cues using a single RGBD camera. To break the orchestrated self-scanning constraint, we propose a data-driven model completion scheme to generate a complete and fine-detailed initial model using only the front-view input. To enable robust tracking, we embrace both the initial model and the various visual cues into a novel performance capture scheme with hybrid motion optimization and semantic volumetric fusion, which can successfully capture challenging human motions under the monocular setting without a pre-scanned detailed template and can reinitialize to recover from tracking failures and disappear-reoccur scenarios. Extensive experiments demonstrate the robustness of our approach to achieve high-quality 4D reconstruction for challenging human motions, liberating the cumbersome self-scanning constraint."



Paperid:153
Authors:Tien Do, Khiem Vuong, Stergios I. Roumeliotis, Hyun Soo Park
Title: Surface Normal Estimation of Tilted Images via Spatial Rectifier
Abstract:
In this paper, we present a spatial rectifier to estimate surface normals of tilted images. Tilted images are of particular interest as more visual data are captured by arbitrarily oriented sensors such as body-/robot-mounted cameras. Existing approaches exhibit bounded performance on predicting surface normals because they were trained using gravity-aligned images. Our two main hypotheses are: (1) visual scene layout is indicative of the gravity direction; and (2) not all surfaces are equally represented by a learned estimator due to the structured distribution of the training data, i.e., there exists a transformation for each tilted image that is more responsive to the learned estimator than others. We design a spatial rectifier that is learned to transform the surface normal distribution of a tilted image to the rectified one that matches the gravity-aligned training data distribution. Along with the spatial rectifier, we propose a novel truncated angular loss that offers a stronger gradient at small angular errors and robustness to outliers. The resulting estimator outperforms the state-of-the-art methods including data augmentation baselines not only on ScanNet and NYUv2 but also on a new dataset called Tilt-RGBD that includes considerable roll and pitch camera motion."



Paperid:154
Authors:Rundi Wu, Xuelin Chen, Yixin Zhuang, Baoquan Chen
Title: Multimodal Shape Completion via Conditional Generative Adversarial Networks
Abstract:
Several deep learning methods have been proposed for completing partial data from shape acquisition setups, i.e., filling the regions that were missing in the shape. These methods, however, only complete the partial shape with a single output, ignoring the ambiguity when reasoning the missing geometry. Hence, we pose a multi-modal shape completion problem, in which we seek to complete the partial shape with multiple outputs by learning a one-to-many mapping. We develop the first multimodal shape completion method that completes the partial shape via conditional generative modeling, without requiring paired training data. Our approach distills the ambiguity by conditioning the completion on a learned multimodal distribution of possible results. We extensively evaluate the approach on several datasets that contain varying forms of shape incompleteness, and compare among several baseline methods and variants of our methods qualitatively and quantitatively, demonstrating the merit of our method in completing partial shapes with both diversity and quality."



Paperid:155
Authors:JunYoung Gwak, Christopher Choy, Silvio Savarese
Title: Generative Sparse Detection Networks for 3D Single-shot Object Detection
Abstract:
3D object detection has been widely studied due to its potential applicability to many promising areas such as robotics and augmented reality. Yet, the sparse nature of the 3D data poses unique challenges to this task. Most notably, the observable surface of the 3D point clouds is disjoint from the center of the instance to ground the bounding box prediction on. To this end, we propose Generative Sparse Detection Network (GSDN), a fully-convolutional single-shot sparse detection network that efficiently generates the support for object proposals. The key component of our model is a generative sparse tensor decoder, which uses a series of transposed convolutions and pruning layers to expand the support of sparse tensors while discarding unlikely object centers to maintain minimal runtime and memory footprint. GSDN can process unprecedentedly large-scale inputs with a single fully-convolutional feed-forward pass, thus does not require the heuristic post-processing stage that stitches results from sliding windows as other previous methods have. We validate our approach on three 3D indoor datasets including the large-scale 3D indoor reconstruction dataset where our method outperforms the state-of-the-art methods by a relative improvement of 7.14% while being 3.78 times faster than the best prior work."



Paperid:156
Authors:Sarah Pratt, Mark Yatskar, Luca Weihs, Ali Farhadi, Aniruddha Kembhavi
Title: Grounded Situation Recognition
Abstract:
We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles (e.g. agent, tool), and bounding-box groundings of entities. GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities, overcoming semantic sparsity, and disambiguating roles. Moreover, unlike in captioning, GSR is straightforward to evaluate. To study this new task we create the Situations With Groundings (SWiG) dataset which adds 278,336 bounding-box groundings to the 11,538 entity classes in the imsitu dataset. We propose a Joint Situation Localizer and find that jointly predicting situations and groundings with end-to-end training handily outperforms independent training with late fusion on the entire grounding metric suite with relative gains between 8% and 32%. Finally, we show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic aware image retrieval. Code and data available at https://prior.allenai.org/projects/gsr ."



Paperid:157
Authors:Shaoxiang Chen, Wenhao Jiang, Wei Liu, Yu-Gang Jiang
Title: Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos
Abstract:
Automatically generating sentences to describe events and temporally localizing sentences in a video are two important tasks that bridge language and videos. Recent techniques leverage the multimodal nature of videos by using off-the-shelf features to represent videos, but interactions between modalities are rarely explored. Inspired by the fact that there exist cross-modal interactions in the human brain, we propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos and thus improve performances on both tasks. We model modality interaction in both the sequence and channel levels in a pairwise fashion, and the pairwise interaction also provides some explainability for the predictions of target tasks. We demonstrate the effectiveness of our method and validate specific design choices through extensive ablation studies. Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets: MSVD and MSR-VTT (event captioning task), and Charades-STA and ActivityNet Captions (temporal sentence localization task)."



Paperid:158
Authors:Xiaohe Wu, Ming Liu, Yue Cao, Dongwei Ren, Wangmeng Zuo
Title: Unpaired Learning of Deep Image Denoising
Abstract:
We investigate the task of learning blind image denoising networks from an unpaired set of clean and noisy images. Such problem setting generally is practical and valuable considering that it is feasible to collect unpaired noisy and clean images in most real-world applications. And we further assume that the noise can be signal dependent but is spatially uncorrelated. In order to facilitate unpaired learning of denoising network, this paper presents a two-stage scheme by incorporating self-supervised learning and knowledge distillation. For self-supervised learning, we suggest a dilated blind-spot network (D-BSN) to learn denoising solely from real noisy images. Due to the spatial independence of noise, we adopt a network by stacking $1\times 1$ convolution layers to estimate the noise level map for each image. Both the D-BSN and image-specific noise model ($\text{CNN}_{\text{est}}$) can be jointly trained via maximizing the constrained log-likelihood. Given the output of D-BSN and estimated noise level map, improved denoising performance can be further obtained based on the Bayes' rule. As for knowledge distillation, we first apply the learned noise models to clean images to synthesize a paired set of training images, and use the real noisy images and the corresponding denoising results in the first stage to form another paired set. Then, the ultimate denoising model can be distilled by training an existing denoising network using these two paired sets. Experiments show that our unpaired learning method performs favorably on both synthetic noisy images and real-world noisy photographs in terms of quantitative and qualitative evaluation. Code is available at https://github.com/XHWXD/DBSN."
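The spatially uncorrelated noise assumption above motivates estimating the noise-level map with a network built only from 1×1 convolutions; the sketch below is a minimal illustration of such an estimator, with depth and width chosen arbitrarily rather than taken from the paper.

```python
import torch
import torch.nn as nn

class NoiseLevelEstimator(nn.Module):
    """Estimate a per-pixel noise-level map using only 1x1 convolutions, so the
    estimator cannot exploit spatial correlations (noise is assumed spatially
    uncorrelated). Depth and width here are illustrative."""
    def __init__(self, in_ch=3, width=16, depth=4):
        super().__init__()
        layers = [nn.Conv2d(in_ch, width, kernel_size=1), nn.ReLU()]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, kernel_size=1), nn.ReLU()]
        layers += [nn.Conv2d(width, 1, kernel_size=1), nn.Softplus()]  # keep sigma > 0
        self.net = nn.Sequential(*layers)

    def forward(self, noisy):                  # noisy: (B, 3, H, W)
        return self.net(noisy)                 # (B, 1, H, W) noise-level map

sigma_map = NoiseLevelEstimator()(torch.randn(2, 3, 32, 32))
```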



Paperid:159
Authors:Yixiao Ge, Haibo Wang, Feng Zhu, Rui Zhao, Hongsheng Li
Title: Self-supervising Fine-grained Region Similarities for Large-scale Image Localization
Abstract:
The task of large-scale retrieval-based image localization is to estimate the geographical location of a query image by recognizing its nearest reference images from a city-scale dataset. However, the general public benchmarks only provide noisy GPS labels associated with the training images, which act as weak supervisions for learning image-to-image similarities. Such label noise prevents deep neural networks from learning discriminative features for accurate localization. To tackle this challenge, we propose to self-supervise image-to-region similarities in order to fully explore the potential of difficult positive images alongside their sub-regions. The estimated image-to-region similarities can serve as extra training supervision for improving the network in generations, which could in turn gradually refine the fine-grained similarities to achieve optimal performance. Our proposed self-enhanced image-to-region similarity labels effectively deal with the training bottleneck in the state-of-the-art pipelines without any additional parameters or manual annotations in both training and inference. Our method outperforms state-of-the-arts on the standard localization benchmarks by noticeable margins and shows excellent generalization capability on multiple image retrieval datasets."



Paperid:160
Authors:Youngjoong Kwon, Stefano Petrangeli, Dahun Kim, Haoliang Wang, Eunbyung Park, Viswanathan Swaminathan, Henry Fuchs
Title: Rotationally-Temporally Consistent Novel View Synthesis of Human Performance Video
Abstract:
Novel view video synthesis aims to synthesize novel viewpoint videos given input captures of a human performance taken from multiple reference viewpoints and over consecutive time steps. Despite great advances in model-free novel view synthesis, existing methods present three limitations when applied to complex and time-varying human performance. First, these methods (and related datasets) mainly consider simple and symmetric objects. Second, they do not enforce explicit consistency across generated views. Third, they focus on static and non-moving objects. The fine-grained details of a human subject can therefore suffer from inconsistencies when synthesized across different viewpoints or time steps. To tackle these challenges, we introduce a human-specific framework that employs a learned 3D-aware representation. Specifically, we first introduce a novel siamese network that employs a gating layer for better reconstruction of the latent volumetric representation and, consequently, final visual results. Moreover, features from consecutive time steps are shared inside the network to improve temporal consistency. Second, we introduce a novel loss to explicitly enforce consistency across generated views both in space and in time. Third, we present the Multi-View Human Action (MVHA) dataset, consisting of nearly 1,200 synthetic human performances captured from 54 viewpoints. Experiments on the MVHA, Pose-Varying Human Model and ShapeNet datasets show that our method outperforms the state-of-the-art baselines both in view generation quality and spatio-temporal consistency."



Paperid:161
Authors:Jiaqi Wang, Wenwei Zhang, Yuhang Cao, Kai Chen, Jiangmiao Pang, Tao Gong, Jianping Shi, Chen Change Loy, Dahua Lin
Title: Side-Aware Boundary Localization for More Precise Object Detection
Abstract:
Current object detection frameworks mainly rely on bounding box regression to localize objects. Despite the remarkable progress in recent years, the precision of bounding box regression remains unsatisfactory, hence limiting performance in object detection. We observe that precise localization requires careful placement of each side of the bounding box. However, the mainstream approach, which focuses on predicting centers and sizes, is not the most effective way to accomplish this task, especially when there exist displacements with large variance between the anchors and the targets. In this paper, we propose an alternative approach, named Side-Aware Boundary Localization (SABL), where each side of the bounding box is respectively localized with a dedicated network branch. To tackle the difficulty of precise localization in the presence of displacements with large variance, we further propose a two-step localization scheme, which first predicts a range of movement through bucket prediction and then pinpoints the precise position within the predicted bucket. We test the proposed method on both two-stage and single-stage detection frameworks. Replacing the standard bounding box regression branch with the proposed design leads to significant improvements on Faster R-CNN, RetinaNet, and Cascade R-CNN, by 3.0%, 1.7%, and 0.9%, respectively. Code is available at https://github.com/open-mmlab/mmdetection."
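The two-step localization described above (coarse bucket prediction followed by fine regression) can be sketched as follows; the bucket layout and decoding rule are illustrative assumptions, not the exact SABL parameterization.

```python
import torch

def decode_side(bucket_logits, offsets, buckets):
    """Two-step side localization: choose a coarse bucket, then refine with the
    per-bucket offset. `buckets` holds bucket center coordinates (illustrative)."""
    idx = bucket_logits.argmax(dim=1)                         # (N,) chosen buckets
    bucket_width = buckets[1] - buckets[0]
    fine = offsets.gather(1, idx.unsqueeze(1)).squeeze(1)     # offset in chosen bucket
    return buckets[idx] + fine * bucket_width                 # refined side coordinate

N, K = 5, 7                                # 5 proposals, 7 buckets per side
buckets = torch.linspace(0, 96, K)         # bucket centers along the image axis
bucket_logits = torch.randn(N, K)          # coarse "which bucket" scores
offsets = torch.randn(N, K) * 0.1          # fine offsets, one per bucket
left_side = decode_side(bucket_logits, offsets, buckets)
```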



Paperid:162
Authors:Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, Zheng Shou
Title: SF-Net: Single-Frame Supervision for Temporal Action Localization
Abstract:
In this paper, we study an intermediate form of supervision, i.e., single-frame supervision, for temporal action localization (TAL). To obtain the single-frame supervision, the annotators are asked to identify only a single frame within the temporal window of an action. This can significantly reduce the labor cost of obtaining full supervision which requires annotating the action boundary. Compared to the weak supervision that only annotates the video-level label, the single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead. To make full use of such single-frame supervision, we propose a unified system called SF-Net. First, we propose to predict an actionness score for each video frame. Along with a typical category score, the actionness score can provide comprehensive information about the occurrence of a potential action and aid the temporal boundary refinement during inference. Second, we mine pseudo action and background frames based on the single-frame annotations. We identify pseudo action frames by adaptively expanding each annotated single frame to its nearby, contextual frames and we mine pseudo background frames from all the unannotated frames across multiple videos. Together with the ground-truth labeled frames, these pseudo-labeled frames are further used for training the classifier."



Paperid:163
Authors:Bin Liu, Yue Cao, Yutong Lin, Qi Li, Zheng Zhang, Mingsheng Long, Han Hu
Title: Negative Margin Matters: Understanding Margin in Few-shot Classification
Abstract:
In this paper, we unconventionally propose to adopt appropriate negative-margin to softmax loss for few-shot classification, which surprisingly works well for the open-set scenarios of few-shot classification. We then provide the intuitive explanation and the theoretical proof to understand why negative margin works well for few-shot classification. This claim is also demonstrated via sufficient experiments. With the negative-margin softmax loss, our approach achieves the state-of-the-art performance on all three standard benchmarks of few-shot classification. In the future, the negative margin may be applied in more general open-set scenarios that do not restrict the number of samples in novel classes."
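A hedged sketch of a cosine softmax loss with an (optionally negative) margin on the target logit is shown below; the scale and margin values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def margin_softmax_loss(features, weight, labels, s=10.0, m=-0.2):
    """Cosine softmax with a margin m on the target logit; m < 0 gives the
    negative-margin variant. features: (N, D), weight: (C, D) class prototypes."""
    cos = F.normalize(features) @ F.normalize(weight).t()     # (N, C) cosine logits
    target_cos = cos.gather(1, labels.unsqueeze(1))
    logits = cos.scatter(1, labels.unsqueeze(1), target_cos - m)
    return F.cross_entropy(s * logits, labels)

features = torch.randn(8, 128)
weight = torch.randn(64, 128)              # 64 classes
labels = torch.randint(0, 64, (8,))
loss = margin_softmax_loss(features, weight, labels)
```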



Paperid:164
Authors:Ruizheng Wu, Xin Tao, Yingcong Chen, Xiaoyong Shen, Jiaya Jia
Title: Particularity beyond Commonality: Unpaired Identity Transfer with Multiple References
Abstract:
Unpaired image-to-image translation aims to translate images from the source class to target one by providing sufficient data for these classes. Current few-shot translation methods use multiple reference images to describe the target domain through extracting common features. In this paper, we focus on a more specific identity transfer problem and advocate that particular property in each individual image can also benefit generation. We accordingly propose a new multi-reference identity transfer framework by simultaneously making use of particularity and commonality of reference. It is achieved via a semantic pyramid alignment module to make proper use of geometric information for individual images, as well as an attention module to aggregate for the final transformation. Extensive experiments demonstrate the effectiveness of our framework given the promising results in a number of identity transfer applications."



Paperid:165
Authors:Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl
Title: Tracking Objects as Points
Abstract:
Tracking has traditionally been the art of following interest points through space and time. This changed with the rise of powerful deep networks. Nowadays, tracking is dominated by pipelines that perform object detection followed by temporal association, also known as tracking-by-detection. In this paper, we present a simultaneous detection and tracking algorithm that is simpler, faster, and more accurate than the state of the art. Our tracker, CenterTrack, applies a detection model to a pair of images and detections from the prior frame. Given this minimal input, CenterTrack localizes objects and predicts their associations with the previous frame. That's it. CenterTrack is simple, online (no peeking into the future), and real-time. It achieves $67.3\%$ MOTA on the MOT17 challenge at 17 FPS and $89.4\%$ MOTA on the KITTI tracking benchmark at 12 FPS, setting a new state of the art on both datasets. CenterTrack is easily extended to monocular 3D tracking by regressing additional 3D attributes. Using monocular video input, it achieves $28.3\%$ AMOTA@0.2 on the newly released nuScenes 3D tracking benchmark, substantially outperforming the monocular baseline on this benchmark while running at 22 FPS."
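The pair-of-frames association idea can be sketched as below: current detections are projected back with predicted offsets and greedily matched to the nearest previous-frame centers. This is an illustrative association step only, with made-up coordinates, not the full CenterTrack pipeline.

```python
import torch

def greedy_offset_association(curr_centers, pred_offsets, prev_centers, max_dist=30.0):
    """Project current detections back with their predicted offsets and greedily
    match each to the nearest unclaimed previous-frame center (illustrative)."""
    projected = curr_centers + pred_offsets                    # estimated prior positions
    dists = torch.cdist(projected, prev_centers)               # (N_curr, N_prev)
    matches, used = {}, set()
    for i in dists.min(dim=1).values.argsort():                # most confident first
        j = int(dists[i].argmin())
        if j not in used and float(dists[i, j]) < max_dist:
            matches[int(i)] = j
            used.add(j)
    return matches                                             # curr index -> prev index

curr = torch.tensor([[10., 10.], [50., 60.]])
offs = torch.tensor([[-2., -1.], [-5., 0.]])
prev = torch.tensor([[8., 9.], [46., 61.]])
print(greedy_offset_association(curr, offs, prev))             # {0: 0, 1: 1}
```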



Paperid:166
Authors:Jiadong Liang, Wenjie Pei, Feng Lu
Title: CPGAN: Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis
Abstract:
Typical methods for text-to-image synthesis seek to design effective generative architecture to model the text-to-image mapping directly. It is fairly arduous due to the cross-modality translation. In this paper we circumvent this problem by focusing on parsing the content of both the input text and the synthesized image thoroughly to model the text-to-image consistency in the semantic level. Particularly, we design a memory structure to parse the textual content by exploring semantic correspondence between each word in the vocabulary to its various visual contexts across relevant images during text encoding. Meanwhile, the synthesized image is parsed to learn its semantics in an object-aware manner. Moreover, we customize a conditional discriminator to model the fine-grained correlations between words and image sub-regions to push for the text-image semantic alignment. Extensive experiments on COCO dataset manifest that our model advances the state-of-the-art performance significantly (from 35.69 to 52.73 in Inception Score)."



Paperid:167
Authors:Fariborz Taherkhani, Ali Dabouei, Sobhan Soleymani, Jeremy Dawson, Nasser M. Nasrabadi
Title: Transporting Labels via Hierarchical Optimal Transport for Semi-Supervised Learning
Abstract:
Semi-Supervised Learning (SSL) methods based on Convolutional Neural Networks (CNNs) have recently been proven to be powerful tools for standard tasks such as image classification when there is not a sufficient amount of labeled data available during training. In this work, we consider the general setting of the SSL problem for image classification, where the labeled and unlabeled data come from the same underlying distribution. We propose a new SSL method that adopts a hierarchical Optimal Transport (OT) technique to find a mapping from empirical unlabeled measures to corresponding labeled measures by leveraging the minimum amount of transportation cost in the label space. Based on this mapping, pseudo-labels for the unlabeled data are inferred, which are then used along with the labeled data for training the CNN. We evaluated and compared our method with state-of-the-art SSL approaches on standard datasets to demonstrate the superiority of our SSL method."



Paperid:168
Authors:Simon Vandenhende, Stamatios Georgoulis, Luc Van Gool
Title: MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning
Abstract:
In this paper, we argue for the importance of considering task interactions at multiple scales when distilling task information in a multi-task learning setup. In contrast to common belief, we show that tasks with high affinity at a certain scale are not guaranteed to retain this behaviour at other scales, and vice versa. We propose a novel architecture, namely MTI-Net, that builds upon this finding in three ways. First, it explicitly models task interactions at every scale via a multi-scale multi-modal distillation unit. Second, it propagates distilled task information from lower to higher scales via a feature propagation module. Third, it aggregates the refined task features from all scales via a feature aggregation unit to produce the final per-task predictions. Extensive experiments on two multi-task dense labeling datasets show that, unlike prior work, our multi-task model delivers on the full potential of multi-task learning, that is, a smaller memory footprint, a reduced number of calculations, and better performance w.r.t. single-task learning. The code is made publicly available."



Paperid:169
Authors:Andrew Liu, Shiry Ginosar, Tinghui Zhou, Alexei A. Efros, Noah Snavely
Title: Learning to Factorize and Relight a City
Abstract:
We propose a learning-based framework for disentangling outdoor scenes into temporally-varying illumination and permanent scene factors. Inspired by the classic intrinsic image decomposition, our learning signal builds upon two insights: 1) combining the disentangled factors should reconstruct the original image, and 2) the permanent factors should stay constant across multiple temporal samples of the same scene. To facilitate training, we assemble a city-scale dataset of outdoor timelapse imagery from Google Street View, where the same locations are captured repeatedly through time. This data represents an unprecedented scale of spatio-temporal outdoor imagery. We show that our learned disentangled factors can be used to manipulate novel images in realistic ways, such as changing lighting effects and scene geometry. Please visit factorize-a-city.github.io for animated results."



Paperid:170
Authors:Guo-Sen Xie, Li Liu, Fan Zhu, Fang Zhao, Zheng Zhang, Yazhou Yao, Jie Qin, Ling Shao
Title: Region Graph Embedding Network for Zero-Shot Learning
Abstract:
Most of the existing Zero-Shot Learning (ZSL) approaches learn direct embeddings from global features or image parts (regions) to the semantic space, which, however, fail to capture the appearance relationships between different local regions within a single image. In this paper, to model the relations among local image regions, we incorporate the region-based relation reasoning into ZSL. Our method, termed as Region Graph Embedding Network (RGEN), is trained end-to-end from raw image data. Specifically, RGEN consists of two branches: the Constrained Part Attention (CPA) branch and the Parts Relation Reasoning (PRR) branch. CPA branch is built upon attention and produces the image regions. To exploit the progressive interactions among these regions, we represent them as a region graph, on which the parts relation reasoning is performed with graph convolutions, thus leading to our PRR branch. To train our model, we introduce both a transfer loss and a balance loss to contrast class similarities and pursue the maximum response consistency among seen and unseen outputs, respectively. Extensive experiments on four datasets well validate the effectiveness of the proposed method under both ZSL and generalized ZSL settings."



Paperid:171
Authors:Omid Taheri, Nima Ghorbani, Michael J. Black, Dimitrios Tzionas
Title: GRAB: A Dataset of Whole-Body Human Grasping of Objects
Abstract:
Training computers to understand, model, and synthesize human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time. While "grasping" is commonly thought of as a single hand stably lifting an object, we capture the motion of the entire body and adopt the generalized notion of "whole-body grasps". Thus, we collect a new dataset, called GRAB (GRasping Actions with Bodies), of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size. Given MoCap markers, we fit the full 3D body shape and pose, including the articulated face and hands, as well as the 3D object pose. This gives detailed 3D meshes over time, from which we compute contact between the body and object. This is a unique dataset, that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task. We illustrate the practical value of GRAB with an example application; we train GrabNet, a conditional generative network, to predict 3D hand grasps for unseen 3D object shapes. The dataset and code are available for research purposes at https://grab.is.tue.mpg.de."



Paperid:172
Authors:Edgar Tretschk, Ayush Tewari, Michael Zollhöfer, Vladislav Golyanik, Christian Theobalt
Title: DEMEA: Deep Mesh Autoencoders for Non-Rigidly Deforming Objects
Abstract:
Mesh autoencoders are commonly used for dimensionality reduction, sampling and mesh modeling. We propose a general-purpose DEep MEsh Autoencoder (DEMEA) which adds a novel embedded deformation layer to a graph-convolutional mesh autoencoder. The embedded deformation layer (EDL) is a differentiable deformable geometric proxy which explicitly models point displacements of non-rigid deformations in a lower dimensional space and serves as a local rigidity regularizer. DEMEA decouples the parameterization of the deformation from the final mesh resolution since the deformation is defined over a lower dimensional embedded deformation graph. We perform a large-scale study on four different datasets of deformable objects. Reasoning about the local rigidity of meshes using EDL allows us to achieve higher-quality results for highly deformable objects, compared to directly regressing vertex positions. We demonstrate multiple applications of DEMEA, including non-rigid 3D reconstruction from depth and shading cues, non-rigid surface tracking, as well as the transfer of deformations over different meshes."



Paperid:173
Authors:Xi Shen, François Darmon, Alexei A. Efros, Mathieu Aubry
Title: RANSAC-Flow: Generic Two-stage Image Alignment
Abstract:
This paper considers the generic problem of dense alignment between two images, whether they be two frames of a video, two widely different views of a scene, two paintings depicting similar content, etc. Whereas each such task is typically addressed with a domain-specific solution, we show that a simple unsupervised approach performs surprisingly well across a range of tasks. Our main insight is that parametric and non-parametric alignment methods have complementary strengths. We propose a two-stage process: first, a feature-based parametric coarse alignment using one or more homographies, followed by non-parametric fine pixel-wise alignment. Coarse alignment is performed using RANSAC on off-the-shelf deep features. Fine alignment is learned in an unsupervised way by a deep network which optimizes a standard structural similarity metric (SSIM) between the two images, plus cycle-consistency. Despite its simplicity, our method shows competitive results on a range of tasks and datasets, including unsupervised optical flow on KITTI, dense correspondences on HPatches, two-view geometry estimation on YFCC100M, localization on Aachen Day-Night, and, for the first time, fine alignment of artworks on the Brueghel dataset. Our code and data are available at http://imagine.enpc.fr/~shenx/RANSAC-Flow ."



Paperid:174
Authors:Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
Title: Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds
Abstract:
Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a $360^{\circ}$ camera."



Paperid:175
Authors:Kiru Park, Timothy Patten, Markus Vincze
Title: Neural Object Learning for 6D Pose Estimation Using a Few Cluttered Images
Abstract:
Recent methods for 6D pose estimation of objects assume either textured 3D models or real images that cover the entire range of target poses. However, it is difficult to obtain textured 3D models and annotate the poses of objects in real scenarios. This paper proposes a method, Neural Object Learning (NOL), that creates synthetic images of objects in arbitrary poses by combining only a few observations from cluttered images. A novel refinement step is proposed to align inaccurate poses of objects in source images, which results in better quality images. Evaluations performed on two public datasets show that the rendered images created by NOL lead to state-of-the-art performance in comparison to methods that use 13 times the number of real images. Evaluations on our new dataset show multiple objects can be trained and recognized simultaneously using a sequence of a fixed scene. "



Paperid:176
Authors:Jianfeng Yan, Zizhuang Wei, Hongwei Yi, Mingyu Ding, Runze Zhang, Yisong Chen, Guoping Wang, Yu-Wing Tai
Title: Dense Hybrid Recurrent Multi-view Stereo Net with Dynamic Consistency Checking
Abstract:
In this paper, we propose an efficient and effective dense hybrid recurrent multi-view stereo net with dynamic consistency checking, namely $D^{2}$HC-RMVSNet, for accurate dense point cloud reconstruction. Our novel hybrid recurrent multi-view stereo net consists of two core modules: 1) a light DRENet (Dense Reception Expanded) module to extract dense feature maps of original size with multi-scale context information, 2) a hybrid HRU-LSTM (Hybrid Recurrent U-LSTM) to regularize the 3D matching volume into a predicted depth map, which efficiently aggregates different scale information by coupling LSTM and U-Net architectures. To further improve the accuracy and completeness of reconstructed point clouds, we leverage a dynamic consistency checking strategy instead of the prefixed parameters and strategies widely adopted in existing methods for dense point cloud reconstruction. In doing so, we dynamically aggregate geometric consistency matching error among all the views. Our method ranks \textbf{$1^{st}$} on the complex outdoor \textsl{Tanks and Temples} benchmark over all the methods. Extensive experiments on the indoor \textsl{DTU} dataset show our method exhibits competitive performance to the state-of-the-art method while dramatically reducing memory consumption, requiring only $19.4\%$ of the memory of R-MVSNet. The codebase is available at \hyperlink{https://github.com/yhw-yhw/D2HC-RMVSNet}{https://github.com/yhw-yhw/D2HC-RMVSNet}."



Paperid:177
Authors:Xuchong Qiu, Yang Xiao, Chaohui Wang, Renaud Marlet
Title: Pixel-Pair Occlusion Relationship Map (P2ORM): Formulation, Inference & Application
Abstract:
Inference & Application","We formalize concepts around geometric occlusion in 2D images (i.e., ignoring semantics), and propose a novel unified formulation of both occlusion boundaries and occlusion orientations via a pixel-pair occlusion relation. The former provides a way to generate large-scale accurate occlusion datasets while, based on the latter, we propose a novel method for task-independent pixel-level occlusion relationship estimation from single images. Experiments on a variety of datasets demonstrate that our method outperforms existing ones on this task. To further illustrate the value of our formulation, we also propose a new depth map refinement method that consistently improve the performance of state-of-the-art monocular depth estimation methods."



Paperid:178
Authors:Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, Dahua Lin
Title: MovieNet: A Holistic Dataset for Movie Understanding
Abstract:
Recent years have seen remarkable advances in visual understanding. However, how to understand a story-based long video with artistic styles, e.g. a movie, remains challenging. In this paper, we introduce MovieNet -- a holistic dataset for movie understanding. MovieNet contains 1,100 movies with a large amount of multi-modal data, e.g. trailers, photos, plot descriptions, etc. Besides, different aspects of manual annotations are provided in MovieNet, including 1.1M characters with bounding boxes and identities, 42K scene boundaries, 2.5K aligned description sentences, 65K tags of place and action, and 92K tags of cinematic style. To the best of our knowledge, MovieNet is the largest dataset with the richest annotations for comprehensive movie understanding. Based on MovieNet, we set up several benchmarks for movie understanding from different angles. Extensive experiments are executed on these benchmarks to show the immeasurable value of MovieNet and the gap between current approaches and comprehensive movie understanding. We believe that such a holistic dataset would promote research on story-based long video understanding and beyond. MovieNet will be published in compliance with regulations at https://movienet.github.io."



Paperid:179
Authors:Ang Li, Shanshan Zhao, Xingjun Ma, Mingming Gong, Jianzhong Qi, Rui Zhang, Dacheng Tao, Ramamohanarao Kotagiri
Title: Short-Term and Long-Term Context Aggregation Network for Video Inpainting
Abstract:
Video inpainting aims to restore missing regions of a video and has many applications such as video editing and object removal. However, existing methods either suffer from inaccurate short-term context aggregation or rarely explore long-term frame information. In this work, we present a novel context aggregation network to effectively exploit both short-term and long-term frame information for video inpainting. In the encoding stage, we propose boundary-aware short-term context aggregation, which aligns and aggregates, from neighbor frames, local regions that are closely related to the boundary context of missing regions into the target frame. Furthermore, we propose dynamic long-term context aggregation to globally refine the feature map generated in the encoding stage using long-term frame features, which are dynamically updated throughout the inpainting process. Experiments show that it outperforms state-of-the-art methods with better inpainting results and fast inpainting speed."



Paperid:180
Authors:Juan Du, Rui Wang, Daniel Cremers
Title: DH3D: Deep Hierarchical 3D Descriptors for Robust Large-Scale 6DoF Relocalization
Abstract:
For relocalization in large-scale point clouds, we propose the first approach that unifies global place recognition and local 6DoF pose refinement. To this end, we design a Siamese network that jointly learns 3D local feature detection and description directly from raw 3D points. It integrates FlexConv and Squeeze-and-Excitation (SE) to assure that the learned local descriptor captures multi-level geometric information and channel-wise relations. For detecting 3D keypoints we predict the discriminativeness of the local descriptors in an unsupervised manner. We generate the global descriptor by directly aggregating the learned local descriptors with an effective attention mechanism. In this way, local and global 3D descriptors are inferred in one single forward pass. Experiments on various benchmarks demonstrate that our method achieves competitive results for both global point cloud retrieval and local point cloud registration in comparison to state-of-the-art approaches. To validate the generalizability and robustness of our 3D keypoints, we demonstrate that our method also performs favorably without fine-tuning on the registration of point clouds that were generated by a visual SLAM system. Code and related materials are available at https://vision.in.tum.de/research/vslam/dh3d."



Paperid:181
Authors:Xiaobin Hu, Wenqi Ren, John LaMaster, Xiaochun Cao, Xiaoming Li, Zechao Li, Bjoern Menze, Wei Liu
Title: Face Super-Resolution Guided by 3D Facial Priors
Abstract:
State-of-the-art face super-resolution methods employ deep convolutional neural networks to learn a mapping between low- and high-resolution facial patterns by exploring local appearance knowledge. However, most of these methods do not well exploit facial structures and identity information, and struggle to deal with facial images that exhibit large pose variations. In this paper, we propose a novel face super-resolution method that explicitly incorporates 3D facial priors which grasp the sharp facial structures. Our work is the first to explore 3D morphable knowledge based on the fusion of parametric descriptions of face attributes (e.g., identity, facial expression, texture, illumination, and face pose). Furthermore, the priors can easily be incorporated into any networks and are extremely efficient in improving the performance and accelerating the convergence speed. Firstly, a 3D face rendering branch is set up to obtain 3D priors of salient facial structures and identity knowledge. Secondly, the Spatial Attention Module is used to better exploit this hierarchical information (i.e., intensity similarity, 3D facial structure, and identity content) for the super-resolution problem. Extensive experiments demonstrate that the proposed 3D priors achieve superior face super-resolution results over the state-of-the-arts."



Paperid:182
Authors:Yabin Zhang, Bin Deng, Kui Jia, Lei Zhang
Title: Label Propagation with Augmented Anchors: A Simple Semi-Supervised Learning baseline for Unsupervised Domain Adaptation
Abstract:
Motivated by the problem relatedness between unsupervised domain adaptation (UDA) and semi-supervised learning (SSL), many state-of-the-art UDA methods adopt SSL principles (e.g., the cluster assumption) as their learning ingredients. However, they tend to overlook the very domain-shift nature of UDA. In this work, we take a step further to study the proper extensions of SSL techniques for UDA. Taking the algorithm of label propagation (LP) as an example, we analyze the challenges of adopting LP to UDA and theoretically analyze the conditions of affinity graph/matrix construction in order to achieve better propagation of true labels to unlabeled instances. Our analysis suggests a new algorithm of Label Propagation with Augmented Anchors (A$^2$LP), which could potentially improve LP via generation of unlabeled virtual instances (i.e., the augmented anchors) with high-confidence label predictions. To make the proposed A$^2$LP useful for UDA, we propose empirical schemes to generate such virtual instances. The proposed schemes also tackle the domain-shift challenge of UDA by alternating between pseudo labeling via A$^2$LP and domain-invariant feature learning. Experiments show that such a simple SSL extension improves over representative UDA methods of domain-invariant feature learning, and could empower two state-of-the-art methods on benchmark UDA datasets. Our results show the value of further investigation on SSL techniques for UDA problems."
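For readers unfamiliar with the label propagation (LP) step that A$^2$LP builds on, the sketch below shows the classic closed-form propagation over a Gaussian affinity graph. It is not A$^2$LP itself: the augmented-anchor generation and the alternation with domain-invariant feature learning described in the abstract are omitted, and the function name and hyper-parameters are illustrative.

```python
import numpy as np

def label_propagation(features, labels, n_classes, alpha=0.99, sigma=1.0):
    """Classic graph label propagation: build a Gaussian affinity graph over
    `features`, then solve F = (I - alpha * S)^{-1} Y.  Entries of `labels`
    equal to -1 are treated as unlabeled."""
    n = features.shape[0]
    d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(1) + 1e-12)
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # symmetric normalization
    Y = np.zeros((n, n_classes))
    labeled = labels >= 0
    Y[np.arange(n)[labeled], labels[labeled]] = 1.0
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)       # closed-form propagation
    return F.argmax(1)                                  # predicted labels
```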



Paperid:183
Authors:Chenxi Liu, Piotr Dollár, Kaiming He, Ross Girshick, Alan Yuille, Saining Xie
Title: Are Labels Necessary for Neural Architecture Search?
Abstract:
Existing neural network architectures in computer vision --- whether designed by humans or by machines --- were typically found using both images and their associated labels. In this paper, we ask the question: can we find high-quality neural architectures using only images, but no human-annotated labels? To answer this question, we first define a new setup called Unsupervised Neural Architecture Search (UnNAS). We then conduct two sets of experiments. In sample-based experiments, we train a large number (500) of diverse architectures with either supervised or unsupervised objectives, and found that the architecture rankings produced with and without labels are highly correlated. In search-based experiments, we run a well-established NAS algorithm (DARTS) using various unsupervised objectives, and report that the architectures searched without labels can be competitive to their counterparts searched with labels. Together, these results reveal the potentially surprising finding that labels are not necessary, and the image statistics alone may be sufficient to identify good neural architectures. "



Paperid:184
Authors:Haoyang Wang, Riza Alp Güler, Iasonas Kokkinos, George Papandreou, Stefanos Zafeiriou
Title: BLSM: A Bone-Level Skinned Model of the Human Mesh
Abstract:
We introduce BLSM, a bone-level skinned model of the human body mesh where bone scales are set prior to template synthesis, rather than the common, inverse practice. BLSM first sets bone lengths and joint angles to specify the skeleton, then specifies identity-specific surface variation, and finally bundles them together through linear blend skinning. We design these steps by constraining the joint angles to respect the kinematic constraints of the human body and by using accurate mesh convolution-based networks to capture identity-specific surface variation. We provide quantitative results on the problem of reconstructing a collection of 3D human scans, and show that we obtain improvements in reconstruction accuracy when comparing to a SMPL-type baseline. Our decoupled bone and shape representation also allows for out-of-box integration with standard graphics packages like Unity, facilitating full-body AR effects and image-driven character animation. Additional results and demos are available from the project webpage: http://arielai.com/blsm"



Paperid:185
Authors:Arman Afrasiyabi, Jean-François Lalonde, Christian Gagné
Title: Associative Alignment for Few-shot Image Classification
Abstract:
Few-shot image classification aims at training a model from only a few examples for each of the ``novel'' classes. This paper proposes the idea of associative alignment for leveraging part of the base data by aligning the novel training instances to the closely related ones in the base training set. This expands the size of the effective novel training set by adding extra ``related base'' instances to the few novel ones, thereby allowing a constructive fine-tuning. We propose two associative alignment strategies: 1) a metric-learning loss for minimizing the distance between related base samples and the centroid of novel instances in the feature space, and 2) a conditional adversarial alignment loss based on the Wasserstein distance. Experiments on four standard datasets and three backbones demonstrate that combining our centroid-based alignment loss results in absolute accuracy improvements of 4.4%, 1.2%, and 6.2% in 5-shot learning over the state of the art for object recognition, fine-grained classification, and cross-domain adaptation, respectively. "
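As a rough illustration of the centroid-based alignment idea (strategy 1 above), the sketch below pulls the features of "related base" instances toward the centroid of the corresponding novel-class features. How the related base instances are selected, the backbone, and the adversarial variant are outside this snippet, and all names and the squared-distance choice are assumptions of this sketch.

```python
import torch

def centroid_alignment_loss(novel_feats, novel_labels, base_feats, base_labels):
    """Pull each 'related base' feature toward the centroid of the novel-class
    features it has been associated with (association assumed done already)."""
    loss, classes = 0.0, novel_labels.unique()
    for c in classes:
        centroid = novel_feats[novel_labels == c].mean(0)
        related = base_feats[base_labels == c]
        if related.numel():
            loss = loss + ((related - centroid) ** 2).sum(1).mean()
    return loss / classes.numel()
```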



Paperid:186
Authors:Dvir Ginzburg, Dan Raviv
Title: Cyclic Functional Mapping: Self-supervised Correspondence between Non-isometric Deformable Shapes
Abstract:
We present the first utterly self-supervised network for dense correspondence mapping between non-isometric shapes. The task of alignment in non-Euclidean domains is one of the most fundamental and crucial problems in computer vision. As 3D scanners can generate highly complex and dense models, the mission of finding dense mappings between those models is vital. The novelty of our solution is based on a cyclic mapping between metric spaces, where the distance between a pair of points should remain invariant after the full cycle. As the same learnable rules that generate the point-wise descriptors apply in both directions, the network learns invariant structures without any labels while coping with non-isometric deformations. We show state-of-the-art results by a large margin for a variety of tasks compared to known self-supervised and supervised methods."



Paperid:187
Authors:Jennifer J. Sun, Jiaping Zhao, Liang-Chieh Chen, Florian Schroff, Hartwig Adam, Ting Liu
Title: View-Invariant Probabilistic Embedding for Human Pose
Abstract:
Depictions of similar human body configurations can vary with changing viewpoints. Using only 2D information, we would like to enable vision algorithms to recognize similarity in human body poses across multiple views. This ability is useful for analyzing body movements and human behaviors in images and videos. In this paper, we propose an approach for learning a compact view-invariant embedding space from 2D joint keypoints alone, without explicitly predicting 3D poses. Since 2D poses are projected from 3D space, they have an inherent ambiguity, which is difficult to represent through a deterministic mapping. Hence, we use probabilistic embeddings to model this input uncertainty. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 2D-to-3D pose lifting models. We also demonstrate the effectiveness of applying our embeddings to view-invariant action recognition and video alignment. Our code will be released for research."



Paperid:188
Authors:Davis Rempe, Leonidas J. Guibas, Aaron Hertzmann, Bryan Russell, Ruben Villegas, Jimei Yang
Title: Contact and Human Dynamics from Monocular Video
Abstract:
Existing deep models predict 2D and 3D kinematic poses from video that are approximately accurate, but contain visible errors that violate physical constraints, such as feet penetrating the ground and bodies leaning at extreme angles. In this paper, we present a physics-based method for inferring 3D human motion from video sequences that takes initial 2D and 3D pose estimates as input. We first estimate ground contact timings with a novel prediction network which is trained without hand-labeled data. A physics-based trajectory optimization then solves for a physically-plausible motion, based on the inputs. We show this process produces motions that are significantly more realistic than those from purely kinematic methods, substantially improving quantitative measures of both kinematic and dynamic plausibility. We demonstrate our method on character animation and pose estimation tasks on dynamic motions of dancing and sports with complex contact patterns."



Paperid:189
Authors:Wenxuan Wu, Zhi Yuan Wang, Zhuwen Li, Wei Liu, Li Fuxin
Title: PointPWC-Net: Cost Volume on Point Clouds for (Self-)Supervised Scene Flow Estimation
Abstract:
We propose a novel end-to-end deep scene flow model, called PointPWC-Net, that directly processes 3D point cloud scenes with large motions in a coarse-to-fine fashion. Flow computed at the coarse level is upsampled and warped to a finer level, enabling the algorithm to accommodate for large motion without a prohibitive search space. We introduce novel cost volume, upsampling, and warping layers to efficiently handle 3D point cloud data. Unlike traditional cost volumes that require exhaustively computing all the cost values on a high-dimensional grid, our point-based formulation discretizes the cost volume onto input 3D points, and a PointConv operation efficiently computes convolutions on the cost volume. Experiment results on FlyingThings3D and KITTI outperform the state-of-the-art by a large margin. We further explore novel self-supervised losses to train our model and achieve comparable results to state-of-the-art trained with supervised loss. Without any fine-tuning, our method also shows great generalization ability on the KITTI Scene Flow 2015 dataset, outperforming all previous methods."



Paperid:190
Authors:Philipp Erler, Paul Guerrero, Stefan Ohrhallinger, Niloy J. Mitra, Michael Wimmer
Title: Points2Surf: Learning Implicit Surfaces from Point Clouds
Abstract:
A key step in any scanning-based asset creation workflow is to convert unordered point clouds to a surface. Classical methods (e.g. Poisson reconstruction) start to degrade in the presence of noisy and partial scans. Hence, deep learning based methods have recently been proposed to produce complete surfaces, even from partial scans. However, such data-driven methods struggle to generalize to new shapes with large geometric and topological variations. We present Points2Surf, a novel patch-based learning framework that produces accurate surfaces directly from raw scans without normals. Learning a prior over a combination of detailed local patches and coarse global information improves generalization performance and reconstruction accuracy. Our extensive comparison on both synthetic and real data demonstrates a clear advantage of our method over state-of-the-art alternatives on previously unseen classes (on average, Points2Surf brings down reconstruction error by 30% over SPR and by 270%+ over deep learning based SotA methods) at the cost of longer computation times and a slight increase in small-scale topological noise in some cases. Our source code, pre-trained model, and dataset are available at: https://github.com/ErlerPhilipp/points2surf"



Paperid:191
Authors:Yiwei Lu, Frank Yu, Mahesh Kumar Krishna Reddy, Yang Wang
Title: Few-Shot Scene-Adaptive Anomaly Detection
Abstract:
We address the problem of anomaly detection in videos. The goal is to identify unusual behaviours automatically by learning exclusively from normal videos. Most existing approaches are usually data-hungry and have limited generalization abilities. They usually need to be trained on a large number of videos from a target scene to achieve good results in that scene. In this paper, we propose a novel few-shot scene-adaptive anomaly detection problem to address the limitations of previous approaches. Our goal is to learn to detect anomalies in a previously unseen scene with only a few frames. A reliable solution for this new problem will have huge potential in real-world applications since it is expensive to collect a massive amount of data for each target scene. We propose a meta-learning based approach for solving this new problem; extensive experimental results demonstrate the effectiveness of our proposed method. "
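The abstract does not spell out its meta-learning procedure; as a generic illustration of the episodic "adapt on a few frames, then evaluate on the rest of the scene" idea, here is a minimal MAML-style loop on a toy linear predictor. Everything here (the predictor, the random data, the learning rates) is a placeholder, not the paper's model or losses.

```python
import torch

def maml_episode(w, support_x, support_y, query_x, query_y, inner_lr=0.01):
    """One MAML-style episode: adapt the linear weights on the support set with
    a single gradient step, then return the query loss computed through the
    adapted weights (so the meta-update differentiates through adaptation)."""
    support_loss = ((support_x @ w - support_y) ** 2).mean()
    (g,) = torch.autograd.grad(support_loss, w, create_graph=True)
    w_adapted = w - inner_lr * g                     # inner-loop adaptation
    return ((query_x @ w_adapted - query_y) ** 2).mean()

# Outer loop: meta-train w so that one adaptation step works well per "scene".
w = torch.zeros(8, 1, requires_grad=True)
opt = torch.optim.Adam([w], lr=1e-2)
for _ in range(100):
    sx, sy = torch.randn(5, 8), torch.randn(5, 1)    # few-shot support set
    qx, qy = torch.randn(20, 8), torch.randn(20, 1)  # query set from same scene
    loss = maml_episode(w, sx, sy, qx, qy)
    opt.zero_grad(); loss.backward(); opt.step()
```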



Paperid:192
Authors:Bindita Chaudhuri, Noranart Vesdapunt, Linda Shapiro, Baoyuan Wang
Title: Personalized Face Modeling for Improved Face Reconstruction and Motion Retargeting
Abstract:
Traditional methods for image-based 3D face reconstruction and facial motion retargeting fit a 3D morphable model (3DMM) to the face, which has limited modeling capacity and fails to generalize well to in-the-wild data. Use of deformation transfer or a multilinear tensor as a personalized 3DMM for blendshape interpolation does not address the fact that facial expressions result in different local and global skin deformations in different persons. Moreover, existing methods learn a single albedo per user, which is not enough to capture the expression-specific skin reflectance variations. We propose an end-to-end framework that jointly learns a personalized face model per user and per-frame facial motion parameters from a large corpus of in-the-wild videos of user expressions. Specifically, we learn user-specific expression blendshapes and dynamic (expression-specific) albedo maps by predicting personalized corrections on top of a 3DMM prior. We introduce novel training constraints to ensure that the corrected blendshapes retain their semantic meanings and the reconstructed geometry is disentangled from the albedo. Experimental results show that our personalization accurately captures fine-grained facial dynamics in a wide range of conditions and efficiently decouples the learned face model from facial motion, resulting in more accurate face reconstruction and facial motion retargeting compared to state-of-the-art methods."



Paperid:193
Authors:Urbano Miguel Nunes, Yiannis Demiris
Title: Entropy Minimisation Framework for Event-based Vision Model Estimation
Abstract:
We propose a novel EMin framework for event-based vision model estimation. The framework extends previous event-based motion compensation algorithms to handle models whose outputs have arbitrary dimensions. The main motivation comes from estimating motion from events directly in 3D space (e. g. events augmented with depth), without projecting them onto an image plane. This is achieved by modelling the event alignment according to candidate parameters and minimising the resultant dispersion. We provide a family of suitable entropy loss functions and an efficient approximation whose complexity is only linear with the number of events (e. g. the complexity does not depend on the number of image pixels). The framework is evaluated on several motion estimation problems, including optical flow and rotational motion. As proof of concept, we also test our framework on 6-DOF estimation by performing the optimisation directly in 3D space."
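As a simplified stand-in for the dispersion-minimisation idea (not the paper's entropy losses or their linear-time approximation), the sketch below warps events with a candidate optical-flow vector and scores alignment by the spatial variance of the warped events; the variance measure, function names, and brute-force search are assumptions of this illustration.

```python
import numpy as np

def event_dispersion(events_xyt, flow):
    """Warp events (x, y, t) to a common reference time with a candidate flow
    (vx, vy) and return the spatial variance of the warped positions; a flow
    that compensates the motion well yields a low dispersion."""
    x, y, t = events_xyt[:, 0], events_xyt[:, 1], events_xyt[:, 2]
    vx, vy = flow
    xw, yw = x - vx * t, y - vy * t                 # motion-compensated positions
    return xw.var() + yw.var()

def best_flow(events_xyt, candidates):
    """Pick the candidate flow with the lowest dispersion (brute-force search)."""
    return min(candidates, key=lambda f: event_dispersion(events_xyt, f))
```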



Paperid:194
Authors:Luyang Zhu, Konstantinos Rematas, Brian Curless, Steven M. Seitz, Ira Kemelmacher-Shlizerman
Title: Reconstructing NBA Players
Abstract:
Great progress has been made in 3D body pose and shape estimation from single photos. Yet, state-of-the-art results still suffer from errors due to challenging body poses, modeling clothing, and self occlusions. The domain of basketball games is particularly challenging, due to all of these factors. In this paper, we introduce a new approach for reconstruction of basketball players that outperforms the state of the art. Key to our approach is a new method for creating poseable, skinned models of NBA players, and a large database of meshes (derived from the NBA2K19 video game) that we are releasing to the research community. Based on these models, we introduce a new method that takes as input a single photo of a clothed player performing any basketball pose and outputs a high resolution mesh and pose of that player. We compare to state-of-the-art methods for shape generation and show significant improvement in the results."



Paperid:195
Authors:Zhiming Chen, Kean Chen, Weiyao Lin, John See, Hui Yu, Yan Ke, Cong Yang
Title: PIoU Loss: Towards Accurate Oriented Object Detection in Complex Environments
Abstract:
Object detection using an oriented bounding box (OBB) can better target rotated objects by reducing the overlap with background areas. Existing OBB approaches are mostly built on horizontal bounding box detectors by introducing an additional angle dimension optimized by a distance loss. However, as the distance loss only minimizes the angle error of the OBB and only loosely correlates with the IoU, it is insensitive to objects with high aspect ratios. Therefore, a novel loss, Pixels-IoU (PIoU) Loss, is formulated to exploit both the angle and IoU for accurate OBB regression. The PIoU loss is derived from the IoU metric in a pixel-wise form, which is simple and suitable for both horizontal and oriented bounding boxes. To demonstrate its effectiveness, we evaluate the PIoU loss on both anchor-based and anchor-free frameworks. The experimental results show that PIoU loss can dramatically improve the performance of OBB detectors, particularly on objects with high aspect ratios and complex backgrounds. Besides, previous evaluation datasets did not include scenarios where the objects have high aspect ratios, hence a new dataset, Retail50K, is introduced to encourage the community to adapt OBB detectors for more complex environments."
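To make the "pixel-wise" view of oriented-box overlap concrete, the snippet below counts grid pixels inside two oriented boxes and forms a hard IoU by counting. The actual PIoU loss replaces this hard membership test with a soft, differentiable kernel; the grid size and function names here are illustrative.

```python
import numpy as np

def inside_obb(xs, ys, box):
    """Hard membership test: are pixel centers inside the oriented box
    (cx, cy, w, h, angle)?  PIoU uses a soft, differentiable version."""
    cx, cy, w, h, a = box
    dx, dy = xs - cx, ys - cy
    u = dx * np.cos(a) + dy * np.sin(a)              # coordinates in the box frame
    v = -dx * np.sin(a) + dy * np.cos(a)
    return (np.abs(u) <= w / 2) & (np.abs(v) <= h / 2)

def pixel_iou(box_a, box_b, grid=512):
    """IoU of two oriented boxes estimated by counting pixels on a grid."""
    ys, xs = np.mgrid[0:grid, 0:grid].astype(np.float64)
    a, b = inside_obb(xs, ys, box_a), inside_obb(xs, ys, box_b)
    inter, union = np.logical_and(a, b).sum(), np.logical_or(a, b).sum()
    return inter / max(union, 1)
```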



Paperid:196
Authors:Sucheng Ren, Chu Han, Xin Yang, Guoqiang Han, Shengfeng He
Title: TENet: Triple Excitation Network for Video Salient Object Detection
Abstract:
In this paper, we propose a simple yet effective approach, named Triple Excitation Network, to reinforce the training of video salient object detection (VSOD) from three aspects: spatial, temporal, and online excitations. These excitation mechanisms are designed following the spirit of curriculum learning and aim to reduce learning ambiguities at the beginning of training by selectively exciting feature activations using ground truth. Then we gradually reduce the weight of ground truth excitations by a curriculum rate and replace them with a curriculum complementary map for better and faster convergence. In particular, the spatial excitation strengthens feature activations for clear object boundaries, while the temporal excitation imposes motions to emphasize spatio-temporal salient regions. Spatial and temporal excitations can combat the saliency shifting problem and the conflict between spatial and temporal features of VSOD. Furthermore, our semi-curriculum learning design enables the first online refinement strategy for VSOD, which allows exciting and boosting saliency responses during testing without re-training. The proposed triple excitations can easily be plugged into different VSOD methods. Extensive experiments show the effectiveness of all three excitation methods, and the proposed method outperforms state-of-the-art image and video salient object detection methods."



Paperid:197
Authors:Wei-Chiu Ma, Shenlong Wang, Jiayuan Gu, Sivabalan Manivasagam, Antonio Torralba, Raquel Urtasun
Title: Deep Feedback Inverse Problem Solver
Abstract:
We present an efficient, effective, and generic approach towards solving inverse problems. The key idea is to leverage the feedback signal provided by the forward process and learn an iterative update model. Specifically, in each iteration, the neural network takes the feedback as input and outputs an update on current estimation. Our approach does not have any restrictions on the forward process; it does not require any prior knowledge either. Through the feedback information, our model not only can produce accurate estimations that are coherent to the input observation but also is capable of recovering from early incorrect predictions. We verify the performance of our model over a wide range of inverse problems, including 6-DOF pose estimation, illumination estimation, as well as inverse kinematics. Comparing to traditional optimization-based methods, we can achieve comparable or better performance while being two to three orders of magnitude faster. Compared to deep learning-based approaches, our model consistently improves the performance on all metrics."
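The iterative feedback loop described above can be written generically as below. `forward_fn` (the known forward process) and `update_net` (the learned update model) are placeholders, and concatenating observed and simulated measurements is just one plausible way to form the feedback input; the paper's actual architecture and training are not reproduced here.

```python
import torch

def feedback_solve(forward_fn, update_net, y_obs, x_init, n_iters=5):
    """Generic iterative inverse-problem loop: simulate the forward process on
    the current estimate, feed the feedback (observed vs. simulated output) to
    a learned network, and apply its predicted update to the estimate."""
    x = x_init
    for _ in range(n_iters):
        y_sim = forward_fn(x)                        # feedback from forward process
        delta = update_net(torch.cat([y_obs, y_sim], dim=-1))
        x = x + delta                                # refine current estimate
    return x
```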



Paperid:198
Authors:Liuyu Xiang, Guiguang Ding, Jungong Han
Title: Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-tailed Classification
Abstract:
In real-world scenarios, data tends to exhibit a long-tailed distribution, which increases the difficulty of training deep networks. In this paper, we propose a novel self-paced knowledge distillation framework, termed Learning From Multiple Experts (LFME). Our method is inspired by the observation that networks trained on less imbalanced subsets of the distribution often yield better performances than their jointly-trained counterparts. We refer to these models as `Experts’, and the proposed LFME framework aggregates the knowledge from multiple `Experts' to learn a unified student model. Specifically, the proposed framework involves two levels of adaptive learning schedules: Self-paced Expert Selection and Curriculum Instance Selection, so that the knowledge is adaptively transferred to the `Student'. We conduct extensive experiments and demonstrate that our method is able to achieve superior performances compared to state-of-the-art methods. We also show that our method can be easily plugged into state-of-the-art long-tailed classification algorithms for further improvements."
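As context for the distillation component, the sketch below shows a plain multi-teacher knowledge-distillation loss that averages the experts' softened predictions. LFME's self-paced expert selection and curriculum instance selection are not modeled here, and the temperature and weighting are illustrative defaults.

```python
import torch
import torch.nn.functional as F

def multi_expert_distillation_loss(student_logits, expert_logits_list, labels,
                                   T=2.0, alpha=0.5):
    """Generic KD baseline: KL divergence between the student's softened
    predictions and the average of the experts' softened predictions, plus the
    usual cross-entropy on ground-truth labels."""
    soft_targets = torch.stack(
        [F.softmax(t / T, dim=1) for t in expert_logits_list]).mean(0)
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```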



Paperid:199
Authors:Jiayan Qiu, Yiding Yang, Xinchao Wang, Dacheng Tao
Title: Hallucinating Visual Instances in Total Absentia
Abstract:
In this paper, we investigate a new visual restoration task, termed as hallucinating visual instances in total absentia (HVITA). Unlike conventional image inpainting task that works on images with only part of a visual instance missing, HVITA concerns scenarios where an object is completely absent from the scene. This seemingly minor difference in fact makes the HVITA a much challenging task, as the restoration algorithm would have to not only infer the category of the object in total absentia, but also hallucinate an object of which the appearance is consistent with the background. Towards solving HVITA, we propose an end-to-end deep approach that explicitly looks into the global semantics within the image. Specifically, we transform the input image to a semantic graph, wherein each node corresponds to a detected object in the scene. We then adopt a Graph Convolutional Network on top of the scene graph to estimate the category of the missing object in the masked region, and finally introduce a Generative Adversarial Module to carry out the hallucination. Experiments on COCO, Visual Genome and NYU Depth v2 datasets demonstrate that the proposed approach yields truly encouraging and visually plausible results."



Paperid:200
Authors:Jiayuan Gu, Wei-Chiu Ma, Sivabalan Manivasagam, Wenyuan Zeng, Zihao Wang, Yuwen Xiong, Hao Su, Raquel Urtasun
Title: Weakly-supervised 3D Shape Completion in the Wild
Abstract:
3D shape completion for real data is important but challenging, since partial point clouds acquired by real-world sensors are usually sparse, noisy and unaligned. Different from previous methods, we address the problem of learning 3D complete shape from unaligned and real-world partial point clouds. To this end, we propose an unsupervised method to estimate both 3D canonical shape and 6-DoF pose for alignment, given multiple partial observations associated with the same instance. The network jointly optimizes canonical shapes and poses with multi-view geometry constraints during training, and can infer the complete shape given a single partial point cloud. Moreover, learned pose estimation can facilitate partial point cloud registration. Experiments on both synthetic and real data show that it is feasible and promising to learn 3D shape completion through large-scale data without shape and pose supervision."



Paperid:201
Authors:Jiangning Zhang, Chao Xu, Liang Liu, Mengmeng Wang, Xia Wu, Yong Liu, Yunliang Jiang
Title: DTVNet: Dynamic Time-lapse Video Generation via Single Still Image
Abstract:
This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: \emph{Optical Flow Encoder} (OFE) and \emph{Dynamic Video Generator} (DVG). The OFE maps a sequence of optical flow maps to a \emph{normalized motion vector} that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the \emph{motion stream} introduces multiple \emph{adaptive instance normalization} (AdaIN) layers to integrate multi-level motion information that are processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different \emph{normalized motion vectors} based on only one input image. We further conduct experiments on Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information."



Paperid:202
Authors:Lijun Wang, Jianming Zhang, Yifan Wang, Huchuan Lu, Xiang Ruan
Title: CLIFFNet for Monocular Depth Estimation with Hierarchical Embedding Loss
Abstract:
This paper proposes a hierarchical loss for monocular depth estimation, which measures the differences between the prediction and ground truth in hierarchical embedding spaces of depth maps. In order to find an appropriate embedding space, we design different architectures for hierarchical embedding generators (HEGs) and explore relevant tasks to train their parameters. Compared to conventional depth losses manually defined on a per-pixel basis, the proposed hierarchical loss can be learned in a data-driven manner. As verified by our experiments, the hierarchical loss even learned without additional labels can capture multi-scale context information, is more robust to local outliers, and thus delivers superior performance. To further improve depth accuracy, a cross level identity feature fusion network (CLIFFNet) is proposed, where low-level features with finer details are refined using more reliable high-level cues. Through end-to-end training, CLIFFNet can learn to select the optimal combinations of low-level and high-level features, leading to more effective cross level feature fusion. When trained using the proposed hierarchical loss, CLIFFNet sets a new state of the art on popular depth estimation benchmarks."
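A minimal version of "measure the error in embedding spaces of depth maps" might look like the following, where `embedders` stands in for the paper's hierarchical embedding generators (HEGs). How those generators are designed and trained is exactly what the paper studies and is not reproduced here; the per-level weights and the L1 distance are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def hierarchical_embedding_loss(pred_depth, gt_depth, embedders, weights=None):
    """Compare prediction and ground truth in several embedding spaces rather
    than only per pixel: each embedder maps a depth map to a feature map, and
    the L1 differences at every level are accumulated."""
    weights = weights or [1.0] * len(embedders)
    loss = F.l1_loss(pred_depth, gt_depth)           # conventional per-pixel term
    for w, emb in zip(weights, embedders):
        loss = loss + w * F.l1_loss(emb(pred_depth), emb(gt_depth))
    return loss
```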



Paperid:203
Authors:Zongxin Yang, Yunchao Wei, Yi Yang
Title: Collaborative Video Object Segmentation by Foreground-Background Integration
Abstract:
This paper investigates the principles of embedding learning to tackle the challenging semi-supervised video object segmentation. Different from previous practices that only explore the embedding learning using pixels from foreground object (s), we consider background should be equally treated and thus propose Collaborative video object segmentation by Foreground-Background Integration (CFBI) approach. Our CFBI implicitly imposes the feature embedding from the target foreground object and its corresponding background to be contrastive, promoting the segmentation results accordingly. With the feature embedding from both foreground and background, our CFBI performs the matching process between the reference and the predicted sequence from both pixel and instance levels, making the CFBI be robust to various object scales. We conduct extensive experiments on three popular benchmarks, ie, DAVIS 2016, DAVIS 2017, and YouTube-VOS. Our CFBI achieves the performance (J&F) of 89.4%, 81.9%, and 81.4%, respectively, outperforming all the other state-of-the-art methods. Code: \url{https://github.com/z-x-yang/CFBI}."



Paperid:204
Authors:Titir Dutta, Anurag Singh, Soma Biswas
Title: Adaptive Margin Diversity Regularizer for handling Data Imbalance in Zero-Shot SBIR
Abstract:
Data from new categories are continuously being discovered, which has sparked a significant amount of research in developing approaches that generalize to previously unseen categories, i.e. the zero-shot setting. Zero-shot sketch-based image retrieval (ZS-SBIR) is one such problem in the context of cross-domain retrieval, which has received a lot of attention due to its various real-life applications. Since most real-world training data have a fair amount of imbalance, in this work, for the first time in the literature, we extensively study the effect of training data imbalance on the generalization to unseen categories, with ZS-SBIR as the application area. We evaluate several state-of-the-art data imbalance mitigating techniques and analyze their results. Furthermore, we propose a novel framework AMDReg (Adaptive Margin Diversity Regularizer), which ensures that the embeddings of the sketch and images in the latent space are not only semantically meaningful, but also separated according to their class representations in the training set. The proposed approach is model-independent, and it can be incorporated seamlessly with several state-of-the-art ZS-SBIR methods to improve their performance under imbalanced conditions. Extensive experiments and analysis justify the effectiveness of the proposed AMDReg for mitigating the effect of data imbalance for generalization to unseen classes in ZS-SBIR."



Paperid:205
Authors:Xucong Zhang, Seonwook Park, Thabo Beeler, Derek Bradley, Siyu Tang , Otmar Hilliges
Title: ETH-XGaze: A Large Scale Dataset for Gaze Estimation under Extreme Head Pose and Gaze Variation
Abstract:
Gaze estimation is a fundamental task in many applications of computer vision, human computer interaction and robotics. Many state-of-the-art methods are trained and tested on custom datasets, making comparison across methods challenging. Furthermore, existing gaze estimation datasets have limited head pose and gaze variations, and the evaluations are conducted using different protocols and metrics. In this paper, we propose a new gaze estimation dataset called ETH-XGaze, consisting of over one million high-resolution images of varying gaze under extreme head poses. We collect this dataset from 110 participants with a custom hardware setup including 18 digital SLR cameras and adjustable illumination conditions, and a calibrated system to record ground truth gaze targets. We show that our dataset can significantly improve the robustness of gaze estimation methods across different head poses and gaze angles. Additionally, we define a standardized experimental protocol and evaluation metric on ETH-XGaze, to better unify gaze estimation research going forward. The dataset and benchmark website are available at https://ait.ethz.ch/projects/2020/ETH-XGaze"



Paperid:206
Authors:Viktor Larsson, Nicolas Zobernig, Kasim Taskin, Marc Pollefeys
Title: Calibration-free Structure-from-Motion with Calibrated Radial Trifocal Tensors
Abstract:
In this paper we consider the problem of Structure-from-Motion from images with unknown intrinsic calibration. Instead of estimating the internal camera parameters through some self-calibration procedure, we propose to use a subset of the reprojection constraints that is invariant to radial displacement. This allows us to recover metric 3D reconstructions without explicitly estimating the cameras' focal length or radial distortion parameters. The weaker projection model makes initializing the reconstruction especially difficult. To handle this additional challenge we propose two novel minimal solvers for radial trifocal tensor estimation. We evaluate our approach on real images and show that even for extreme optical systems, such as fisheye or catadioptric, we are able to get accurate reconstructions without performing any calibration."



Paperid:207
Authors:Santhosh K. Ramakrishnan, Ziad Al-Halah, Kristen Grauman
Title: Occupancy Anticipation for Efficient Exploration and Navigation
Abstract:
State-of-the-art navigation methods leverage a spatial memory to generalize to new environments, but their occupancy maps are limited to capturing the geometric structures directly observed by the agent. We propose occupancy anticipation, where the agent uses its egocentric RGB-D observations to infer the occupancy state beyond the visible regions. In doing so, the agent builds its spatial awareness more rapidly, which facilitates efficient exploration and navigation in 3D environments. By exploiting context in both the egocentric views and top-down maps our model successfully anticipates a broader map of the environment, with performance significantly better than strong baselines. Furthermore, when deploying our model for the sequential decision-making tasks of exploration and navigation, we outperform state-of-the-art methods on the Gibson and Matterport3D datasets. Our approach is the winning entry in the 2020 Habitat PointNav Challenge. Project page: http://vision.cs.utexas.edu/projects/occupancy_anticipation"



Paperid:208
Authors:Richard Droste, Jianbo Jiao, J. Alison Noble
Title: Unified Image and Video Saliency Modeling
Abstract:
Visual saliency modeling for images and videos is treated as two independent tasks in recent computer vision literature. While image saliency modeling is a well-studied problem and progress on benchmarks like SALICON and MIT300 is slowing, video saliency models have shown rapid gains on the recent DHF1K benchmark. Here, we take a step back and ask: Can image and video saliency modeling be approached via a unified model, with mutual benefit? We identify different sources of domain shift between image and video saliency data and between different video saliency datasets as a key challenge for effective joint modelling. To address this we propose four novel domain adaptation techniques - Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing and Bypass-RNN - in addition to an improved formulation of learned Gaussian priors. We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data. We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300. With one set of parameters, UNISAL achieves state-of-the-art performance on all video saliency datasets and is on par with the state-of-the-art for image saliency datasets, despite faster runtime and a 5 to 20-fold smaller model size compared to all competing deep methods. We provide retrospective analyses and ablation studies which confirm the importance of the domain shift modeling. The code is available at https://github.com/rdroste/unisal."



Paperid:209
Authors:Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan
Title: TAO: A Large-Scale Benchmark for Tracking Any Object
Abstract:
For many years, multi-object tracking benchmarks have focused on a handful of categories. Motivated primarily by surveillance and self-driving applications, these datasets provide tracks for people, vehicles, and animals, ignoring the vast majority of objects in the world. By contrast, in the related field of object detection, the introduction of large-scale, diverse datasets (e.g., COCO) have fostered significant progress in developing highly robust solutions. To bridge this gap, we introduce a similarly diverse dataset for Tracking Any Object (TAO). It consists of 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average. Importantly, we adopt a bottom-up approach for discovering a large vocabulary of 833 categories, an order of magnitude more than prior tracking benchmarks. To this end, we ask annotators to label objects that move at any point in the video, and give names to them post factum. Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets. To ensure scalability of annotation, we employ a federated approach that focuses manual effort on labeling tracks for those relevant objects in a video (e.g., those that move). We perform an extensive evaluation of state-of-the-art trackers and make a number of important discoveries regarding large-vocabulary tracking in an open-world. In particular, we show that existing single- and multi-object trackers struggle when applied to this scenario in the wild, and that detection-based, multi-object trackers are in fact competitive with user-initialized ones. We hope that our dataset and analysis will boost further progress in the tracking community."



Paperid:210
Authors:Jonathan T. Barron
Title: A Generalization of Otsu’s Method and Minimum Error Thresholding
Abstract:
We present Generalized Histogram Thresholding (GHT), a simple, fast, and effective technique for histogram-based image thresholding. GHT works by performing approximate maximum a posteriori estimation of a mixture of Gaussians with appropriate priors. We demonstrate that GHT subsumes three classic thresholding techniques as special cases: Otsu's method, Minimum Error Thresholding (MET), and weighted percentile thresholding. GHT thereby enables the continuous interpolation between those three algorithms, which allows thresholding accuracy to be improved significantly. GHT also provides a clarifying interpretation of the common practice of coarsening a histogram's bin width during thresholding. We show that GHT outperforms or matches the performance of all algorithms on a recent challenge for handwritten document image binarization (including deep neural networks trained to produce per-pixel binarizations), and can be implemented in a dozen lines of code or as a trivial modification to Otsu's method or MET."
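For reference, here is the classic Otsu criterion that the abstract says GHT subsumes as a special case. This is plain Otsu's method, not the GHT algorithm itself: pick the histogram threshold that maximizes the between-class variance.

```python
import numpy as np

def otsu_threshold(counts):
    """Return the histogram bin index that maximizes between-class variance
    (Otsu's method) for a 1-D histogram of pixel intensities `counts`."""
    counts = np.asarray(counts, dtype=np.float64)
    bins = np.arange(counts.size)
    total = counts.sum()
    w0 = np.cumsum(counts)                            # weight of class 0 per threshold
    w1 = total - w0                                   # weight of class 1
    mu0 = np.cumsum(counts * bins) / np.maximum(w0, 1e-12)
    mu1 = (np.sum(counts * bins) - np.cumsum(counts * bins)) / np.maximum(w1, 1e-12)
    between_var = w0 * w1 * (mu0 - mu1) ** 2          # between-class variance
    return int(np.argmax(between_var))
```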



Paperid:211
Authors:Unnat Jain, Luca Weihs, Eric Kolve, Ali Farhadi, Svetlana Lazebnik, Aniruddha Kembhavi, Alexander Schwing
Title: A Cordial Sync: Going Beyond Marginal Policies for Multi-Agent Embodied Tasks
Abstract:
Autonomous agents must learn to collaborate. It is not scalable to develop a new centralized agent every time a task’s difficulty outpaces a single agent’s abilities. While multi-agent collaboration research has flourished in gridworld-like environments, relatively little work has considered visually rich domains. Addressing this, we introduce the novel task FurnMove in which agents work together to move a piece of furniture through a living room to a goal. Unlike existing tasks, FurnMove requires agents to coordinate at every timestep. We identify two challenges when training agents to complete FurnMove: existing decentralized action sampling procedures do not permit expressive joint action policies and, in tasks requiring close coordination, the number of failed actions dominates successful actions. To confront these challenges we introduce SYNC-policies (synchronize your actions coherently) and CORDIAL (coordination loss). Using SYNC-policies and CORDIAL, our agents achieve a 58% completion rate on FurnMove, an impressive absolute gain of 25 percentage points over competitive decentralized baselines. Our dataset, code, and pretrained models are available at https://unnat.github.io/cordial-sync ."



Paperid:212
Authors:Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, Neil Houlsby
Title: Big Transfer (BiT): General Visual Representation Learning
Abstract:
Transfer of pre-trained representations improves sample efficiency and simplifies hyperparameter tuning when training deep neural networks for vision. We revisit the paradigm of pre-training on large supervised datasets and fine-tuning the model on a target task. We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). By combining a few carefully selected components, and transferring using a simple heuristic, we achieve strong performance on over 20 datasets. BiT performs well across a surprisingly wide range of data regimes — from 1 example per class to 1 M total examples. BiT achieves 87.5% top-1 accuracy on ILSVRC-2012, 99.4% on CIFAR-10, and 76.3% on the 19 task Visual Task Adaptation Benchmark (VTAB). On small datasets, BiT attains 76.8% on ILSVRC-2012 with 10 examples per class, and 97.0% on CIFAR-10 with 10 examples per class. We conduct detailed analysis of the main components that lead to high transfer performance."
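The transfer recipe itself is the paper's contribution and is not reproduced here. As a generic illustration of the pre-train-then-fine-tune paradigm the abstract revisits, the sketch below swaps the head of an ImageNet-pretrained torchvision ResNet-50 (a stand-in for a BiT backbone) and fine-tunes all weights; the class count, learning rate, and optimizer settings are placeholders, not the BiT hyper-parameter rule.

```python
import torch
import torchvision

# Generic fine-tuning of a pre-trained backbone on a small target task
# (10 classes assumed); weights are downloaded on first use.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # new task-specific head
optimizer = torch.optim.SGD(model.parameters(), lr=3e-3, momentum=0.9)

def finetune_step(images, labels):
    """One supervised fine-tuning step on a batch from the target dataset."""
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```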



Paperid:213
Authors:Jae Sung Park, Chandra Bhagavatula, Roozbeh Mottaghi, Ali Farhadi, Yejin Choi
Title: VisualCOMET: Reasoning about the Dynamic Context of a Still Image
Abstract:
Even from a single frame of a still image, people can reason about the dynamic story of the image before, after, and beyond the frame. For example, given an image of a man struggling to stay afloat in water, we can reason that the man fell into the water sometime in the past, the intent of that man at the moment is to stay alive, and he will need help in the near future or else he will get washed away. We propose Visual COMET, the novel framework of visual common-sense reasoning tasks to predict events that might have happened before, events that might happen next, and the intents of the people at present. To support research toward visual commonsense reasoning, we introduce the first large-scale repository of Visual Commonsense Graphs that consists of over 1.4 million textual descriptions of visual commonsense inferences carefully annotated over a diverse set of 59,000 images, each paired with short video summaries of before and after. In addition, we provide person-grounding (i.e., co-reference links) between people appearing in the image and people mentioned in the textual commonsense descriptions, allowing for tighter integration between images and text. We establish strong baseline performances on this task and demonstrate that integration between visual and textual commonsense reasoning is the key and wins over non-integrative alternatives."



Paperid:214
Authors:Hongguang Zhang, Li Zhang, Xiaojuan Qi, Hongdong Li, Philip H. S. Torr, Piotr Koniusz
Title: Few-shot Action Recognition with Permutation-invariant Attention
Abstract:
Many few-shot learning models focus on recognising images. In contrast, we tackle a challenging task of few-shot action recognition from videos. We build on a C3D encoder for spatio-temporal video blocks to capture short-range action patterns. Such encoded blocks are aggregated by permutation-invariant pooling to make our approach robust to varying action lengths and long-range temporal dependencies whose patterns are unlikely to repeat even in clips of the same class. Subsequently, the pooled representations are combined into simple relation descriptors which encode so-called query and support clips. Finally, relation descriptors are fed to the comparator with the goal of similarity learning between query and support clips. Importantly, to re-weight block contributions during pooling, we exploit spatial and temporal attention modules and self-supervision. In naturalistic clips (of the same class) there exists a temporal distribution shift--the locations of discriminative temporal action hotspots vary. Thus, we permute blocks of a clip and align the resulting attention regions with similarly permuted attention regions of non-permuted clip to train the attention mechanism invariant to block (and thus long-term hotspot) permutations. Our method outperforms the state of the art on the HMDB51, UCF101, miniMIT datasets."



Paperid:215
Authors:Youngjae Yu, Jongseok Kim, Heeseung Yun, Jiwan Chung, Gunhee Kim
Title: Character Grounding and Re-Identification in Story of Videos and Text Descriptions
Abstract:
We address character grounding and re-identification in multiple story-based videos like movies and associated text descriptions. In order to solve these related tasks in a mutually rewarding way, we propose a model named Character in Story Identification Network (CiSIN). Our method builds two semantically informative representations via joint training of multiple objectives for character grounding, video/text re-identification and gender prediction: Visual Track Embedding from videos and Textual Character Embedding from text context. These two representations are learned to retain rich semantic multimodal information that enables even simple MLPs to achieve the state-of-the-art performance on the target tasks. More specifically, our CiSIN model achieves the best performance in the Fill-in the Characters task of LSMDC 2019 challenges. Moreover, it outperforms previous state-of-the-art models in M-VAD Names dataset as a benchmark of multimodal character grounding and re-identification."



Paperid:216
Authors:Wenshuo Ma, Tingzhong Tian, Hang Xu, Yimin Huang, Zhenguo Li
Title: AABO: Adaptive Anchor Box Optimization for Object Detection via Bayesian Sub-sampling
Abstract:
Most state-of-the-art object detection systems follow an anchor-based paradigm. Anchor boxes are densely proposed over the images and the network is trained to predict the boxes' position offsets as well as the classification confidence. Existing systems pre-define anchor box shapes and sizes, and ad-hoc heuristic adjustments are used to define the anchor configurations. However, this might be sub-optimal or even wrong when a new dataset or a new model is adopted. In this paper, we study the problem of automatically optimizing anchor boxes for object detection. We first demonstrate that the number of anchors, anchor scales, and ratios are crucial factors for a reliable object detection system. By carefully analyzing the existing bounding box patterns on the feature hierarchy, we design a flexible and tight hyper-parameter space for anchor configurations. Then we propose a novel hyper-parameter optimization method named AABO to determine more appropriate anchor boxes for a certain dataset, in which Bayesian Optimization and a sub-sampling method are combined to achieve precise and efficient anchor configuration optimization. Experiments demonstrate the effectiveness of our proposed method on different detectors and datasets, e.g. achieving around 2.4% mAP improvement on COCO, 1.6% on ADE and 1.5% on VG, and the optimal anchors can bring 1.4% ∼ 2.4% mAP improvement on SOTA detectors by only optimizing anchor configurations, e.g. boosting Mask R-CNN from 40.3% to 42.3% and the HTC detector from 46.8% to 48.2%."



Paperid:217
Authors:Minchul Kim, Jongchan Park, Seil Na, Chang Min Park, Donggeun Yoo
Title: Learning Visual Context by Comparison
Abstract:
Finding diseases from an X-ray image is an important yet highly challenging task. Current methods for solving this task exploit various characteristics of the chest X-ray image, but one of the most important characteristics is still missing: the necessity of comparison between related regions in an image. In this paper, we present Attend-and-Compare Module (ACM) for capturing the difference between an object of interest and its corresponding context. We show that explicit difference modeling can be very helpful in tasks that require direct comparison between locations from afar. This module can be plugged into existing deep learning models. For evaluation, we apply our module to three chest X-ray recognition tasks and COCO object detection & segmentation tasks and observe consistent improvements across tasks. The code is available at https://github.com/mk-minchul/attend-and-compare."



Paperid:218
Authors:Ali Diba, Mohsen Fayyaz, Vivek Sharma, Manohar Paluri, Jürgen Gall, Rainer Stiefelhagen, Luc Van Gool
Title: Large Scale Holistic Video Understanding
Abstract:
Video recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition - focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill this gap by presenting a large-scale "Holistic Video Understanding Dataset" (HVU). HVU is organized hierarchically in a semantic taxonomy that focuses on multi-label and multi-task video understanding as a comprehensive problem that encompasses the recognition of multiple semantic aspects in the dynamic scene. HVU contains approximately 572k videos in total with 9 million annotations for the training, validation, and test sets, spanning over 3142 labels. HVU encompasses semantic aspects defined on categories of scenes, objects, actions, events, attributes, and concepts which naturally capture real-world scenarios. We demonstrate the generalisation capability of HVU on three challenging tasks: 1) video classification, 2) video captioning, and 3) video clustering. In particular, for video classification, we introduce a new spatio-temporal deep neural network architecture called "Holistic Appearance and Temporal Network" (HATNet) that builds on fusing 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues. HATNet focuses on the multi-label and multi-task learning problem and is trained in an end-to-end manner. The experiments show that HATNet trained on HVU outperforms current state-of-the-art methods on challenging human action datasets: HMDB51, UCF101, and Kinetics. The dataset and codes will be made publicly available. Via our experiments, we validate the idea that holistic representation learning is complementary, and can play a key role in enabling many real-world applications."



Paperid:219
Authors:Krishna Kanth Nakka, Mathieu Salzmann
Title: Indirect Local Attacks for Context-aware Semantic Segmentation Networks
Abstract:
Recently, deep networks have achieved impressive semantic segmentation performance, in particular thanks to their use of larger contextual information. In this paper, we show that the resulting networks are sensitive not only to global adversarial attacks, where perturbations affect the entire input image, but also to indirect local attacks, where the perturbations are confined to a small image region that does not overlap with the area that the attacker aims to fool. To this end, we introduce an indirect attack strategy, namely adaptive local attacks, aiming to find the best image location to perturb, while preserving the labels at this location and producing a realistic-looking segmentation map. Furthermore, we propose attack detection techniques both at the global image level and to obtain a pixel-wise localization of the fooled regions. Our results are unsettling: Because they exploit a larger context, more accurate semantic segmentation networks are more sensitive to indirect local attacks. We believe that our comprehensive analysis will motivate the community to design architectures with contextual dependencies that do not trade off robustness for accuracy."



Paperid:220
Authors:Anita Rau, Guillermo Garcia-Hernando, Danail Stoyanov, Gabriel J. Brostow, Daniyar Turmukhambetov
Title: Predicting Visual Overlap of Images Through Interpretable Non-Metric Box Embeddings
Abstract:
To what extent are two images picturing the same 3D surfaces? Even when this is a known scene, the answer typically requires an expensive search across scale space, with matching and geometric verification of large sets of local features. This expense is further multiplied when a query image is evaluated against a gallery, e.g. in visual relocalization. While we don't obviate the need for geometric verification, we propose an interpretable image embedding that cuts the search in scale space down to essentially a lookup. Our approach measures the asymmetric distance between two images. The model then learns a scene-specific measure of similarity from training examples with known 3D visible-surface overlaps. The result is that we can quickly identify, for example, which test image is a close-up version of another, and by what scale factor. Subsequently, local features need only be detected at that scale. We validate our scene-specific model by showing how this embedding yields competitive image-matching results, while being simpler, faster, and also interpretable by humans."



Paperid:221
Authors:Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, Vittorio Ferrari
Title: Connecting Vision and Language with Localized Narratives
Abstract:
We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning."



Paperid:222
Authors:Kaidi Xu, Gaoyuan Zhang, Sijia Liu, Quanfu Fan, Mengshu Sun, Hongge Chen, Pin-Yu Chen, Yanzhi Wang, Xue Lin
Title: Adversarial T-shirt! Evading Person Detectors in A Physical World
Abstract:
It is known that deep neural networks (DNNs) are vulnerable to adversarial attacks. The so-called physical adversarial examples deceive DNN-based decision makers by attaching adversarial patches to real objects. However, most of the existing works on physical adversarial attacks focus on static objects such as glass frames, stop signs and images attached to cardboard. In this work, we propose adversarial T-shirts, a robust physical adversarial example for evading person detectors even when it undergoes non-rigid deformation due to a moving person’s pose changes. To the best of our knowledge, this is the first work that models the effect of deformation for designing physical adversarial examples with respect to non-rigid objects such as T-shirts. We show that the proposed method achieves 74% and 57% attack success rates in the digital and physical worlds respectively against YOLOv2. In contrast, the state-of-the-art physical attack method to fool a person detector only achieves an 18% attack success rate. Furthermore, by leveraging min-max optimization, we extend our method to the ensemble attack setting against the two object detectors YOLOv2 and Faster R-CNN simultaneously."



Paperid:223
Authors:Sho Inayoshi, Keita Otani, Antonio Tejero-de-Pablos, Tatsuya Harada
Title: Bounding-box Channels for Visual Relationship Detection
Abstract:
Recognizing the relationship between multiple objects in an image is essential for a deeper understanding of the meaning of the image. However, current visual recognition methods are still far from reaching human-level accuracy. Recent approaches have tackled this task by combining image features with semantic and spatial features, but the way they relate them to each other is weak, mostly because the spatial context in the image features is lost. In this paper, we propose bounding-box channels, a novel architecture capable of relating the semantic, spatial, and image features strongly. Our network learns bounding-box channels, which are initialized according to the positions and labels of the objects and concatenated to the image features extracted from the objects. Then, they are input together to the relationship estimator. Our method can retain the spatial information in the image features and strongly associate it with the semantic and spatial features. This way, our method is capable of effectively emphasizing the features in the object area for better modeling of the relationships between objects. In addition, we experimentally show that our bounding-box channels have a high generalization ability. Our evaluation results show the efficacy of our architecture, outperforming previous works in visual relationship detection."
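For readers who want a concrete picture of what a bounding-box channel might look like, here is a minimal sketch under simple assumptions (one binary channel per object and class, boxes given in feature-map coordinates); the paper's exact initialization may differ:

import torch

def bounding_box_channels(boxes, labels, num_classes, height, width):
    # boxes:  (K, 4) boxes as (x1, y1, x2, y2) in feature-map coordinates.
    # labels: (K,) integer class ids for each box.
    # Returns (K * num_classes, H, W): for each object, the channel of its
    # class is filled inside its box; all other channels stay zero.
    chans = torch.zeros(len(boxes) * num_classes, height, width)
    for k, ((x1, y1, x2, y2), c) in enumerate(zip(boxes.tolist(), labels.tolist())):
        chans[k * num_classes + c, int(y1):int(y2), int(x1):int(x2)] = 1.0
    return chans

# Hypothetical usage: concatenate with CNN features of the same spatial size
# before feeding the relationship estimator.
img_feats = torch.randn(256, 32, 32)
boxes = torch.tensor([[2.0, 3.0, 10.0, 12.0], [15.0, 5.0, 30.0, 20.0]])
labels = torch.tensor([1, 4])
bb = bounding_box_channels(boxes, labels, num_classes=8, height=32, width=32)
fused = torch.cat([img_feats, bb], dim=0)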



Paperid:224
Authors:Zuzana Kukelova, Cenek Albl, Akihiro Sugimoto, Konrad Schindler, Tomas Pajdla
Title: Minimal Rolling Shutter Absolute Pose with Unknown Focal Length and Radial Distortion
Abstract:
The internal geometry of most modern consumer cameras is not adequately described by the perspective projection. Almost all cameras exhibit some radial lens distortion and are equipped with electronic rolling shutter that induces distortions when the camera moves during the image capture. When focal length has not been calibrated offline, the parameters that describe the radial and rolling shutter distortions are usually unknown. While for global shutter cameras, minimal solvers for the absolute camera pose and unknown focal length and radial distortion are available, solvers for the rolling shutter were missing. We present the first minimal solutions for the absolute pose of a rolling shutter camera with unknown rolling shutter parameters, focal length, and radial distortion. Our new minimal solvers combine iterative schemes designed for calibrated rolling shutter cameras with fast generalized eigenvalue and Groebner basis solvers. In a series of experiments, with both synthetic and real data, we show that our new solvers provide accurate estimates of the camera pose, rolling shutter parameters, focal length, and radial distortion parameters."



Paperid:225
Authors:Andreas Lugmayr, Martin Danelljan, Luc Van Gool, Radu Timofte
Title: SRFlow: Learning the Super-Resolution Space with Normalizing Flow
Abstract:
Super-resolution is an ill-posed problem, since it allows for multiple predictions for a given low-resolution image. This fundamental fact is largely ignored by state-of-the-art deep learning based approaches. These methods instead train a deterministic mapping using combinations of reconstruction and adversarial losses. In this work, we therefore propose SRFlow: a normalizing flow based super-resolution method capable of learning the conditional distribution of the output given the low-resolution input. Our model is trained in a principled manner using a single loss, namely the negative log-likelihood. SRFlow therefore directly accounts for the ill-posed nature of the problem, and learns to predict diverse photo-realistic high-resolution images. Moreover, we utilize the strong image posterior learned by SRFlow to design flexible image manipulation techniques, capable of enhancing super-resolved images by, e.g., transferring content from other images. We perform extensive experiments on faces, as well as on super-resolution in general. SRFlow outperforms state-of-the-art GAN-based approaches in terms of both PSNR and perceptual quality metrics, while allowing for diversity through the exploration of the space of super-resolved solutions."



Paperid:226
Authors:Wentao Yuan, Benjamin Eckart, Kihwan Kim, Varun Jampani, Dieter Fox , Jan Kautz
Title: DeepGMR: Learning Latent Gaussian Mixture Models for Registration
Abstract:
Point cloud registration is a fundamental problem in 3D computer vision, graphics and robotics. For the last few decades, existing registration algorithms have struggled in situations with large transformations, noise, and time constraints. In this paper, we introduce Deep Gaussian Mixture Registration (DeepGMR), the first learning-based registration method that explicitly leverages a probabilistic registration paradigm by formulating registration as the minimization of the KL-divergence between two probability distributions modeled as mixtures of Gaussians. We design a neural network that extracts pose-invariant correspondences between raw point clouds and Gaussian Mixture Model (GMM) parameters, and two differentiable compute blocks that recover the optimal transformation from matched GMM parameters. This construction allows the network to learn an SE(3)-invariant feature space, producing a global registration method that is real-time, generalizable, and robust to noise. Across synthetic and real-world data, our proposed method shows favorable performance when compared with state-of-the-art geometry-based and learning-based registration methods."
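The abstract mentions differentiable blocks that recover the optimal transformation from matched GMM parameters. One standard closed-form way to do this for matched, weighted means is a weighted Procrustes (Kabsch) step, sketched below; this is a generic illustration, not necessarily the exact block used in DeepGMR:

import numpy as np

def weighted_procrustes(mu_src, mu_tgt, weights):
    # mu_src, mu_tgt: (J, 3) matched component means in the source/target clouds.
    # weights:        (J,) non-negative weights (e.g., mixture weights).
    # Returns (R, t) with R a proper rotation such that R @ p + t maps source to target.
    w = weights / weights.sum()
    c_src = (w[:, None] * mu_src).sum(0)
    c_tgt = (w[:, None] * mu_tgt).sum(0)
    H = (w[:, None] * (mu_src - c_src)).T @ (mu_tgt - c_tgt)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = c_tgt - R @ c_src
    return R, t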



Paperid:227
Authors:Siddharth Ancha, Yaadhav Raaj, Peiyun Hu, Srinivasa G. Narasimhan, David Held
Title: Active Perception using Light Curtains for Autonomous Driving
Abstract:
Most real-world 3D sensors such as LiDARs are passive, meaning that they sense the entire environment, while being decoupled from the recognition system that processes the sensor data. In this work, we propose a method for 3D object recognition using light curtains, a resource-efficient active sensor that measures depth at selected locations in the environment in a controllable manner. Crucially, we propose using prediction uncertainty of a deep learning based 3D point cloud detector to guide active sensing. Given a neural network's uncertainty, we develop a novel optimization algorithm to optimally place light curtains to maximize coverage of uncertain regions. Efficient optimization is achieved by encoding the physical constraints of the device into a constraint graph, which is optimized with dynamic programming. We show how a 3D detector can be trained to detect objects in a scene by sequentially placing uncertainty-guided light curtains to successively improve detection accuracy."



Paperid:228
Authors:Zhe Chen, Shohei Nobuhara, Ko Nishino
Title: Invertible Neural BRDF for Object Inverse Rendering
Abstract:
We introduce a novel neural network-based BRDF model and a Bayesian framework for object inverse rendering, i.e., joint estimation of reflectance and natural illumination from a single image of an object of known geometry. The BRDF is expressed with an invertible neural network, namely, normalizing flow, which provides the expressive power of a high-dimensional representation, computational simplicity of a compact analytical model, and physical plausibility of a real-world BRDF. We extract the latent space of real-world reflectance by conditioning this model, which directly results in a strong reflectance prior. We refer to this model as the invertible neural BRDF model (iBRDF). We also devise a deep illumination prior by leveraging the structural bias of deep neural networks. By integrating this novel BRDF model and reflectance and illumination priors in a MAP estimation formulation, we show that this joint estimation can be computed efficiently with stochastic gradient descent. We experimentally validate the accuracy of the invertible neural BRDF model on a large number of measured data and demonstrate its use in object inverse rendering on a number of synthetic and real images. The results show new ways in which deep neural networks can help solve challenging radiometric inverse problems."



Paperid:229
Authors:Wenfeng Luo, Meng Yang
Title: Semi-supervised Semantic Segmentation via Strong-weak Dual-branch Network
Abstract:
While existing works have explored a variety of techniques to push the envelope of weakly-supervised semantic segmentation, there is still a significant gap compared to supervised methods. In real-world applications, besides a massive amount of weakly-supervised data, there are usually a few pixel-level annotations available, which makes the semi-supervised track a promising direction for semantic segmentation. Current methods simply bundle these two different sets of annotations together to train a segmentation network. However, we discover that such treatment is problematic and achieves even worse results than just using strong labels, which indicates the misuse of the weak ones. To fully explore the potential of the weak labels, we propose to impose separate treatments of strong and weak annotations via a strong-weak dual-branch network, which discriminates the massive inaccurate weak supervisions from the strong ones. We design a shared network component to exploit the joint discrimination of strong and weak annotations; meanwhile, the proposed dual branches separately handle full and weak supervised learning and effectively eliminate their mutual interference. This simple architecture requires only slight additional computational cost during training yet brings significant improvements over previous methods. Experiments on two standard benchmark datasets show the effectiveness of the proposed method."



Paperid:230
Authors:Yuzhi Wang, Haibin Huang, Qin Xu, Jiaming Liu, Yiqun Liu, Jue Wang
Title: Practical Deep Raw Image Denoising on Mobile Devices
Abstract:
Deep learning-based image denoising approaches have been extensively studied in recent years, prevailing on many public benchmark datasets. However, state-of-the-art networks are computationally too expensive to be directly applied on mobile devices. In this work, we propose a light-weight, efficient neural network-based raw image denoiser that runs smoothly on mainstream mobile devices and produces high quality denoising results. Our key insights are twofold: (1) by measuring and estimating sensor noise level, a smaller network trained on synthetic sensor-specific data can outperform larger ones trained on general data; (2) the large noise level variation under different ISO settings can be removed by a novel k-Sigma Transform, allowing a small network to efficiently handle a wide range of noise levels. We conduct extensive experiments to demonstrate the efficiency and accuracy of our approach. Our proposed mobile-friendly denoising model runs at ~70 milliseconds per megapixel on the Qualcomm Snapdragon 855 chipset, and it is the basis of the night shot feature of several flagship smartphones released in 2019."
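As a hedged sketch of how a noise-level-normalizing transform of this kind can work, assume the common Poisson-Gaussian raw-noise model Var(x) = k*E[x] + sigma^2, with (k, sigma) depending on the ISO setting; mapping x to x/k + sigma^2/k^2 makes the variance approximately equal to the transformed mean regardless of (k, sigma), so a single small network can denoise in the transformed space (the exact form of the paper's k-Sigma Transform may differ):

import numpy as np

def k_sigma_forward(x, k, sigma):
    # Map raw intensities into an approximately ISO-invariant space,
    # assuming Var(x) = k * E[x] + sigma**2.
    return x / k + (sigma ** 2) / (k ** 2)

def k_sigma_inverse(y, k, sigma):
    # Map the denoised output back to the original raw intensity range.
    return k * y - (sigma ** 2) / k

# Hypothetical usage around a denoiser `net` operating in the transformed space:
#   y = k_sigma_forward(raw, k, sigma)
#   denoised_raw = k_sigma_inverse(net(y), k, sigma)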



Paperid:231
Authors:Changan Chen, Unnat Jain, Carl Schissler, Sebastia Vicenc Amengual Gari, Ziad Al-Halah, Vamsi Krishna Ithapu, Philip Robinson, and Kristen Grauman
Title: SoundSpaces: Audio-Visual Navigation in 3D Environments
Abstract:
Moving around in the world is naturally a multi-sensory experience, but today's embodied agents are deaf, restricted solely to their visual perception of the environment. We introduce audio-visual navigation for complex, acoustically and visually realistic 3D environments. By both seeing and hearing, the agent must learn to navigate to an audio-based target. We develop a multi-modal deep reinforcement learning pipeline to train navigation policies end-to-end from a stream of egocentric audio-visual observations, allowing the agent to (1) discover elements of the geometry of the physical space indicated by the reverberating audio and (2) detect and follow sound-emitting targets. We further introduce audio renderings based on geometrical acoustic simulations for a set of publicly available 3D assets and instrument AI-Habitat to support the new sensor, making it possible to insert arbitrary sound sources in an array of apartment, office, and hotel environments. Our results show that audio greatly benefits embodied visual navigation in 3D spaces."



Paperid:232
Authors:Yuanhao Zhai, Le Wang, Wei Tang, Qilin Zhang, Junsong Yuan, Gang Hua
Title: Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization
Abstract:
Weakly-supervised Temporal Action Localization (W-TAL) aims to classify and localize all action instances in an untrimmed video under only video-level supervision. However, without frame-level annotations, it is challenging for W-TAL methods to identify false positive action proposals and generate action proposals with precise temporal boundaries. In this paper, we present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges. The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated, and used to provide frame-level supervision for improved model training and false positive action proposal elimination. Furthermore, we propose a new attention normalization loss to encourage the predicted attention to act like a binary selection, and promote the precise localization of action instance boundaries. Experiments conducted on the THUMOS14 and ActivityNet datasets show that the proposed TSCN outperforms current state-of-the-art methods, and even achieves comparable results with some recent fully-supervised methods."



Paperid:233
Authors:Lvmin Zhang, Chengze Li, Yi JI, Chunping Liu, Tien-tsin Wong
Title: Erasing Appearance Preservation in Optimization-based Smoothing
Abstract:
Optimization-based image smoothing is routinely formulated as the game between a smoothing energy and an appearance preservation energy. Achieving adequate smoothing is a fundamental goal of these image smoothing algorithms. We show that partially "erasing" the appearance preservation facilitates adequate image smoothing. In this paper, we call this manipulation Erasing Appearance Preservation (EAP). We conduct a user study, allowing users to indicate the "erasing" positions by drawing scribbles interactively, to verify the correctness and effectiveness of EAP. We observe the characteristics of human-indicated "erasing" positions, and then formulate a simple and effective 0-1 knapsack problem to automatically synthesize the "erasing" positions. We test our synthesized erasing positions in a majority of image smoothing methods. Experimental results and large-scale perceptual human judgments show that the EAP solution tends to encourage the pattern separation or elimination capabilities of image smoothing algorithms. We further study the performance of the EAP solution in many image decomposition problems to decompose textures, shadows, and the challenging specular reflections. We also present examinations of diverse image manipulation applications like texture removal, retexturing, intrinsic decomposition, layer extraction, recoloring, material manipulation, etc. Due to the widespread applicability of image smoothing, the EAP is also likely to be used in more image editing applications."



Paperid:234
Authors:Tsu-Jui Fu, Xin Eric Wang, Matthew F. Peterson,Scott T. Grafton, Miguel P. Eckstein, William Yang Wang
Title: Counterfactual Vision-and-Language Navigation via Adversarial Path Sampler
Abstract:
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal by grounding natural language instructions to the visual surroundings. One of the problems of the VLN task is data scarcity, since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments. In this paper, we explore the use of counterfactual thinking as a human-inspired data augmentation method that results in robust models. Counterfactual thinking is a concept that describes the human propensity to create possible alternatives to life events that have already occurred. We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data. In particular, we present a model-agnostic adversarial path sampler (APS) that learns to sample challenging paths that force the navigator to improve based on the navigation performance. APS also performs pre-exploration of unseen environments to strengthen the model's ability to generalize. We evaluate the influence of APS on the performance of different VLN baseline models using the Room-to-Room dataset (R2R). The results show that the adversarial training process with our proposed APS benefits VLN models under both seen and unseen environments, and the pre-exploration process yields further improvements in unseen environments."



Paperid:235
Authors:Tatsumi Uezato, Danfeng Hong, Naoto Yokoya, Wei He
Title: Guided Deep Decoder: Unsupervised Image Pair Fusion
Abstract:
The fusion of input and guidance images that have a tradeoff in their information (e.g., hyperspectral and RGB image fusion or pansharpening) can be interpreted as one general problem. However, previous studies applied a task-specific handcrafted prior and did not address the problems with a unified approach. To address this limitation, in this study, we propose a guided deep decoder network as a general prior. The proposed network is composed of an encoder-decoder network that exploits multi-scale features of a guidance image and a deep decoder network that generates an output image. The two networks are connected by feature refinement units to embed the multi-scale features of the guidance image into the deep decoder network. The proposed network allows the network parameters to be optimized in an unsupervised way without training data. Our results show that the proposed network can achieve state-of-the-art performance in various image fusion problems."



Paperid:236
Authors:Jonghwa Yim, Jisung Yoo, Won-joon Do, Beomsu Kim, Jihwan Choe
Title: Filter Style Transfer between Photos
Abstract:
Over the past few years, image-to-image style transfer has risen to the frontiers of neural image processing. While conventional methods were successful in various tasks such as color and texture transfer between images, none could effectively work with the custom filter effects that are applied by users through various platforms like Instagram. In this paper, we introduce a new concept of style transfer, Filter Style Transfer (FST). Unlike conventional style transfer, FST can extract and transfer a custom filter style from a filtered style image to a content image. FST first infers the original image from a filtered reference via image-to-image translation. Then it estimates filter parameters from the difference between them. To resolve the ill-posed nature of reconstructing the original image from the reference, we represent each pixel color of an image by its class mean and deviation. Besides, to handle the intra-class color variation, we propose an uncertainty-based weighted least squares method for restoring an original image. To the best of our knowledge, FST is the first style transfer method that can transfer custom filter effects between FHD images in under 2 ms on a mobile device without any textual context loss."
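To make the weighted least squares step concrete, here is a minimal sketch that fits a hypothetical per-channel affine filter y ≈ a*x + b between the reconstructed original and the filtered reference, down-weighting pixels with high uncertainty; the actual filter parameterization in FST may be richer:

import numpy as np

def fit_affine_filter_wls(x, y, uncertainty, eps=1e-6):
    # x, y:        (N, 3) original and filtered pixel colors, one row per pixel.
    # uncertainty: (N,) per-pixel uncertainty; higher uncertainty -> lower weight.
    # Returns (a, b), each of shape (3,), one affine filter per color channel.
    w = 1.0 / (uncertainty + eps)
    a, b = np.empty(3), np.empty(3)
    for c in range(3):
        A = np.stack([x[:, c], np.ones_like(x[:, c])], axis=1)   # (N, 2) design matrix
        sw = np.sqrt(w)
        coeffs, *_ = np.linalg.lstsq(sw[:, None] * A, sw * y[:, c], rcond=None)
        a[c], b[c] = coeffs
    return a, b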



Paperid:237
Authors:Linpu Fang, Xingyan Liu, Li Liu, Hang Xu, Wenxiong Kang
Title: JGR-P2O: Joint Graph Reasoning based Pixel-to-Offset Prediction Network for 3D Hand Pose Estimation from a Single Depth Image
Abstract:
State-of-the-art single depth image-based 3D hand pose estimation methods are based on dense predictions, including voxel-to-voxel predictions, point-to-point regression, and pixel-wise estimations. Despite the good performance, those methods have a few inherent issues, such as the poor trade-off between accuracy and efficiency, and plain feature representation learning with local convolutions. In this paper, a novel pixel-wise prediction-based method is proposed to address the above issues. The key ideas are two-fold: a) explicitly modeling the dependencies among joints and the relations between the pixels and the joints for better local feature representation learning; b) unifying the dense pixel-wise offset predictions and direct joint regression for end-to-end training. Specifically, we first propose a graph convolutional network (GCN) based joint graph reasoning module to model the complex dependencies among joints and augment the representation capability of each pixel. Then we densely estimate all pixels' offsets to joints in both the image plane and depth space and calculate the joints' positions by a weighted average over all pixels' predictions, entirely discarding complex post-processing operations. The proposed model is implemented with an efficient 2D fully convolutional network (FCN) backbone and has only about 1.4M parameters. Extensive experiments on multiple 3D hand pose estimation benchmarks demonstrate that the proposed method achieves new state-of-the-art accuracy while running very efficiently at around 110 fps on a single NVIDIA 1080Ti GPU."
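The weighted-average aggregation of dense pixel-to-joint offsets described above is easy to state in a few lines; the sketch below uses hypothetical tensor shapes and omits the GCN-based reasoning module:

import torch

def joints_from_pixel_offsets(pixel_coords, offsets, weights):
    # pixel_coords: (P, 3) (u, v, depth) coordinates of every pixel.
    # offsets:      (P, J, 3) predicted offsets from each pixel to each joint.
    # weights:      (P, J) non-negative per-pixel, per-joint weights.
    # Returns (J, 3) joint positions as weighted averages of per-pixel votes,
    # with no further post-processing.
    votes = pixel_coords[:, None, :] + offsets                      # (P, J, 3)
    w = weights / weights.sum(dim=0, keepdim=True).clamp_min(1e-8)  # normalize per joint
    return (w[..., None] * votes).sum(dim=0)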



Paperid:238
Authors:Zhuo Su, Linpu Fang, Wenxiong Kang, Dewen Hu, Matti Pietikäinen, Li Liu
Title: Dynamic Group Convolution for Accelerating Convolutional Neural Networks
Abstract:
Replacing normal convolutions with group convolutions can significantly increase the computational efficiency of modern deep convolutional networks, which has been widely adopted in compact network architecture designs. However, existing group convolutions undermine the original network structure by cutting off some connections permanently, resulting in significant accuracy degradation. In this paper, we propose dynamic group convolution (DGC), which adaptively selects which part of the input channels to connect within each group for individual samples on the fly. Specifically, we equip each group with a small feature selector to automatically select the most important input channels conditioned on the input images. Multiple groups can adaptively capture abundant and complementary visual/semantic features for each input image. DGC preserves the original network structure while maintaining computational efficiency similar to conventional group convolution. Extensive experiments on multiple image classification benchmarks including CIFAR-10, CIFAR-100 and ImageNet demonstrate its superiority over existing group convolution techniques and dynamic execution methods. The code is available at https://github.com/zhuogege1943/dgc."
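A simplified stand-in for the per-sample channel selector is sketched below: a tiny gate scores the input channels of each image, only the top-k channels within each group are kept, and a grouped convolution follows. This is an illustrative approximation, not the paper's exact saliency-guided selector:

import torch
import torch.nn as nn

class DynamicGroupGate(nn.Module):
    def __init__(self, in_ch, out_ch, groups=4, keep_ratio=0.5):
        super().__init__()
        self.groups = groups
        self.keep = max(1, int(in_ch // groups * keep_ratio))
        self.gate = nn.Linear(in_ch, in_ch)                  # produces channel saliency scores
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, groups=groups)

    def forward(self, x):                                    # x: (B, C, H, W), C divisible by groups
        b, c, _, _ = x.shape
        scores = self.gate(x.mean(dim=(2, 3)))               # (B, C), conditioned on the input image
        scores = scores.view(b, self.groups, -1)
        idx = scores.topk(self.keep, dim=2).indices          # top-k channels per group, per sample
        mask = torch.zeros_like(scores).scatter_(2, idx, 1.0).view(b, c, 1, 1)
        return self.conv(x * mask)                           # zeroed channels are effectively disconnected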



Paperid:239
Authors:Yaoxiong Huang, Mengchao He, Lianwen Jin, Yongpan Wang
Title: RD-GAN: Few/Zero-Shot Chinese Character Style Transfer via Radical Decomposition and Rendering
Abstract:
Style transfer has attracted much interest owing to its various applications. Compared with English character or general artistic style transfer, Chinese character style transfer remains a challenge owing to the large size of the vocabulary (70224 characters in GB18010-2005) and the complexity of the structure. Recently, some GAN-based methods have been proposed for style transfer; however, they treat Chinese characters as a whole, ignoring the structures and radicals that compose characters. In this paper, a novel radical decomposition-and-rendering-based GAN (RD-GAN) is proposed to utilize the radical-level compositions of Chinese characters and achieve few-shot/zero-shot Chinese character style transfer. The RD-GAN consists of three components: a radical extraction module (REM), a radical rendering module (RRM), and a multi-level discriminator (MLD). Experiments demonstrate that our method has a powerful few-shot/zero-shot generalization ability by using the radical-level compositions of Chinese characters."



Paperid:240
Authors:Yuhui Yuan, Xilin Chen, Jingdong Wang
Title: Object-Contextual Representations for Semantic Segmentation
Abstract:
In this paper, we address the semantic segmentation problem with a focus on the context aggregation strategy. Our motivation is that the label of a pixel is the category of the object that the pixel belongs to. We present a simple yet effective approach, object-contextual representations, characterizing a pixel by exploiting the representation of the corresponding object class. First, we learn object regions under the supervision of the ground-truth segmentation. Second, we compute the object region representation by aggregating the representations of the pixels lying in the object region. Last, we compute the relation between each pixel and each object region, and augment the representation of each pixel with the object-contextual representation, which is a weighted aggregation of all the object region representations according to their relations with the pixel. We empirically demonstrate that the proposed approach achieves competitive performance on various challenging semantic segmentation benchmarks: Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff. Our submission "HRNet + OCR + SegFix" achieved first place on the Cityscapes leaderboard at the time of submission."
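The three aggregation steps above (region representations, pixel-region relations, and the augmented pixel representation) can be summarized in a small sketch; transformation layers and the supervision of the coarse regions are omitted, and the names are illustrative:

import torch

def object_contextual_representations(pixel_feats, region_logits):
    # pixel_feats:   (N, D) features of N pixels.
    # region_logits: (N, K) coarse scores of each pixel for K object regions.
    region_prob = torch.softmax(region_logits, dim=0)                # soft object regions over pixels
    region_feats = region_prob.t() @ pixel_feats                     # (K, D) object region representations
    relation = torch.softmax(pixel_feats @ region_feats.t(), dim=1)  # (N, K) pixel-region relations
    context = relation @ region_feats                                # (N, D) object-contextual representation
    return torch.cat([pixel_feats, context], dim=1)                  # (N, 2D) augmented pixel features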



Paperid:241
Authors:Zhihang Zhong, Ye Gao, Yinqiang Zheng, Bo Zheng
Title: Efficient Spatio-Temporal Recurrent Neural Network for Video Deblurring
Abstract:
Real-time video deblurring still remains a challenging task due to the complexity of spatially and temporally varying blur itself and the requirement of low computational cost. To improve the network efficiency, we adopt residual dense blocks into RNN cells, so as to efficiently extract the spatial features of the current frame. Furthermore, a global spatio-temporal attention module is proposed to fuse the effective hierarchical features from past and future frames to help better deblur the current frame. For evaluation, we also collect a novel dataset with paired blurry/sharp video clips by using a co-axis beam splitter system. Through experiments on synthetic and realistic datasets, we show that our proposed method can achieve better deblurring performance both quantitatively and qualitatively with less computational cost against state-of-the-art video deblurring methods. "



Paperid:242
Authors:Steffen Wolf, Yuyan Li, Constantin Pape, Alberto Bailoni, Anna Kreshuk, Fred A. Hamprecht
Title: Joint Semantic Instance Segmentation on Graphs with the Semantic Mutex Watershed
Abstract:
Semantic instance segmentation is the task of simultaneously partitioning an image into distinct segments while associating each pixel with a class label. In commonly used pipelines, segmentation and label assignment are solved separately since joint optimization is computationally expensive. We propose a greedy algorithm for joint graph partitioning and labeling derived from the efficient Mutex Watershed partitioning algorithm. It optimizes an objective function closely related to the Symmetric Multiway Cut objective and empirically shows efficient scaling behavior. Due to the algorithm's efficiency, it can operate directly on pixels without prior over-segmentation of the image into superpixels. We evaluate the performance on the Cityscapes dataset (2D urban scenes) and on a 3D microscopy volume. In urban scenes, the proposed algorithm combined with current deep neural networks outperforms the strong baseline of 'Panoptic Feature Pyramid Networks' by Kirillov et al. (2019). On the 3D electron microscopy images, we show explicitly that our joint formulation outperforms a separate optimization of the partitioning and labeling problems."



Paperid:243
Authors:Jiayong Peng, Zhiwei Xiong, Xin Huang, Zheng-Ping Li, Dong Liu, Feihu Xu
Title: Photon-Efficient 3D Imaging with A Non-Local Neural Network
Abstract:
Photon-efficient imaging has enabled a number of applications relying on single-photon sensors that can capture a 3D image with as few as one photon per pixel. In practice, however, measurements of low photon counts are often mixed with heavy background noise, which poses a great challenge for existing computational reconstruction algorithms. In this paper, we first analyze the long-range correlations in both spatial and temporal dimensions of the measurements. Then we propose a non-local neural network for depth reconstruction by exploiting the long-range correlations. The proposed network achieves decent reconstruction fidelity even under photon counts (and signal-to-background ratio, SBR) as low as 1 photon/pixel (and 0.01 SBR), which significantly surpasses the state-of-the-art. Moreover, our non-local network trained on simulated data can be well generalized to different real-world imaging systems, which could extend the application scope of photon-efficient imaging in challenging scenarios with a strict limit on optical flux. Code is available at https://github.com/JiayongO-O/PENonLocal."



Paperid:244
Authors:Ricardo Martin-Brualla, Rohit Pandey, Sofien Bouaziz, Matthew Brown, Dan B Goldman
Title: GeLaTO: Generative Latent Textured Objects
Abstract:
Accurate modeling of 3D objects exhibiting transparency, reflections and thin structures is an extremely challenging problem. Inspired by billboards and geometric proxies used in computer graphics, this paper proposes Generative Latent Textured Objects (GeLaTO), a compact representation that combines a set of coarse shape proxies defining low frequency geometry with learned neural textures, to encode both medium and fine scale geometry as well as view-dependent appearance. To generate the proxies' textures, we learn a joint latent space allowing category-level appearance and geometry interpolation. The proxies are independently rasterized with their corresponding neural texture and composited using a U-Net, which generates an output photorealistic image including an alpha map. We demonstrate the effectiveness of our approach by reconstructing complex objects from a sparse set of views. We show results on a dataset of real images of eyeglasses frames, which are particularly challenging to reconstruct using classical methods. We also demonstrate that these coarse proxies can be handcrafted when the underlying object geometry is easy to model, like eyeglasses, or generated using a neural network for more complex categories, such as cars."



Paperid:245
Authors:Arjun Majumdar, Ayush Shrivastava, Stefan Lee, Peter Anderson, Devi Parikh, Dhruv Batra
Title: Improving Vision-and-Language Navigation with Image-Text Pairs from the Web
Abstract:
Following a navigation instruction such as 'Walk down the stairs and stop at the brown sofa' requires embodied AI agents to ground referenced scene elements (e.g. 'stairs') to visual content in the environment (pixels corresponding to 'stairs'). We ask the following question -- can we leverage abundant 'disembodied' web-scraped vision-and-language corpora (e.g. Conceptual Captions) to learn visual groundings that improve performance on a relatively data-starved embodied perception task (Vision-and-Language Navigation)? Specifically, we develop VLN-BERT, a visiolinguistic transformer-based model for scoring the compatibility between an instruction ('...stop at the brown sofa') and a trajectory of panoramic RGB images captured by the agent. We demonstrate that pretraining VLN-BERT on image-text pairs from the web before fine-tuning on embodied path-instruction data significantly improves performance on VLN -- outperforming the prior state of the art in the fully-observed setting by 4 absolute percentage points on success rate. Ablations of our pretraining curriculum show each stage to be impactful -- with their combination resulting in further gains."



Paperid:246
Authors:Xinyu Li, Bing Shuai, Joseph Tighe
Title: Directional Temporal Modeling for Action Recognition
Abstract:
Many current activity recognition models use 3D convolutional neural networks (e.g. I3D, I3D-NL) to generate local spatial-temporal features. However, such features do not encode clip-level ordered temporal information. In this paper, we introduce a channel independent directional convolution (CIDC) operation, which learns to model the temporal evolution among local features. By applying multiple CIDC units we construct a light-weight network that models the clip-level temporal evolution across multiple spatial scales. Our CIDC network can be attached to any activity recognition backbone network. We evaluate our method on four popular activity recognition datasets and consistently improve upon state-of-the-art techniques. We further visualize the activation map of our CIDC network and show that it is able to focus on more meaningful, action related parts of the frame."



Paperid:247
Authors:Frank Dellaert, David M. Rosen, Jing Wu, Robert Mahony, Luca Carlone
Title: Shonan Rotation Averaging: Global Optimality by Surfing SO(p)(n)
Abstract:
Shonan Rotation Averaging is a fast, simple, and elegant rotation averaging algorithm that is guaranteed to recover globally optimal solutions under mild assumptions on the measurement noise. Our method employs semidefinite relaxation in order to recover provably globally optimal solutions of the rotation averaging problem. In contrast to prior work, we show how to solve large-scale instances of these relaxations using manifold minimization on (only slightly) higher-dimensional rotation manifolds, re-using existing high-performance (but local) structure-from-motion pipelines. Our method thus preserves the speed and scalability of those, while enabling the recovery of globally optimal solutions."



Paperid:248
Authors:Devendra Singh Chaplot, Helen Jiang, Saurabh Gupta, Abhinav Gupta
Title: Semantic Curiosity for Active Visual Learning
Abstract:
In this paper, we study the task of embodied interactive learning for object detection. Given a set of environments (and some labeling budget), our goal is to learn an object detector by having an agent select what data to obtain labels for. How should an exploration policy decide which trajectory should be labeled? One possibility is to use a trained object detector's failure cases as an external reward. However, this would require labeling the millions of frames needed to train RL policies, which is infeasible. Instead, we explore a self-supervised approach for training our exploration policy by introducing a notion of semantic curiosity. Our semantic curiosity policy is based on a simple observation -- the detection outputs should be consistent. Therefore, our semantic curiosity rewards trajectories with inconsistent labeling behavior and encourages the exploration policy to explore such areas. The exploration policy trained via semantic curiosity generalizes to novel scenes and helps train an object detector that outperforms baselines trained with other possible alternatives such as random exploration, prediction-error curiosity, and coverage-maximizing exploration."
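One minimal way to turn "inconsistent labeling behavior" into a scalar reward is to measure the label entropy of each object track along the trajectory, as in the hedged sketch below (the track construction and the exact reward used in the paper are not specified here):

import numpy as np

def semantic_curiosity_reward(track_label_histories):
    # track_label_histories: list of per-track label sequences observed along a
    # trajectory, e.g. [[2, 2, 5], [7, 7, 7]]; flipping labels -> higher reward.
    entropies = []
    for labels in track_label_histories:
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        entropies.append(float(-(p * np.log(p + 1e-12)).sum()))
    return float(np.mean(entropies)) if entropies else 0.0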



Paperid:249
Authors:Dongwon Park, Dong Un Kang, Jisoo Kim, Se Young Chun
Title: Multi-Temporal Recurrent Neural Networks For Progressive Non-Uniform Single Image Deblurring With Incremental Temporal Training
Abstract:
Blind non-uniform image deblurring for severe blurs induced by large motions is still challenging. The multi-scale (MS) approach has been widely used for deblurring: it sequentially recovers the downsampled original image at a low spatial scale first and then further restores it at higher spatial scales using the result(s) from the lower spatial scale(s). Here, we investigate a novel alternative to MS, called multi-temporal (MT), for non-uniform single image deblurring by exploiting a time-resolved deblurring dataset from high-speed cameras. The MT approach models a severe blur as a series of small blurs, so it progressively removes small amounts of blur at the original spatial scale instead of restoring the image at different spatial scales. To realize the MT approach, we propose progressive deblurring over iterations and incremental temporal training with temporally augmented training data. Our MT approach, which can be seen as a form of curriculum learning in a wide sense, allows a number of state-of-the-art MS-based deblurring methods to yield improved performance without using the MS approach. We also propose an MT recurrent neural network with recurrent feature maps that outperforms state-of-the-art deblurring methods with the smallest number of parameters."



Paperid:250
Authors:Jiashu Zhu, Dong Li, Tiantian Han, Lu Tian, Yi Shan
Title: ProgressFace: Scale-Aware Progressive Learning for Face Detection
Abstract:
Scale variation stands out as one of the key challenges in face detection. Recent attempts have been made to cope with this issue by incorporating image / feature pyramids or adjusting anchor sampling / matching strategies. In this work, we propose a novel scale-aware progressive training mechanism to address large scale variations across faces. Inspired by curriculum learning, our method gradually learns large-to-small face instances. The preceding models learned with easier samples (i.e., large faces) can provide good initialization for succeeding learning with harder samples (i.e., small faces), ultimately reaching a better optimum for face detectors. Moreover, we propose an auxiliary anchor-free enhancement module to facilitate the learning of small faces by supplying positive anchors that may not be covered according to the criterion of IoU overlap. This anchor-free module is removed during inference, and hence no extra computation cost is introduced. Extensive experimental results demonstrate the superiority of our method compared to the state of the art on the standard FDDB and WIDER FACE benchmarks. In particular, our ProgressFace-Light with a MobileNet-0.25 backbone achieves 87.9% AP on the hard set of WIDER FACE, surpassing RetinaFace with the same backbone by a large margin of 9.7%. Code and our trained face detection models are available at https://github.com/jiashu-zhu/ProgressFace."



Paperid:251
Authors:Erik Nijkamp, Bo Pang, Tian Han, Linqi Zhou, Song-Chun Zhu, Ying Nian Wu
Title: Learning Multi-layer Latent Variable Model via Variational Optimization of Short Run MCMC for Approximate Inference
Abstract:
This paper studies the fundamental problem of learning deep generative models that consist of multiple layers of latent variables organized in top-down architectures. Such models have high expressivity and allow for learning hierarchical representations. Learning such a generative model requires inferring the latent variables for each training example based on the posterior distribution of these latent variables. The inference typically requires Markov chain Monte Carlo (MCMC), which can be time-consuming. In this paper, we propose to use noise-initialized non-persistent short-run MCMC, such as finite-step Langevin dynamics initialized from the prior distribution of the latent variables, as an approximate inference engine, where the step size of the Langevin dynamics is variationally optimized by minimizing the Kullback-Leibler divergence between the distribution produced by the short-run MCMC and the posterior distribution. Our experiments show that the proposed method outperforms the variational auto-encoder (VAE) in terms of reconstruction error and synthesis quality. The advantage of the proposed method is that it is simple and automatic, without the need to design an inference model."
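A bare-bones version of noise-initialized short-run Langevin inference for a single-layer latent model is sketched below; it assumes a Gaussian prior and unit observation noise, and uses a fixed step size rather than the variationally optimized one described above:

import torch

def short_run_langevin(x, generator, z_dim, steps=20, step_size=0.1):
    # x: (B, ...) observed data; generator: maps (B, z_dim) latents to data space.
    z = torch.randn(x.size(0), z_dim, requires_grad=True)      # initialize from the prior
    for _ in range(steps):
        log_joint = (-0.5 * z.pow(2).sum()                      # log prior N(0, I)
                     - 0.5 * (x - generator(z)).pow(2).sum())   # log likelihood, unit variance
        grad, = torch.autograd.grad(log_joint, z)
        with torch.no_grad():                                   # Langevin update on the latents
            z += 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(z)
    return z.detach()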



Paperid:252
Authors:Zhensheng Shi, Cheng Guan, Liangjie Cao, Qianqian Li, Ju Liang, Zhaorui Gu, Haiyong Zheng, Bing Zheng
Title: CoTeRe-Net: Discovering Collaborative Ternary Relations in Videos
Abstract:
Modeling relations is crucial to understanding videos for action and behavior recognition. Current relation models mainly reason about relations of invisibly implicit cues, while important relations of visually explicit cues are rarely considered, and the collaboration between them is usually ignored. In this paper, we propose a novel relation model that discovers relations of both implicit and explicit cues as well as their collaboration in videos. Our model concerns Collaborative Ternary Relations (CoTeRe), where the ternary relation involves channel (C, for implicit), temporal (T, for implicit), and spatial (S, for explicit) relation (R). We devise a flexible and effective CTSR module to collaborate ternary relations for 3D-CNNs, and then construct CoTeRe-Nets for action recognition. Extensive experiments on both ablation study and performance evaluation demonstrate that our CTSR module is significantly effective, with approximately 3% gains, and that our CoTeRe-Nets outperform state-of-the-art approaches on three popular benchmarks. Boost analysis and relation visualization also validate that relations of both implicit and explicit cues are discovered effectively by our method. Our code is available at https://github.com/zhenglab/cotere-net."



Paperid:253
Authors:Frank Verbiest, Marc Proesmans, Luc Van Gool
Title: Modeling the Effects of Windshield Refraction for Camera Calibration
Abstract:
In this paper, we study the effects of windshield refraction for autonomous driving applications. These distortion effects are surprisingly large and can not be explained by traditional camera models. Instead of using a generalized camera approach, we propose a novel approach to jointly optimize a traditional camera model, and a mathematical representation of the windshield’s surface. First, using the laws of geometric optics, the refraction is modeled using a local spherical approximation to the windshield’s geometry. Next, a spline-based model is proposed as a refinement to better adapt to deviations from the ideal shape and manufacturing variations. By jointly optimizing refraction and camera parameters, the projection error can be significantly reduced. The proposed models are validated on real windshield observations and custom setups to compare recordings with and without windshield, with accurate laser scan measurements as 3D ground truth."



Paperid:254
Authors:Prashant Pandey, Aayush Kumar Tyagi, Sameer Ambekar, Prathosh AP
Title: Unsupervised Domain Adaptation for Semantic Segmentation of NIR Images through Generative Latent Search
Abstract:
Segmentation of the pixels corresponding to human skin is an essential first step in multiple applications ranging from surveillance to heart-rate estimation from remote photoplethysmography. However, the existing literature considers the problem only in the visible range of the EM spectrum, which limits its utility in low- or no-light settings where the criticality of the application is higher. To alleviate this problem, we consider the problem of skin segmentation from near-infrared images. However, deep learning based state-of-the-art segmentation techniques demand large amounts of labelled data, which are unavailable for the current problem. Therefore, we cast the skin segmentation problem as that of target-independent Unsupervised Domain Adaptation (UDA), where we use data from the red channel of the visible range to develop a skin segmentation algorithm for NIR images. We propose a method for target-independent segmentation where the 'nearest-clone' of a target image in the source domain is searched for and used as a proxy in the segmentation network trained only on the source domain. We prove the existence of the 'nearest-clone' and propose a method to find it through an optimization algorithm over the latent space of a deep generative model based on variational inference. We demonstrate the efficacy of the proposed method for NIR skin segmentation over state-of-the-art UDA segmentation methods on two newly created skin segmentation datasets in the NIR domain, despite not having access to the target NIR data. Additionally, we report state-of-the-art results for adaptation from Synthia to Cityscapes, a popular setting in unsupervised domain adaptation for semantic segmentation. The code and datasets are available at https://github.com/ambekarsameer96/GLSS."
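As a rough illustration of searching the latent space for the 'nearest clone', the sketch below optimizes a latent code so that the decoder output matches the target image under an L2 distance; the distance, decoder, and optimizer settings here are assumptions rather than the paper's exact procedure:

import torch

def find_nearest_clone(x_target, decoder, z_dim, steps=200, lr=0.05):
    # x_target: target-domain image tensor; decoder: generative model trained on
    # the source domain, mapping (1, z_dim) latents to images of the same shape.
    z = torch.zeros(1, z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (decoder(z) - x_target).pow(2).mean()
        loss.backward()
        opt.step()
    # decoder(z) is the source-domain proxy fed to the source-only segmentation network.
    return decoder(z).detach()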



Paperid:255
Authors:Eunhyeok Park, Sungjoo Yoo
Title: PROFIT: A Novel Training Method for sub-4-bit MobileNet Models
Abstract:
4-bit and lower precision mobile models are required due to the ever-increasing demand for better energy efficiency in mobile devices. In this work, we report that the activation instability induced by weight quantization (AIWQ) is the key obstacle to sub-4-bit quantization of mobile networks. To alleviate the AIWQ problem, we propose a novel training method called PROgressive-Freezing Iterative Training (PROFIT), which attempts to freeze layers whose weights are affected by the instability problem more strongly than the other layers. We also propose a differentiable and unified quantization method (DuQ) and a negative padding idea to support asymmetric activation functions such as h-swish. We evaluate the proposed methods by quantizing MobileNet-v1, v2, and v3 on ImageNet and report that 4-bit quantization offers accuracy comparable (within 1.48% top-1 accuracy) to the full precision baseline. In the ablation study of the 3-bit quantization of MobileNet-v3, our proposed method outperforms the state-of-the-art method by a large margin of 12.86% top-1 accuracy. The quantized models and source code are available at https://github.com/EunhyeokPark/PROFIT."



Paperid:256
Authors:Junbin Xiao, Xindi Shang, Xun Yang, Sheng Tang, Tat-Seng Chua
Title: Visual Relation Grounding in Videos
Abstract:
In this paper, we explore a novel task named visual Relation Grounding in Videos (vRGV). The task aims at spatio-temporally localizing the given relations in the form of subject-predicate-object in videos, so as to provide supportive visual facts for other high-level video-language tasks (e.g., video-language grounding and video question answering). The challenges in this task include but are not limited to: (1) both the subject and object are required to be spatio-temporally localized to ground a query relation; (2) the temporal dynamic nature of visual relations in videos is difficult to capture; and (3) the grounding should be achieved without any direct supervision in space and time. To ground the relations, we tackle the challenges by collaboratively optimizing two sequences of regions over a constructed hierarchical spatio-temporal region graph through relation attending and reconstruction, in which we further propose a message passing mechanism by spatial attention shifting between visual entities. Experimental results demonstrate that our model not only outperforms baseline approaches significantly, but also produces visually meaningful facts to support visual grounding. (Code is available at https://github.com/doc-doc/vRGV)."



Paperid:257
Authors:Andrei Zanfir, Eduard Gabriel Bazavan, Hongyi Xu, William T. Freeman, Rahul Sukthankar, Cristian Sminchisescu
Title: Weakly Supervised 3D Human Pose and Shape Reconstruction with Normalizing Flows
Abstract:
Monocular 3D human pose and shape estimation is challenging due to the many degrees of freedom of the human body and the difficulty of acquiring training data for large-scale supervised learning in complex visual scenes, where humans with diverse shape and appearance appear against complex backgrounds, in a variety of poses, and are partially occluded or involved in interactions. Essential to learning is leveraging effective 3D human priors, and the ability to work under weak supervision, at scale, by exploiting, to the largest extent, the detailed human body semantics in images. In this paper we present new priors as well as large-scale weakly supervised models for 3D human pose and shape estimation. Key to our formulation are new latent normalizing flow representations, as well as fully differentiable, structurally-sensitive, semantic body part alignment (re-projection) loss functions that ensure consistent estimates and sharp feedback signals for learning. In extensive experiments using both motion capture datasets like CMU, Human3.6M, 3DPW, or AMASS, and repositories like COCO, we show that our proposed methods outperform existing counterparts, supporting the construction of an increasingly more accurate family of models based on large-scale training with unlabeled image data."



Paperid:258
Authors:Dario Pavllo, Aurelien Lucchi, Thomas Hofmann
Title: Controlling Style and Semantics in Weakly-Supervised Image Generation
Abstract:
We propose a weakly-supervised approach for conditional image generation of complex scenes where a user has fine control over objects appearing in the scene. We exploit sparse semantic maps to control object shapes and classes, as well as textual descriptions or attributes to control both local and global style. In order to condition our model on textual descriptions, we introduce a semantic attention module whose computational cost is independent of the image resolution. To further augment the controllability of the scene, we propose a two-step generation scheme that decomposes background and foreground. The label maps used to train our model are produced by a large-vocabulary object detector, which enables access to unlabeled data and provides structured instance information. In such a setting, we report better FID scores compared to fully-supervised settings where the model is trained on ground-truth semantic maps. We also showcase the ability of our model to manipulate a scene on complex datasets such as COCO and Visual Genome."



Paperid:259
Authors:Daniel R. Kepple, Daewon Lee, Colin Prepsius, Volkan Isler, Il Memming Park, Daniel D. Lee
Title: Jointly learning visual motion and confidence from local patches in event cameras
Abstract:
We propose the first network to jointly learn visual motion and confidence from events in spatially local patches. Event-based sensors deliver high temporal resolution motion information in a sparse, non-redundant format. This creates the potential for low computation, low latency motion recognition. Neural networks which extract global motion information, however, are generally computationally expensive. Here, we introduce a novel shallow and compact neural architecture and learning approach to capture reliable visual motion information along with the corresponding confidence of inference. Our network makes a prediction of the visual motion at each spatial location using only local events. Our confidence network then identifies which of these predictions will be accurate. In the task of recovering pan-tilt ego velocities from events, we show that each individual confident local prediction of our network can be expected to be as accurate as state of the art optimization approaches which utilize the full image. Furthermore, on a publicly available dataset, we find our local predictions generalize to scenes with camera motions and the presence of independently moving objects. This makes the output of our network well suited for motion based tasks, such as the segmentation of independently moving objects. We demonstrate on a publicly available motion segmentation dataset that restricting predictions to confident regions is sufficient to achieve results that exceed state of the art methods. "



Paperid:260
Authors:Soichiro Fujita, Tsutomu Hirao, Hidetaka Kamigaito, Manabu Okumura, Masaaki Nagata
Title: SODA: Story Oriented Dense Video Captioning Evaluation Framework
Abstract:
Dense Video Captioning (DVC) is a challenging task that localizes all events in a short video and describes them with natural language sentences. The main goal of DVC is video story description, that is, to generate a concise video story that supports human video comprehension without watching it. In recent years, DVC has attracted increasing attention in the Vision and Language research community, and has been employed as a task in the ActivityNet Challenge workshop. In the current research community, the official scorer provided by the ActivityNet Challenge is the de-facto standard evaluation framework for DVC systems. It computes averaged METEOR scores for matched pairs between generated and reference captions whose Intersection over Union (IoU) exceeds a specific threshold value. However, the current framework does not take into account the story of the video, that is, the ordering of captions. It also tends to give high scores to systems that generate redundant captions, sometimes several hundred of them, that humans cannot read. This paper proposes a new evaluation framework, Story Oriented Dense video cAptioning evaluation framework (SODA), for measuring the performance of video story description systems. SODA first tries to find a temporally optimal matching between generated and reference captions to capture the story of a video. Then, it computes METEOR scores for the matching and derives F-measure scores from the METEOR scores to penalize redundant captions. To demonstrate that SODA gives low scores to inadequate captions in terms of video story description, we evaluate two state-of-the-art systems with it, varying the number of captions. The results show that SODA gives low scores for too many or too few captions and high scores for captions whose number equals that of the reference, while the current framework gives good scores in all cases. Furthermore, we show that SODA tends to give lower scores than the current evaluation framework when evaluating captions with incorrect order."



Paperid:261
Authors:Aditay Tripathi, Rajath R. Dani, Anand Mishra and Anirban Chakraborty
Title: Sketch-Guided Object Localization in Natural Images
Abstract:
We introduce a novel problem of localizing all the instances of an object (seen or unseen during training) in a natural image via sketch query. We refer to this problem as sketch-guided object localization. This problem is distinctively different from the traditional sketch-based image retrieval task where the gallery set often contains images with only one object. The sketch-guided object localization proves to be more challenging when we consider the following: (i) the sketches used as queries are abstract representations with little information on the shape and salient attributes of the object, (ii) the sketches have significant variability as they are hand-drawn by a diverse set of untrained human subjects, and (iii) there exists a domain gap between sketch queries and target natural images as these are sampled from very different data distributions. To address the problem of sketch-guided object localization, we propose a novel cross-modal attention scheme that guides the region proposal network (RPN) to generate object proposals relevant to the sketch query. These object proposals are later scored against the query to obtain final localization. Our method is effective with as little as a single sketch query. Moreover, it also generalizes well to object categories not seen during training and is effective in localizing multiple object instances present in the image. Furthermore, we extend our framework to a multi-query setting using novel feature fusion and attention fusion strategies introduced in this paper. The localization performance is evaluated on publicly available object detection benchmarks, viz. MS-COCO and PASCAL-VOC, with sketch queries obtained from 'Quick, Draw!'. The proposed method significantly outperforms related baselines on both single-query and multi-query localization tasks."



Paperid:262
Authors:Malik Boudiaf, Jérôme Rony, Imtiaz Masud Ziko, Eric Granger, Marco Pedersoli, Pablo Piantanida, Ismail Ben Ayed
Title: A unifying mutual information view of metric learning: cross-entropy vs. pairwise losses
Abstract:
Recently, substantial research efforts in Deep Metric Learning (DML) focused on designing complex pairwise-distance losses, which require convoluted schemes to ease optimization, such as sample mining or pair weighting. The standard cross-entropy loss for classification has been largely overlooked in DML. On the surface, the cross-entropy may seem unrelated and irrelevant to metric learning as it does not explicitly involve pairwise distances. However, we provide a theoretical analysis that links the cross-entropy to several well-known and recent pairwise losses. Our connections are drawn from two different perspectives: one based on an explicit optimization insight; the other on discriminative and generative views of the mutual information between the labels and the learned features. First, we explicitly demonstrate that the cross-entropy is an upper bound on a new pairwise loss, which has a structure similar to various pairwise losses: it minimizes intra-class distances while maximizing inter-class distances. As a result, minimizing the cross-entropy can be seen as an approximate bound-optimization (or Majorize-Minimize) algorithm for minimizing this pairwise loss. Second, we show that, more generally, minimizing the cross-entropy is actually equivalent to maximizing the mutual information, to which we connect several well-known pairwise losses. Furthermore, we show that various standard pairwise losses can be explicitly related to one another via bound relationships. Our findings indicate that the cross-entropy represents a proxy for maximizing the mutual information – as pairwise losses do – without the need for convoluted sample-mining heuristics. Our experiments over four standard DML benchmarks strongly support our findings. We obtain state-of-the-art results, outperforming recent and complex DML methods."
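
To make the claimed connection tangible, the sketch below places the standard cross-entropy next to a generic pairwise loss of the kind the abstract refers to; both act on the same embedding distances. The margin value and the exact pairwise form are illustrative assumptions, not the bound derived in the paper (Python/PyTorch):

import torch
import torch.nn.functional as F

def cross_entropy_loss(embeddings, labels, class_weights):
    """Standard softmax cross-entropy computed on top of learned embeddings."""
    logits = embeddings @ class_weights.t()              # (B, num_classes)
    return F.cross_entropy(logits, labels)

def simple_pairwise_loss(embeddings, labels, margin=0.5):
    """Generic pairwise loss: shrink intra-class distances, keep inter-class
    distances above a margin (assumes each class appears at least twice)."""
    d = torch.cdist(embeddings, embeddings)              # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos = d[same & ~eye]                                 # intra-class pairs
    neg = d[~same]                                       # inter-class pairs
    return pos.mean() + F.relu(margin - neg).mean()

# Both objectives shape the same quantity, distances between embeddings of the
# same vs. different classes, which is the relationship the paper formalizes.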



Paperid:263
Authors:Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu
Title: Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Abstract:
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER have significantly lifted state of the art across a wide range of V+L benchmarks. However, little is known about the inner mechanisms that underlie their impressive success. To reveal the secrets behind the scene, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection) generalizable to standard pre-trained V+L models, to decipher the inner workings of multimodal pre-training (e.g., implicit knowledge garnered in individual attention heads, inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending over text rather than images during inference. (ii) There exists a subset of attention heads that are tailored for capturing cross-modal interactions. (iii) The learned attention matrix in pre-trained models demonstrates patterns coherent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually-interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These are valuable insights serving to guide future work towards designing better model architecture and objectives for multimodal pre-training."



Paperid:264
Authors:William Peebles, John Peebles, Jun-Yan Zhu, Alexei Efros, Antonio Torralba
Title: The Hessian Penalty: A Weak Prior for Unsupervised Disentanglement
Abstract:
Existing popular methods for disentanglement rely on hand-picked priors and complex encoder-based architectures. In this paper, we propose the Hessian Penalty, a simple regularization function that encourages the input Hessian of a function to be diagonal. Our method is completely model-agnostic and can be applied to any deep generator with just a few lines of code. We show that our method automatically uncovers meaningful factors of variation in the standard basis when applied to ProgressiveGAN across several datasets. Additionally, we demonstrate that our regularization term can be used to identify interpretable directions in BigGAN's latent space in a fully unsupervised fashion. Finally, we provide empirical evidence that our regularization term encourages sparsity when applied to overparameterized latent spaces."
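
The "few lines of code" claim can be illustrated with a finite-difference regularizer in the spirit of the penalty described above: perturb the generator input along random Rademacher directions and penalize the variance of the second-order differences, which is non-zero only when off-diagonal Hessian terms exist. The number of samples, the step size, and the max-reduction are assumptions of this sketch (Python/PyTorch):

import torch

def hessian_penalty_sketch(G, z, k=2, eps=0.1):
    """Finite-difference regularizer discouraging off-diagonal terms of the
    input Hessian of G (k Rademacher directions, central differences)."""
    second_diffs = []
    for _ in range(k):
        v = torch.randint(0, 2, z.shape, device=z.device).float() * 2 - 1   # Rademacher direction
        central = (G(z + eps * v) - 2 * G(z) + G(z - eps * v)) / (eps ** 2)
        second_diffs.append(central)
    second_diffs = torch.stack(second_diffs)             # (k, ...)
    # If the Hessian were diagonal, all directions would give the same value.
    return second_diffs.var(dim=0, unbiased=True).max()

# Usage (hypothetical weight lambda_hp):
# loss = gan_loss + lambda_hp * hessian_penalty_sketch(generator, z)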



Paperid:265
Authors:Ahmed A. A. Osman, Timo Bolkart, Michael J. Black
Title: STAR: Sparse Trained Articulated Human Body Regressor
Abstract:
The SMPL body model is widely used for the estimation, synthesis, and analysis of 3D human pose and shape. While popular, we show that SMPL has several limitations and introduce STAR, which is quantitatively and qualitatively superior to SMPL. First, SMPL has a huge number of parameters resulting from its use of global blend shapes. These dense pose-corrective offsets relate every vertex on the mesh to all the joints in the kinematic tree, capturing spurious long-range correlations. To address this, we define per-joint pose correctives and learn the subset of mesh vertices that are influenced by each joint movement. This sparse formulation results in more realistic deformations and significantly reduces the number of model parameters to 20% of SMPL. When trained on the same data as SMPL, STAR generalizes better despite having many fewer parameters. Second, SMPL factors pose-dependent deformations from body shape while, in reality, people with different shapes deform differently. Consequently, we learn shape-dependent pose-corrective blendshapes that depend on both body pose and BMI. Third, we show that the shape space of SMPL is not rich enough to capture the variation in the human population. We address this by training STAR with an additional 10,000 scans of male and female subjects, and show that this results in better model generalization. STAR is compact, generalizes better to new bodies and is a drop-in replacement for SMPL. STAR is publicly available for research purposes at http://star.is.tue.mpg.de."



Paperid:266
Authors:Xinghao Chen, Yiman Zhang, Yunhe Wang, Han Shu, Chunjing Xu, Chang Xu
Title: Optical Flow Distillation: Towards Efficient and Stable Video Style Transfer
Abstract:
Video style transfer techniques inspire many exciting applications on mobile devices. However, their efficiency and stability are still far from satisfactory. To boost the transfer stability across frames, optical flow is widely adopted, despite its high computational complexity, e.g., occupying over 97% of the inference time. This paper proposes to learn a lightweight video style transfer network via a knowledge distillation paradigm. We adopt two teacher networks, one of which takes optical flow during inference while the other does not. The output difference between these two teacher networks highlights the improvements made by optical flow, which is then adopted to distill the target student network. Furthermore, a low-rank distillation loss is employed to stabilize the output of the student network by mimicking the rank of input videos. Extensive experiments demonstrate that our student network without an optical flow module is still able to generate stable videos and runs much faster than the teacher network."



Paperid:267
Authors:Sihui Luo, Wenwen Pan, Xinchao Wang, Dazhou Wang, Haihong Tang, Mingli Song
Title: Collaboration by Competition: Self-coordinated Knowledge Amalgamation for Multi-talent Student Learning
Abstract:
A vast number of well-trained deep networks have been released by developers online for plug-and-play use. These networks specialize in different tasks and, in many cases, the data and annotations used to train them are not publicly available. In this paper, we study how to reuse such heterogeneous pre-trained models as teachers, and build a versatile and compact student model, without accessing human annotations. To this end, we propose a self-coordinated knowledge amalgamation network (SOKA-Net) for learning the multi-talent student model. This is achieved via a dual-step adaptive competitive-cooperation training approach, in which the knowledge of the heterogeneous teachers is first amalgamated to guide the shared parameter learning of the student network, followed by a gradient-based competition-balancing strategy that, in the second step, learns the multi-head prediction network as well as the loss weightings of the distinct tasks. The two steps, which we term the collaboration and competition steps respectively, are performed alternately until the competition reaches a balance for the ultimate collaboration. Experimental results demonstrate that the learned student not only comes with a smaller size but also achieves performance on par with or even superior to that of the teachers."



Paperid:268
Authors:Shizhen Zhao, Changxin Gao, Jun Zhang, Hao Cheng, Chuchu Han, Xinyang Jiang, Xiaowei Guo, Wei-Shi Zheng, Nong Sang, Xing Sun
Title: Do Not Disturb Me: Person Re-identification Under the Interference of Other Pedestrians
Abstract:
In the conventional person Re-ID setting, it is assumed that cropped images contain only the person within the bounding box for each individual. However, in a crowded scene, off-the-shelf detectors may generate bounding boxes involving multiple people, with a large proportion of background pedestrians or human occlusion. The representation extracted from such cropped images, which contain both the target and the interfering pedestrians, may include distracting information, leading to wrong retrieval results. To address this problem, this paper presents a novel deep network termed Pedestrian-Interference Suppression Network (PISNet). PISNet leverages a Query-Guided Attention Block (QGAB) to enhance the feature of the target in the gallery, under the guidance of the query. Furthermore, the accompanying Guidance Reversed Attention Module and Multi-Person Separation Loss encourage QGAB to suppress the interference of other pedestrians. Our method is evaluated on two new pedestrian-interference datasets and the results show that the proposed method performs favorably against existing Re-ID methods."



Paperid:269
Authors:Yichen Li, Kaichun Mo, Lin Shao, Minhyuk Sung, Leonidas Guibas
Title: Learning 3D Part Assembly from a Single Image
Abstract:
Autonomous assembly is a crucial capability for robots in many applications. For this task, several problems such as obstacle avoidance, motion planning, and actuator control have been extensively studied in robotics. However, when it comes to task specification, the space of possibilities remains under-explored. Towards this end, we introduce a novel problem, single-image-guided 3D part assembly, along with a learning-based solution. We study this problem in the setting of furniture assembly from a given complete set of parts and a single image depicting the entire assembled object. Multiple challenges exist in this setting, including handling ambiguity among parts (e.g., slats in a chair back and leg stretchers) and 3D pose prediction for parts and part subassemblies, whether visible or occluded. We address these issues by proposing a two-module pipeline that leverages strong 2D-3D correspondences and assembly-oriented graph message-passing to infer part relationships. In experiments with a PartNet-based synthetic benchmark, we demonstrate the effectiveness of our framework as compared with three baseline approaches."



Paperid:270
Authors:Kaichun Mo, He Wang, Xinchen Yan, Leonidas Guibas
Title: PT2PC: Learning to Generate 3D Point Cloud Shapes from Part Tree Conditions
Abstract:
Generative 3D shape modeling is a fundamental research area in computer vision and interactive computer graphics, with many real-world applications. This paper investigates the novel problem of generating a 3D point cloud geometry for a shape from a symbolic part tree representation. In order to learn such a conditional shape generation procedure in an end-to-end fashion, we propose a conditional GAN “part tree”-to-“point cloud” model (PT2PC) that disentangles the structural and geometric factors. The proposed model incorporates the part tree condition into the architecture design by passing messages top-down and bottom-up along the part tree hierarchy. Experimental results and user study demonstrate the strengths of our method in generating perceptually plausible and diverse 3D point clouds, given the part tree condition. We also propose a novel structural measure for evaluating if the generated shape point clouds satisfy the part tree conditions."



Paperid:271
Authors:Shang-Hua Gao, Yong-Qiang Tan, Ming-Ming Cheng, Chengze Lu, Yunpeng Chen, Shuicheng Yan
Title: Highly Efficient Salient Object Detection with 100K Parameters
Abstract:
Salient object detection models often demand a considerable amount of computation to make precise predictions for each pixel, making them hardly applicable on low-power devices. In this paper, we aim to relieve the contradiction between computation cost and model performance by improving network efficiency to a higher degree. We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stage multi-scale features, while reducing representation redundancy through a novel dynamic weight decay scheme. The dynamic weight decay scheme stably boosts the sparsity of parameters during training and supports a learnable number of channels for each scale in gOctConv, allowing 80% of the parameters to be reduced with a negligible performance drop. Utilizing gOctConv, we build an extremely lightweight model, namely CSNet, which achieves performance comparable to large models with only about 0.2% of their parameters (100k) on popular salient object detection benchmarks. The source code is publicly available at https://mmcheng.net/sod100k."



Paperid:272
Authors:Qili Deng, Ziling Huang, Chung-Chi Tsai, Chia-Wen Lin
Title: HardGAN: A Haze-Aware Representation Distillation GAN for Single Image Dehazing
Abstract:
In this paper, we present a Haze-Aware Representation Distillation Generative Adversarial Network named HardGAN for single-image dehazing. Unlike previous studies that model the transmission map and global atmospheric light jointly to restore a clear image, we solve this regression problem with a multi-scale neural network embedded with our proposed Haze-Aware Representation Distillation (HARD) layer. Moreover, we revisit the use of normalization layers, applying them carefully instead of stacking them directly with convolution layers as before, to avoid washing away useful information, as noted in many image quality enhancement studies. Extensive experiments on several synthetic benchmark datasets as well as the NTIRE 2020 real-world images show that our proposed multi-layer GAN-based network with HARD performs favorably against the state-of-the-art methods in terms of PSNR, SSIM, LPIPS, and human subjective evaluation."



Paperid:273
Authors:Roy Or-El, Soumyadip Sengupta, Ohad Fried, Eli Shechtman, Ira Kemelmacher-Shlizerman
Title: Lifespan Age Transformation Synthesis
Abstract:
We address the problem of single photo age progression and regression---the prediction of how a person might look in the future, or how they looked in the past. Most existing aging methods are limited to changing the texture, overlooking transformations in head shape that occur during the human aging and growth process. This limits the applicability of previous methods to aging adults into slightly older adults, and applying those methods to photos of children does not produce quality results. We propose a new multi-domain image-to-image generative adversarial network architecture whose learned latent space accurately models the continuous aging process in both directions. The network is trained on the FFHQ dataset, which we labeled for ages, gender, and semantic segmentation, where fixed age classes are used as anchors to approximate the continuous age transformation. Our framework can predict a full head portrait for ages 0-70 from a single photo, modifying both the texture and shape of the head. We demonstrate results on a wide variety of photos and datasets, and show significant improvement over the state of the art."



Paperid:274
Authors:Xingchao Peng, Yichen Li, Kate Saenko
Title: Domain2Vec: Domain Embedding for Unsupervised Domain Adaptation
Abstract:
Conventional unsupervised domain adaptation (UDA) studies the knowledge transfer between a limited number of domains. This neglects the more practical scenario where data are distributed in numerous different domains in the real world. The domain similarity between those domains is critical for domain adaptation performance. To describe and learn relations between different domains, we propose a novel Domain2Vec model to provide vectorial representations of visual domains based on joint learning of feature disentanglement and Gram matrix. To evaluate the effectiveness of our Domain2Vec model, we create two large-scale cross-domain benchmarks. The first one is TinyDA, which contains 54 domains and about one million MNIST-style images. The second benchmark is DomainBank, which is collected from 56 existing vision datasets. We demonstrate that our embedding is capable of predicting domain similarities that match our intuition about visual relations between different domains. Extensive experiments are conducted to demonstrate the power of our new datasets in benchmarking state-of-the-art multi-source domain adaptation methods, as well as the advantage of our proposed model. "



Paperid:275
Authors:Yue Yao, Liang Zheng, Xiaodong Yang, Milind Naphade, Tom Gedeon
Title: Simulating Content Consistent Vehicle Datasets with Attribute Descent
Abstract:
This paper uses a graphic engine to simulate a large amount of training data with free annotations. Between synthetic and real data, there is a two-level domain gap, i.e., content level and appearance level. While the latter has been widely studied, we focus on reducing the content gap in attributes like illumination and viewpoint. To reduce the problem complexity, we choose a smaller and more controllable application, vehicle re-identification (re-ID). We introduce a large-scale synthetic dataset VehicleX. Created in Unity, it contains 1,362 vehicles of various 3D models with fully editable attributes. We propose an attribute descent approach to let VehicleX approximate the attributes in real-world datasets. Specifically, we manipulate each attribute in VehicleX, aiming to minimize the discrepancy between VehicleX and real data in terms of the Fréchet Inception Distance (FID). This attribute descent algorithm allows content domain adaptation (DA) orthogonal to existing appearance DA methods. We mix the optimized VehicleX data with real-world vehicle re-ID datasets, and observe consistent improvement. With the augmented datasets, we report competitive accuracy. We make the dataset, engine and our codes available at https://github.com/yorkeyao/VehicleX/."
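
Attribute descent, as described above, amounts to greedy coordinate descent over a small set of discretized rendering attributes, keeping for each attribute the value that minimizes FID against real data. The sketch below treats the renderer and the FID computation as black boxes (render_with and compute_fid are hypothetical callables), so it illustrates the search loop rather than the authors' pipeline (Python):

def attribute_descent(attributes, candidate_values, render_with, compute_fid,
                      real_features, n_rounds=2):
    """Greedy coordinate descent over rendering attributes.

    attributes:       current setting, e.g. {"light_intensity": 0.5, "cam_height": 5.0}
    candidate_values: attribute name -> list of values to try
    render_with:      callable(attributes) -> features of rendered images
    compute_fid:      callable(synth_features, real_features) -> float"""
    for _ in range(n_rounds):
        for name, candidates in candidate_values.items():
            best_val, best_fid = attributes[name], float("inf")
            for val in candidates:
                trial = dict(attributes, **{name: val})
                fid = compute_fid(render_with(trial), real_features)
                if fid < best_fid:
                    best_val, best_fid = val, fid
            attributes[name] = best_val        # keep the value with the lowest FID
    return attributes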



Paperid:276
Authors:Yunzhong Hou, Liang Zheng, Stephen Gould
Title: Multiview Detection with Feature Perspective Transformation
Abstract:
Incorporating multiple camera views for detection alleviates the impact of occlusions in crowded scenes. In a multiview detection system, we need to answer two important questions. First, how should we aggregate cues from multiple views? Second, how should we aggregate information from spatially neighboring locations? To address these questions, we introduce a novel multiview detector, MVDet. During multiview aggregation, for each location on the ground, existing methods use multiview anchor box features as representation, which potentially limits performance as pre-defined anchor boxes can be inaccurate. In contrast, via feature map perspective transformation, MVDet employs anchor-free representations with feature vectors directly sampled from corresponding pixels in multiple views. For spatial aggregation, different from previous methods that require design and operations outside of neural networks, MVDet takes a fully convolutional approach with large convolutional kernels on the multiview aggregated feature map. The proposed model is end-to-end learnable and achieves 88.2% MODA on Wildtrack dataset, outperforming the state-of-the-art by 14.1%. We also provide detailed analysis of MVDet on a newly introduced synthetic dataset, MultiviewX, which allows us to control the level of occlusion. Code and MultiviewX dataset are available at https://github.com/hou-yz/MVDet/."
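
The anchor-free aggregation can be pictured as warping each view's feature map onto the ground plane with a homography and summing the warped maps before applying large-kernel convolutions. The sketch below uses a plain grid_sample-based warp; the coordinate conventions and the summation are assumptions of this illustration, not the released MVDet code (Python/PyTorch):

import torch
import torch.nn.functional as F

def warp_to_ground(feat, H_img_from_ground, ground_size):
    """Warp one view's feature map onto the ground plane.

    feat:               (1, C, H, W) image-plane feature map
    H_img_from_ground:  (3, 3) homography mapping ground coords -> pixel coords
    ground_size:        (Hg, Wg) resolution of the ground-plane grid"""
    Hg, Wg = ground_size
    _, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(Hg, dtype=torch.float32),
                            torch.arange(Wg, dtype=torch.float32), indexing="ij")
    ones = torch.ones_like(xs)
    ground = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)      # (Hg*Wg, 3)
    pix = ground @ H_img_from_ground.t()                             # project to the image
    pix = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)                   # dehomogenize
    grid = torch.stack([pix[:, 0] / (W - 1) * 2 - 1,                 # normalize to [-1, 1]
                        pix[:, 1] / (H - 1) * 2 - 1], dim=-1).reshape(1, Hg, Wg, 2)
    return F.grid_sample(feat, grid, align_corners=True)

# Aggregation across views (sketch): sum the warped maps, then run the
# large-kernel ground-plane convolutions described in the abstract.
# ground_feat = sum(warp_to_ground(f, H, (Hg, Wg)) for f, H in zip(feats, homos))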



Paperid:277
Authors:Heming Du, Xin Yu, Liang Zheng
Title: Learning Object Relation Graph and Tentative Policy for Visual Navigation
Abstract:
Target-driven visual navigation aims at navigating an agent towards a given target based on the agent's observations. In this task, it is critical to learn informative visual representations and a robust navigation policy. Aiming to improve these two components, this paper proposes three complementary techniques: an object relation graph (ORG), trial-driven imitation learning (IL), and a memory-augmented tentative policy network (TPN). ORG improves visual representation learning by integrating object relationships, including category closeness and spatial correlations, e.g., a TV usually co-occurs spatially with a remote. Both trial-driven IL and TPN underpin a robust navigation policy, instructing the agent to escape from deadlock states, such as looping or being stuck. Specifically, trial-driven IL is a type of supervision used in policy network training, while TPN, mimicking the IL supervision in unseen environments, is applied at test time. Experiments in the artificial environment AI2-Thor validate that each of the techniques is effective. When combined, the techniques bring significant improvement over baseline methods in navigation effectiveness and efficiency in unseen environments. We report 22.8% and 23.5% increases in success rate and Success weighted by Path Length (SPL), respectively. The code is available at https://github.com/xiaobaishu0097/ECCV-VN.git."



Paperid:278
Authors:Chenyang Si, Xuecheng Nie, Wei Wang, Liang Wang, Tieniu Tan, Jiashi Feng
Title: Adversarial Self-Supervised Learning for Semi-Supervised 3D Action Recognition
Abstract:
We consider the problem of semi-supervised 3D action recognition, which has rarely been explored before. Its major challenge lies in how to effectively learn motion representations from unlabeled data. Self-supervised learning (SSL) has proven very effective at learning representations from unlabeled data in the image domain. However, few effective self-supervised approaches exist for 3D action recognition, and directly applying SSL to semi-supervised learning suffers from misalignment between representations learned by SSL and by the supervised task. To address these issues, we present Adversarial Self-Supervised Learning (ASSL), a novel framework that tightly couples SSL and the semi-supervised scheme via neighbor relation exploration and adversarial learning. Specifically, we design an effective SSL scheme to improve the discrimination capability of learned representations for 3D action recognition by exploring data relations within a neighborhood. We further propose an adversarial regularization to align the feature distributions of labeled and unlabeled samples. To demonstrate the effectiveness of the proposed ASSL for semi-supervised 3D action recognition, we conduct extensive experiments on the NTU and N-UCLA datasets. The results confirm its advantageous performance over state-of-the-art semi-supervised methods in the few-label regime for 3D action recognition."



Paperid:279
Authors:Liad Pollak Zuckerman, Eyal Naor, George Pisha, Shai Bagon, Michal Irani
Title: Across Scales & Across Dimensions: Temporal Super-Resolution using Deep Internal Learning
Abstract:
When a very fast dynamic event is recorded with a low-framerate camera, the resulting video suffers from severe motion blur (due to exposure time) and motion aliasing (due to low sampling rate in time). True Temporal Super-Resolution (TSR) is more than just Temporal-Interpolation (increasing framerate). It can also recover new high temporal frequencies beyond the temporal Nyquist limit of the input video, thus resolving both motion-blur and motion aliasing – effects that temporal frame interpolation (as sophisticated as it may be) cannot undo. In this paper we propose a “Deep Internal Learning” approach for true TSR. We train a video-specific FCN on examples extracted directly from the low-framerate input video. Our method exploits the strong recurrence of small space-time patches inside a single video sequence, both within and across different spatio-temporal scales of the video. We further observe (for the first time) that small space-time patches also recur across dimensions of the video sequence – i.e., by swapping the spatial and temporal dimensions. In particular, the higher spatial resolution of video frames provides strong examples as to how to increase the temporal resolution of that video. Such internal video-specific examples give rise to strong self-supervision, requiring no data but the input video itself. This results in Zero-Shot Temporal-SR of complex videos, which removes both motion blur and motion aliasing, outperforming previous supervised methods trained on external video datasets."



Paperid:280
Authors:Binod Bhattarai, Tae-Kyun Kim
Title: Inducing Optimal Attribute Representations for Conditional GANs
Abstract:
Conditional GANs (cGANs) are widely used for translating an image from one category to another. Meaningful conditions on GANs provide greater flexibility and control over the nature of the target-domain synthetic data. Existing conditional GANs commonly encode target-domain label information as hard-coded categorical vectors in the form of 0s and 1s. The major drawbacks of such representations are the inability to encode the high-order semantic information of target categories and their relative dependencies. We propose a novel end-to-end learning framework based on Graph Convolutional Networks to learn attribute representations with which to condition the generator. The GAN losses, i.e., the discriminator and attribute classification losses, are fed back to the graph, resulting in synthetic images that are more natural and clearer with respect to the generated attributes. Moreover, prior art mostly places the conditions on the generator side of GANs, not on the discriminator side. We apply the conditions on the discriminator side as well, via multi-task learning. We enhance four state-of-the-art cGAN architectures: StarGAN, StarGAN-JNT, AttGAN and STGAN. Our extensive qualitative and quantitative evaluations on the challenging face attribute manipulation datasets CelebA, LFWA, and RaFD show that the cGANs enhanced by our methods outperform their counterparts and other conditioning methods by a large margin, in terms of both target attribute recognition rates and quality measures such as PSNR and SSIM."



Paperid:281
Authors:Yue Meng, Chung-Ching Lin, Rameswar Panda, Prasanna Sattigeri, Leonid Karlinsky, Aude Oliva, Kate Saenko, Rogerio Feris
Title: AR-Net: Adaptive Frame Resolution for Efficient Action Recognition
Abstract:
Action recognition is an open and challenging problem in computer vision. While current state-of-the-art models offer excellent recognition results, their computational expense limits their impact for many real-world applications. In this paper, we propose a novel approach, called AR-Net (Adaptive Resolution Network), that selects on-the-fly the optimal resolution for each frame conditioned on the input for efficient action recognition in long untrimmed videos. Specifically, given a video frame, a policy network is used to decide what input resolution should be used for processing by the action recognition model, with the goal of improving both accuracy and efficiency. We efficiently train the policy network jointly with the recognition model using standard back-propagation. Extensive experiments on several challenging action recognition benchmark datasets demonstrate the efficacy of our proposed approach over state-of-the-art methods. The project page can be found at https://mengyuest.github.io/AR-Net"



Paperid:282
Authors:Vladimir V. Kniaz, Vladimir A. Knyaz, Fabio Remondino, Artem Bordodymov, Petr Moshkantsev
Title: Image-to-Voxel Model Translation for 3D Scene Reconstruction and Segmentation
Abstract:
Object class, depth, and shape are instantly reconstructed by a human looking at a 2D image. While modern deep models solve each of these challenging tasks separately, they struggle to perform simultaneous scene 3D reconstruction and segmentation. We propose a single-shot image-to-semantic voxel model translation framework. We train a generator adversarially against a discriminator that verifies the objects' poses. Furthermore, trapezium-shaped voxels, volumetric residual blocks, and 2D-to-3D skip connections facilitate our model learning explicit reasoning about 3D scene structure. We collected a SemanticVoxels dataset with 116k images, ground-truth semantic voxel models, depth maps, and 6D object poses. Experiments on ShapeNet and our SemanticVoxels datasets demonstrate that our framework achieves and surpasses the state-of-the-art in the reconstruction of scenes with multiple non-rigid objects of different classes. We have made our model and dataset publicly available."



Paperid:283
Authors:Yuhua Chen, Luc Van Gool, Cordelia Schmid, Cristian Sminchisescu
Title: Consistency Guided Scene Flow Estimation
Abstract:
Consistency Guided Scene Flow Estimation (CGSF) is a self-supervised framework for the joint reconstruction of 3D scene structure and motion from stereo video. The model takes two temporal stereo pairs as input, and predicts disparity and scene flow. The model self-adapts at test time by iteratively refining its predictions. The refinement process is guided by a consistency loss, which combines stereo and temporal photo-consistency with a geometric term that couples disparity and 3D motion. To handle inherent modeling error in the consistency loss (e.g. Lambertian assumptions) and for better generalization, we further introduce a learned, output refinement network, which takes the initial predictions, the loss, and the gradient as input, and efficiently predicts a correlated output update. In multiple experiments, including ablation studies, we show that the proposed model can reliably predict disparity and scene flow in challenging imagery, achieves better generalization than the state-of-the-art, and adapts quickly and robustly to unseen domains. "



Paperid:284
Authors:Yassine Ouali, Céline Hudelot, Myriam Tami
Title: Autoregressive Unsupervised Image Segmentation
Abstract:
In this work, we propose a new unsupervised image segmentation approach based on mutual information maximization between different constructed views of the inputs. Taking inspiration from autoregressive generative models that predict the current pixel from past pixels in a raster-scan ordering created with masked convolutions, we propose different autoregressive orderings over the inputs using various forms of masked convolutions to construct different views of the data. For a given input, the model produces a pair of predictions with two valid orderings, and is then trained to maximize the mutual information between the two outputs. These outputs can either be low-dimensional features for representation learning or output clusters corresponding to semantic labels for clustering. While masked convolutions are used during training, at inference no masking is applied and we fall back to the standard convolution where the model has access to the full input. The proposed method outperforms the current state-of-the-art on unsupervised image segmentation. It is simple and easy to implement, and can be extended to other visual tasks and integrated seamlessly into existing unsupervised learning methods requiring different views of the data."
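
The mutual-information objective between the two orderings' outputs can be written directly from the empirical joint distribution of their cluster predictions. The sketch below follows the standard form of such objectives for paired discrete predictions; it is an illustration under that assumption, not the paper's exact loss (Python/PyTorch):

import torch

def negative_mutual_information(p1, p2, eps=1e-8):
    """Negative mutual information between two per-pixel cluster predictions.

    p1, p2: (N, K) softmax probabilities for the same pixels under two
            different autoregressive orderings (K clusters)."""
    joint = p1.t() @ p2 / p1.shape[0]          # (K, K) empirical joint distribution
    joint = (joint + joint.t()) / 2            # symmetrize the two views
    joint = joint.clamp(min=eps)
    pi = joint.sum(dim=1, keepdim=True)        # marginal of the first view
    pj = joint.sum(dim=0, keepdim=True)        # marginal of the second view
    mi = (joint * (joint.log() - pi.log() - pj.log())).sum()
    return -mi                                 # minimizing this maximizes MI

# p1 and p2 would come from the same network applied with two different
# masked-convolution orderings, with predictions flattened over pixels.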



Paperid:285
Authors:Yen-Chi Cheng, Hsin-Ying Lee, Min Sun, Ming-Hsuan Yang
Title: Controllable Image Synthesis via SegVAE
Abstract:
Flexible user controls are desirable for content creation and image editing. A semantic map is a commonly used intermediate representation for conditional image generation. Compared to operating on raw RGB pixels, the semantic map enables simpler user modification. In this work, we specifically target generating semantic maps given a label-set consisting of desired categories. The proposed framework, SegVAE, synthesizes semantic maps in an iterative manner using a conditional variational autoencoder. Quantitative and qualitative experiments demonstrate that the proposed model can generate realistic and diverse semantic maps. We also apply an off-the-shelf image-to-image translation model to generate realistic RGB images to better understand the quality of the synthesized semantic maps. Furthermore, we showcase several real-world image-editing applications including object removal, object insertion, and object replacement."



Paperid:286
Authors:Yuan Tian, Qin Wang, Zhiwu Huang, Wen Li, Dengxin Dai, Minghao Yang , Jun Wang, Olga Fink
Title: Off-Policy Reinforcement Learning for Efficient and Effective GAN Architecture Search
Abstract:
In this paper, we introduce a new reinforcement learning (RL) based neural architecture search (NAS) methodology for effective and efficient generative adversarial network (GAN) architecture search. The key idea is to formulate the GAN architecture search problem as a Markov decision process (MDP) for smoother architecture sampling, which enables a more effective RL-based search algorithm by targeting the potential global optimal architecture. To improve efficiency, we exploit an off-policy GAN architecture search algorithm that makes efficient use of the samples generated by previous policies. Evaluation on two standard benchmark datasets (i.e., CIFAR-10 and STL-10) demonstrates that the proposed method is able to discover highly competitive architectures for generally better image generation results with a considerably reduced computational burden: 7 GPU hours. Our code is available at https://github.com/Yuantian013/E2GAN."



Paperid:287
Authors:Mariko Isogawa, Dorian Chan, Ye Yuan, Kris Kitani, Matthew O’Toole
Title: Efficient Non-Line-of-Sight Imaging from Transient Sinograms
Abstract:
Non-line-of-sight (NLOS) imaging techniques use light that diffusely reflects off of visible surfaces (e.g., walls) to see around corners. One approach involves using pulsed lasers and ultrafast sensors to measure the travel time of multiply scattered light. Unlike existing NLOS techniques that generally require densely raster scanning points across the entirety of a relay wall, we explore a more efficient form of NLOS scanning that reduces both acquisition times and computational requirements. We propose a circular and confocal non-line-of-sight (C$^2$NLOS) scan that involves illuminating and imaging a common point, and scanning this point in a circular path along a wall. We observe that (1) these C$^2$NLOS measurements consist of a superposition of sinusoids, which we refer to as a transient sinogram, (2) there exist computationally efficient reconstruction procedures that transform these sinusoidal measurements into 3D positions of hidden scatterers or NLOS images of hidden objects, and (3) despite operating on an order of magnitude fewer measurements than previous approaches, these C$^2$NLOS scans provide sufficient information about the hidden scene to solve these different NLOS imaging tasks. We show results from both simulated and real C$^2$NLOS scans."



Paperid:288
Authors:Yulun Zhang, Zhifei Zhang, Stephen DiVerdi, Zhaowen Wang, Jose Echevarria, Yun Fu
Title: Texture Hallucination for Large-Factor Painting Super-Resolution
Abstract:
We aim to super-resolve digital paintings, synthesizing realistic details from high-resolution reference painting materials for very large scaling factors (e.g., 8$\times$, 16$\times$). However, previous single image super-resolution (SISR) methods would either lose textural details or introduce unpleasing artifacts. On the other hand, reference-based SR (Ref-SR) methods can transfer textures to some extent, but are still impractical for handling very large factors while keeping fidelity with the original input. To solve these problems, we propose an efficient high-resolution hallucination network for very large scaling factors with an efficient network structure and feature transfer. To transfer more detailed textures, we design a wavelet texture loss, which helps to enhance more high-frequency components. At the same time, to reduce the smoothing effect brought by the image reconstruction loss, we further relax the reconstruction constraint with a degradation loss which ensures the consistency between downscaled super-resolution results and low-resolution inputs. We also collected a high-resolution (e.g., 4K resolution) painting dataset, PaintHD, by considering both physical size and image resolution. We demonstrate the effectiveness of our method with extensive experiments on PaintHD by comparing with state-of-the-art SISR and Ref-SR methods."



Paperid:289
Authors:Yujun Cai, Lin Huang, Yiwei Wang, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, Xu Yang, Yiheng Zhu, Xiaohui Shen, Ding Liu, Jing Liu, Nadia Magnenat Thalmann
Title: Learning Progressive Joint Propagation for Human Motion Prediction
Abstract:
Despite the great progress in human motion prediction, it remains a challenging task due to the complicated structural dynamics of human behaviors. In this paper, we address this problem in three aspects. First, to capture the long-range spatial correlations and temporal dependencies, we apply a transformer-based architecture with the global attention mechanism. Specifically, we feed the network with the sequential joints encoded with the temporal information for spatial and temporal explorations. Second, to further exploit the inherent kinematic chains for better 3D structures, we apply a progressive-decoding strategy, which performs in a central-to-peripheral extension according to the structural connectivity. Last, in order to incorporate a general motion space for high-quality prediction, we build a memory-based dictionary, which aims to preserve the global motion patterns in training data to guide the predictions. We evaluate the proposed method on two challenging benchmark datasets (Human3.6M and CMU-Mocap). Experimental results show our superior performance compared with the state-of-the-art approaches."



Paperid:290
Authors:Bingbing Zhuang, Quoc-Huy Tran
Title: Image Stitching and Rectification for Hand-Held Cameras
Abstract:
In this paper, we derive a new differential homography that can account for the scanline-varying camera poses in Rolling Shutter (RS) cameras, and demonstrate its application to carry out RS-aware image stitching and rectification at one stroke. Despite the high complexity of RS geometry, we focus in this paper on a special yet common input --- two consecutive frames from a video stream, wherein the inter-frame motion is restricted from being arbitrarily large. This allows us to adopt a simpler differential motion model, leading to a straightforward and practical minimal solver. To deal with non-planar scenes and camera parallax in stitching, we further propose an RS-aware spatially-varying homography field following the principle of As-Projective-As-Possible (APAP). We show superior performance over state-of-the-art methods both in RS image stitching and rectification, especially for images captured by hand-held shaking cameras."



Paperid:291
Authors:Gopal Sharma, Difan Liu, Subhransu Maji, Evangelos Kalogerakis, Siddhartha Chaudhuri, Radomír Měch
Title: ParSeNet: A Parametric Surface Fitting Network for 3D Point Clouds
Abstract:
We propose a novel, end-to-end trainable, deep network called ParSeNet that decomposes a 3D point cloud into parametric surface patches, including B-spline patches as well as basic geometric primitives. ParSeNet is trained on a large-scale dataset of man-made 3D shapes and captures high-level semantic priors for shape decomposition. It handles a much richer class of primitives than prior work, and allows us to represent surfaces with higher fidelity. It also produces repeatable and robust parametrizations of a surface compared to purely geometric approaches. We present extensive experiments to validate our approach against analytical and learning-based alternatives. Our source code is publicly available at: https://hippogriff.github.io/parsenet."



Paperid:292
Authors:Ismail Elezi, Sebastiano Vascon, Alessandro Torcinovich, Marcello Pelillo, Laura Leal-Taixé
Title: The Group Loss for Deep Metric Learning
Abstract:
Deep metric learning has yielded impressive results in tasks such as clustering and image retrieval by leveraging neural networks to obtain highly discriminative feature embeddings, which can be used to group samples into different classes. Much research has been devoted to the design of smart loss functions or data mining strategies for training such networks. Most methods consider only pairs or triplets of samples within a mini-batch to compute the loss function, which is commonly based on the distance between embeddings. We propose Group Loss, a loss function based on a differentiable label-propagation method that enforces embedding similarity across all samples of a group while promoting, at the same time, low-density regions amongst data points belonging to different groups. Guided by the smoothness assumption that "similar objects should belong to the same group", the proposed loss trains the neural network for a classification task, enforcing a consistent labelling amongst samples within a class. We show state-of-the-art results on clustering and image retrieval on several datasets, and show the potential of our method when combined with other techniques such as ensembles."
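
The label-propagation step at the heart of the Group Loss can be sketched as a replicator-dynamics style update that reinforces class assignments supported by similar samples in the mini-batch. The similarity construction and anchor handling from the paper are omitted here, so this is an assumption-laden outline of the refinement loop only (Python/PyTorch):

import torch

def label_propagation_refine(probs, similarity, n_iters=3, eps=1e-8):
    """Refine per-sample class probabilities using group support.

    probs:      (B, C) initial class probabilities for the mini-batch
    similarity: (B, B) non-negative pairwise similarities between samples"""
    x = probs.clone()
    for _ in range(n_iters):
        support = similarity @ x                               # (B, C) group support
        x = x * support                                        # reinforce consistent labels
        x = x / x.sum(dim=1, keepdim=True).clamp(min=eps)      # renormalize rows
    return x

# The classification loss is then computed on the refined probabilities, so the
# whole group, not just individual samples, drives the gradient.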



Paperid:293
Authors:Brent A. Griffin, Jason J. Corso
Title: Learning Object Depth from Camera Motion and Video Object Segmentation
Abstract:
Video object segmentation, i.e., the separation of a target object from background in video, has made significant progress on real and challenging videos in recent years. To leverage this progress in 3D applications, this paper addresses the problem of learning to estimate the depth of segmented objects given some measurement of camera motion (e.g., from robot kinematics or vehicle odometry). We achieve this by, first, introducing a diverse, extensible dataset and, second, designing a novel deep network that estimates the depth of objects using only segmentation masks and uncalibrated camera movement. Our data-generation framework creates artificial object segmentations that are scaled for changes in distance between the camera and object, and our network learns to estimate object depth even with segmentation errors. We demonstrate our approach across domains using a robot camera to locate objects from the YCB dataset and a vehicle camera to locate obstacles while driving."



Paperid:294
Authors:Zhiqiang Tang, Yunhe Gao, Leonid Karlinsky, Prasanna Sattigeri, Rogerio Feris, Dimitris Metaxas
Title: OnlineAugment: Online Data Augmentation with Less Domain Knowledge
Abstract:
Data augmentation is one of the most important tools in training modern deep neural networks. Recently, great advances have been made in searching for optimal augmentation policies in the image classification domain. However, two key points related to data augmentation remain unaddressed by current methods. First, most if not all modern augmentation search methods are offline, where policy learning is isolated from policy usage. The learned policies are mostly constant throughout the training process and are not adapted to the current state of the training model. Second, the policies rely on class-preserving image processing functions, so applying current offline methods to new tasks may require domain knowledge to specify such operations. In this work, we offer an orthogonal online data augmentation scheme together with three new augmentation networks, co-trained with the target learning task. It is both more efficient, in the sense that it does not require expensive offline training when entering a new domain, and more adaptive, as it adapts to the learner state. Our augmentation networks require less domain knowledge and are easily applicable to new tasks. Extensive experiments demonstrate that the proposed scheme alone performs on par with state-of-the-art offline data augmentation methods, and improves upon the state-of-the-art in combination with those methods."



Paperid:295
Authors:Yiming Qian, Yasutaka Furukawa
Title: Learning Pairwise Inter-Plane Relations for Piecewise Planar Reconstruction
Abstract:
This paper proposes a novel single-image piecewise planar reconstruction technique that infers and enforces inter-plane relationships. Our approach takes a planar reconstruction result from an existing system, then utilizes a convolutional neural network (CNN) to (1) classify whether two planes are orthogonal or parallel; and (2) infer whether two planes are touching and, if so, where in the image. We formulate an optimization problem to refine plane parameters and employ a message passing neural network to refine plane segmentation masks by enforcing the inter-plane relations. Our qualitative and quantitative evaluations demonstrate the effectiveness of the proposed approach in terms of plane parameters and segmentation accuracy."



Paperid:296
Authors:Yukang Wang, Wei Zhou, Tao Jiang, Xiang Bai, Yongchao Xu
Title: Intra-class Feature Variation Distillation for Semantic Segmentation
Abstract:
Current state-of-the-art semantic segmentation methods usually require high computational resources for accurate segmentation. One promising way to achieve a good trade-off between segmentation accuracy and efficiency is knowledge distillation. In this paper, different from previous methods performing knowledge distillation for densely pairwise relations, we propose a novel intra-class feature variation distillation (IFVD) to transfer the intra-class feature variation (IFV) of the cumbersome model (teacher) to the compact model (student). Concretely, we compute the feature center (regarded as the prototype) of each class and characterize the IFV with the set of similarities between the feature on each pixel and its corresponding class-wise prototype. The teacher model usually learns a more robust intra-class feature representation than the student model, so the two have different IFVs. Transferring such IFV from teacher to student could make the student mimic the teacher better in terms of feature distribution, and thus improve the segmentation accuracy. We evaluate the proposed approach on three widely adopted benchmarks: Cityscapes, CamVid and Pascal VOC 2012, consistently improving state-of-the-art methods."
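
The IFV described above can be sketched as, for every pixel, the cosine similarity between its feature and the prototype of its ground-truth class, with the student trained to match the teacher's similarity map. Computing prototypes over the whole batch and using an MSE mimicking term are simplifying assumptions of this sketch (Python/PyTorch):

import torch
import torch.nn.functional as F

def intra_class_feature_variation(features, labels, num_classes):
    """Per-pixel cosine similarity to the class-wise prototype.

    features: (B, C, H, W) feature map
    labels:   (B, H, W) integer class map at the same resolution"""
    B, C, H, W = features.shape
    feats = features.permute(0, 2, 3, 1).reshape(-1, C)       # (B*H*W, C)
    labs = labels.reshape(-1)
    ifv = torch.zeros(B * H * W, device=features.device)
    for c in range(num_classes):
        mask = labs == c
        if mask.any():
            proto = feats[mask].mean(dim=0, keepdim=True)     # class prototype
            ifv[mask] = F.cosine_similarity(feats[mask], proto, dim=1)
    return ifv.reshape(B, H, W)

def ifvd_loss(student_feat, teacher_feat, labels, num_classes):
    """Make the student's intra-class feature variation mimic the teacher's."""
    s = intra_class_feature_variation(student_feat, labels, num_classes)
    t = intra_class_feature_variation(teacher_feat, labels, num_classes)
    return F.mse_loss(s, t.detach())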



Paperid:297
Authors:Junwu Weng, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Xudong Jiang, Junsong Yuan
Title: Temporal Distinct Representation Learning for Action Recognition
Abstract:
Motivated by the previous success of Two-Dimensional Convolutional Neural Networks (2D CNNs) on image recognition, researchers endeavor to leverage them to characterize videos. However, one limitation of applying 2D CNNs to analyze videos is that different frames of a video share the same 2D CNN kernels, which may result in repeated and redundant information utilization, especially in the spatial semantics extraction process, hence neglecting the critical variations among frames. In this paper, we attempt to tackle this issue in two ways. (1) We design a sequential channel filtering mechanism, the Progressive Enhancement Module (PEM), to excite the discriminative channels of features from different frames step by step, and thus avoid repeated information extraction. (2) We create a Temporal Diversity Loss (TD Loss) to force the kernels to concentrate on and capture the variations among frames rather than image regions with similar appearance. Our method is evaluated on the temporal reasoning datasets Something-Something V1 and V2, and it achieves visible improvements of 2.4% and 1.3% over the best competitor. Besides, performance improvements over the 2D-CNN-based state-of-the-art on the large-scale Kinetics dataset are also observed."
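
The Temporal Diversity Loss can be illustrated as a penalty on the similarity between features of different frames, so that the shared 2D kernels are pushed to capture frame-to-frame variation. The mean pairwise cosine form below is an illustrative choice, not the paper's exact formulation (Python/PyTorch):

import torch
import torch.nn.functional as F

def temporal_diversity_loss(frame_features):
    """Penalize similarity between per-frame features.

    frame_features: (B, T, D) one pooled feature vector per frame"""
    f = F.normalize(frame_features, dim=-1)                   # unit-norm per frame
    sim = torch.matmul(f, f.transpose(1, 2))                  # (B, T, T) cosine similarities
    T = f.shape[1]
    off_diag = sim * (1 - torch.eye(T, device=f.device))      # drop self-similarity
    return off_diag.sum() / (f.shape[0] * T * (T - 1))

# Adding this term to the recognition loss pushes features of different frames
# apart, complementing the channel-wise Progressive Enhancement Module.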



Paperid:298
Authors:Changqian Yu, Yifan Liu, Changxin Gao, Chunhua Shen, Nong Sang
Title: Representative Graph Neural Network
Abstract:
The non-local operation is widely explored to model long-range dependencies. However, the redundant computation in this operation leads to prohibitive complexity. In this paper, we present a Representative Graph (RepGraph) layer to dynamically sample a few representative features, which dramatically reduces redundancy. Instead of propagating the messages from all positions, our RepGraph layer computes the response of one node merely with a few representative nodes. The locations of the representative nodes come from a learned spatial offset matrix. The RepGraph layer is flexible to integrate into many visual architectures and to combine with other operations. Applied to semantic segmentation, without any bells and whistles, our RepGraph network competes with or performs favourably against state-of-the-art methods on three challenging benchmarks: the ADE20K, Cityscapes, and PASCAL-Context datasets. In the task of object detection, our RepGraph layer can also improve the performance on the COCO dataset compared to the non-local operation."



Paperid:299
Authors:Mikaela Angelina Uy, Jingwei Huang, Minhyuk Sung, Tolga Birdal, Leonidas Guibas
Title: Deformation-Aware 3D Model Embedding and Retrieval
Abstract:
We introduce a new problem of retrieving 3D models that are deformable to a given query shape and present a novel deep deformation-aware embedding to solve this retrieval task. 3D model retrieval is a fundamental operation for recovering a clean and complete 3D model from a noisy and partial 3D scan. However, given a finite collection of 3D shapes, even the closest model to a query may not be satisfactory. This motivates us to apply 3D model deformation techniques to adapt the retrieved model so as to better fit the query. Yet, certain restrictions are enforced in most 3D deformation techniques to preserve important features of the original model that prevent a perfect fitting of the deformed model to the query. This gap between the deformed model and the query induces asymmetric relationships among the models, which cannot be handled by typical metric learning techniques. Thus, to retrieve the best models for fitting, we propose a novel deep embedding approach that learns the asymmetric relationships by leveraging location-dependent egocentric distance fields. We also propose two strategies for training the embedding network. We demonstrate that both of these approaches outperform other baselines in our experiments with both synthetic and real data. Our project page can be found at https://deformscan2cad.github.io/."



Paperid:300
Authors:Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, Andrew Rabinovich
Title: Atlas: End-to-End 3D Scene Reconstruction from Posed Images
Abstract:
We present an end-to-end 3D scene reconstruction method that directly regresses a truncated signed distance function (TSDF) from a set of posed RGB images. Traditional approaches to 3D reconstruction rely on an intermediate representation of depth maps prior to estimating a full 3D model of a scene. We hypothesize that a direct regression to 3D is more effective. A 2D CNN extracts features from each image independently, which are then back-projected and accumulated into a voxel volume using the camera intrinsics and extrinsics. After accumulation, a 3D CNN refines the accumulated features and predicts the TSDF values. Additionally, semantic segmentation of the 3D model is obtained without significant computation. This approach is evaluated on the ScanNet dataset, where we significantly outperform state-of-the-art baselines (deep multiview stereo followed by traditional TSDF fusion) both quantitatively and qualitatively. We compare our 3D semantic segmentation to prior methods that use a depth sensor, since no previous work attempts the problem with only RGB input."
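The back-projection step described above can be sketched roughly as follows; the tensor shapes, the pinhole projection convention and the mean pooling over views are our assumptions, not the Atlas release.

import torch

def backproject(feats, K, T_cw, origin, voxel_size, dims):
    # feats: (V, C, H, W) per-view 2D features; K: (V, 3, 3) intrinsics;
    # T_cw: (V, 4, 4) world-to-camera extrinsics; origin: (3,) world position of voxel (0, 0, 0)
    V, C, H, W = feats.shape
    X, Y, Z = dims
    xs, ys, zs = torch.meshgrid(torch.arange(X), torch.arange(Y), torch.arange(Z), indexing="ij")
    pts = torch.stack([xs, ys, zs], dim=-1).reshape(-1, 3).float() * voxel_size + origin
    pts_h = torch.cat([pts, torch.ones(len(pts), 1)], dim=1)             # homogeneous, (N, 4)
    volume = torch.zeros(C, X * Y * Z)
    count = torch.zeros(1, X * Y * Z)
    for i in range(V):
        cam = (T_cw[i] @ pts_h.T)[:3]                                    # voxel centers in camera frame
        pix = K[i] @ cam
        z = pix[2].clamp(min=1e-6)
        u, v = (pix[0] / z).round().long(), (pix[1] / z).round().long()
        valid = (cam[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        idx = valid.nonzero(as_tuple=True)[0]
        volume[:, idx] += feats[i][:, v[idx], u[idx]]                    # accumulate image features
        count[:, idx] += 1
    return (volume / count.clamp(min=1)).reshape(C, X, Y, Z)             # average over observing views

# toy usage with identity extrinsics and a 2 m-wide volume in front of the cameras
feats = torch.randn(2, 8, 60, 80)
K = torch.tensor([[100.0, 0.0, 40.0], [0.0, 100.0, 30.0], [0.0, 0.0, 1.0]]).expand(2, 3, 3)
T_cw = torch.eye(4).expand(2, 4, 4)
vol = backproject(feats, K, T_cw, torch.tensor([-1.0, -1.0, 0.5]), 0.05, (40, 40, 40))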



Paperid:301
Authors:Poojan Oza, Hien V. Nguyen, Vishal M. Patel
Title: Multiple Class Novelty Detection Under Data Distribution Shift
Abstract:
Novelty detection models learn a decision boundary around the multiple categories of a given dataset. This helps such models detect novel classes encountered during testing. However, in many cases, the test data distribution can be different from that of the training data. For such cases, novelty detection models risk detecting a known class as novel due to the dataset distribution shift. This scenario is often ignored when working with novelty detection. To this end, we consider the problem of multiple class novelty detection under dataset distribution shift to improve the novelty detection performance. Firstly, we discuss the problem setting in detail and show how it affects the performance of current novelty detection methods. Secondly, we show that one could improve those novelty detection methods with a simple integration of a domain adversarial loss. Finally, we propose a method which brings together techniques from novelty detection and domain adaptation to improve the generalization of multiple class novelty detection across different domains. We evaluate the proposed method on digits and object recognition datasets and show that it provides improvements over the baseline methods."



Paperid:302
Authors:Chung-Sheng Lai, Zunzhi You, Ching-Chun Huang, Yi-Hsuan Tsai, Wei-Chen Chiu
Title: Colorization of Depth Map via Disentanglement
Abstract:
Vision perception is one of the most important components for a computer or robot to understand the surrounding scene and achieve autonomous applications. However, most vision models are based on RGB sensors, which in general are vulnerable to insufficient lighting conditions. In contrast, the depth camera, another widely used visual sensor, is capable of perceiving 3D information and is more robust to the lack of illumination, but it cannot capture the appearance details of the surrounding environment that RGB cameras provide. To make RGB-based vision models workable in low-lighting scenarios, prior methods focus on learning the colorization of depth maps captured by depth cameras, such that the vision models can still achieve reasonable performance on colorized depth maps. However, the colorization produced in this manner is usually unrealistic and constrained to the specific vision model, thus being hard to generalize to other tasks. In this paper, we propose a depth map colorization method via disentangling appearance and structure factors, so that our model can 1) learn depth-invariant appearance features from an appearance reference and 2) generate colorized images by combining a given depth map and the appearance feature obtained from any reference. We conduct extensive experiments to show that our colorization results are more realistic and diverse in comparison to several image translation baselines."



Paperid:303
Authors:Johanna Wald, Torsten Sattler, Stuart Golodetz, Tommaso Cavallari, Federico Tombari
Title: Beyond Controlled Environments: 3D Camera Re-Localization in Changing Indoor Scenes
Abstract:
Long-term camera re-localization is an important task with numerous computer vision and robotics applications. Whilst various outdoor benchmarks exist that target lighting, weather and seasonal changes, far less attention has been paid to appearance changes that occur indoors. This has led to a mismatch between popular indoor benchmarks, which focus on static scenes, and indoor environments that are of interest for many real-world applications. In this paper, we adapt 3RScan -- a recently introduced indoor RGB-D dataset designed for object instance re-localization -- to create RIO10, a new long-term camera re-localization benchmark focused on indoor scenes. We propose new metrics for evaluating camera re-localization and explore how state-of-the-art camera re-localizers perform according to these metrics. We also examine in detail how different types of scene change affect the performance of different methods, based on novel ways of detecting such changes in a given RGB-D frame. Our results clearly show that long-term indoor re-localization is an unsolved problem. Our benchmark and tools are publicly available at waldjohannau.github.io/RIO10"



Paperid:304
Authors:Ahmed Samy Nassar, Stefano D’Aronco, Sébastien Lefèvre, Jan D. Wegner
Title: GeoGraph: Graph-based multi-view object detection with geometric cues end-to-end
Abstract:
In this paper, we propose an end-to-end learnable approach that detects static urban objects from multiple views, re-identifies instances, and finally assigns a geographic position per object. Our method relies on a Graph Neural Network (GNN) to detect all objects and output their geographic positions, given images and approximate camera poses as input. Our GNN simultaneously models relative pose and image evidence and is further able to deal with an arbitrary number of input views. Our method is robust to occlusion, similar appearance of neighboring objects, and severe changes in viewpoint by jointly reasoning about visual image appearance and relative pose. Experimental evaluation on two challenging, large-scale datasets and comparison with state-of-the-art methods show significant and systematic improvements both in accuracy and efficiency, with a 2-6% gain in detection and re-ID average precision as well as an 8x reduction in training time."



Paperid:305
Authors:Pengwan Yang, Vincent Tao Hu, Pascal Mettes, Cees G. M. Snoek
Title: Localizing the Common Action Among a Few Videos
Abstract:
This paper strives to localize the temporal extent of an action in a long untrimmed video. Whereas existing work leverages many examples with their start, their end, and/or the class of the action during training time, we propose few-shot common action localization. The start and end of an action in a long untrimmed video are determined based on just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module to simultaneously complement the representation of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module to weigh the importance of different support videos. Evaluation of few-shot common action localization in untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal."



Paperid:306
Authors:Moshe Lichtenstein, Prasanna Sattigeri, Rogerio Feris, Raja Giryes, Leonid Karlinsky
Title: TAFSSL: Task-Adaptive Feature Sub-Space Learning for few-shot classification
Abstract:
The field of Few-Shot Learning (FSL), or learning from very few (typically $1$ or $5$) examples per novel class (unseen during training), has received a lot of attention and seen significant performance advances in the recent literature. While a number of techniques have been proposed for FSL, several factors have emerged as most important for FSL performance, awarding SOTA results even to the simplest of techniques. These are: the backbone architecture (bigger is better), type of pre-training on the base classes (meta-training vs regular multi-class, currently regular wins), quantity and diversity of the base classes set (the more the merrier, resulting in richer and better adaptive features), and the use of self-supervised tasks during pre-training (serving as a proxy for increasing the diversity of the base set). In this paper we propose yet another simple technique that is important for few-shot learning performance - a search for a compact feature sub-space that is discriminative for a given few-shot test task. We show that the Task-Adaptive Feature Sub-Space Learning (TAFSSL) can significantly boost the performance in FSL scenarios when some additional unlabeled data accompanies the novel few-shot task, be it either the set of unlabeled queries (transductive FSL) or some additional set of unlabeled data samples (semi-supervised FSL). Specifically, we show that on the challenging miniImageNet and tieredImageNet benchmarks, TAFSSL can improve the current state-of-the-art in both transductive and semi-supervised FSL settings by more than $5\%$, while increasing the benefit of using unlabeled data in FSL to above $10\%$ performance gain."
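One simple instantiation of such a task-adaptive sub-space, used here purely as an illustration (the feature dimension, sub-space size and nearest-prototype classifier are our assumptions), is a PCA basis computed from the pooled support and query/unlabeled features of the episode:

import numpy as np

def task_subspace(feats, dim):
    # principal directions of the task's own features (support + unlabeled/query)
    mean = feats.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(feats - mean, full_matrices=False)
    return mean, vt[:dim]

def classify_episode(support, support_y, query, dim=5):
    mean, basis = task_subspace(np.concatenate([support, query], axis=0), dim)
    s_proj, q_proj = (support - mean) @ basis.T, (query - mean) @ basis.T
    classes = np.unique(support_y)
    protos = np.stack([s_proj[support_y == c].mean(axis=0) for c in classes])
    dists = ((q_proj[:, None] - protos[None]) ** 2).sum(-1)              # query-to-prototype distances
    return classes[dists.argmin(axis=1)]

# toy 5-way 1-shot episode with random 640-d backbone features
rng = np.random.default_rng(0)
pred = classify_episode(rng.normal(size=(5, 640)), np.arange(5), rng.normal(size=(15, 640)))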



Paperid:307
Authors:Tackgeun You, Bohyung Han
Title: Traffic Accident Benchmark for Causality Recognition
Abstract:
We propose a brand new benchmark for analyzing causality in traffic accident videos by decomposing an accident into a pair of events, cause and effect. We collect videos containing traffic accident scenes and annotate cause and effect events for each accident with their temporal intervals and semantic labels; such annotations are not available in existing datasets for the accident anticipation task. Our dataset has the following two advantages over existing ones, which would facilitate practical research on causality analysis. First, the decomposition of an accident into cause and effect events provides atomic cues for reasoning in a complex environment and planning future actions. Second, the prediction of cause and effect in an accident makes a system more interpretable to humans, which mitigates the ambiguity of legal liabilities among agents engaged in the accident. Using the proposed dataset, we analyze accidents by localizing the temporal intervals of their causes and effects and classifying the semantic labels of the accidents. The dataset as well as the implementations of baseline models are available in the code repository."



Paperid:308
Authors:Zitong Yu, Xiaobai Li, Xuesong Niu, Jingang Shi, Guoying Zhao
Title: Face Anti-Spoofing with Human Material Perception
Abstract:
Face anti-spoofing (FAS) plays a vital role in securing face recognition systems from presentation attacks. Most existing FAS methods capture various cues (e.g., texture, depth and reflection) to distinguish live faces from spoofing faces. All these cues are based on the discrepancy among physical materials (e.g., skin, glass, paper and silicone). In this paper we rephrase face anti-spoofing as a material recognition problem and combine it with classical human material perception, intending to extract discriminative and robust features for FAS. To this end, we propose the Bilateral Convolutional Networks (BCN), which is able to capture intrinsic material-based patterns via aggregating multi-level bilateral macro- and micro- information. Furthermore, a Multi-level Feature Refinement Module (MFRM) and multi-head supervision are utilized to learn more robust features. Comprehensive experiments are performed on six benchmark datasets, and the proposed method achieves superior performance in both intra- and cross-dataset testing. One highlight is that we achieve an overall 11.3$\pm$9.5\% EER for cross-type testing on the SiW-M dataset, which significantly outperforms previous results. We hope this work will facilitate future cooperation between the FAS and material communities."



Paperid:309
Authors:Huikun Bi, Ruisi Zhang, Tianlu Mao, Zhigang Deng, Zhaoqi Wang
Title: How Can I See My Future? FvTraj: Using First-person View for Pedestrian Trajectory Prediction
Abstract:
This work presents a novel First-person View based Trajectory predicting model (FvTraj) to estimate the future trajectories of pedestrians in a scene given their observed trajectories and the corresponding first-person view images. First, we render first-person view images using our in-house built First-person View Simulator (FvSim), given the ground-level 2D trajectories. Then, based on multi-head attention mechanisms, we design a social-aware attention module to model social interactions between pedestrians, and a view-aware attention module to capture the relations between historical motion states and visual features from the first-person view images. Our results show that the dynamic scene contexts with ego-motions captured by first-person view images via FvSim are valuable and effective for trajectory prediction. Using these simulated first-person view images, our well-structured FvTraj model achieves state-of-the-art performance."



Paperid:310
Authors:Yunpeng Zhai, Qixiang Ye, Shijian Lu, Mengxi Jia, Rongrong Ji, Yonghong Tian
Title: Multiple Expert Brainstorming for Domain Adaptive Person Re-identification
Abstract:
Often the best-performing deep neural models are ensembles of multiple base-level networks; nevertheless, ensemble learning for domain adaptive person re-ID remains unexplored. In this paper, we propose a multiple expert brainstorming network (MEB-Net) for domain adaptive person re-ID, opening up a promising direction for the model ensemble problem under unsupervised conditions. MEB-Net adopts a mutual learning strategy, where multiple networks with different architectures are pre-trained within a source domain as expert models equipped with specific features and knowledge, while the adaptation is then accomplished through brainstorming (mutual learning) among the expert models. MEB-Net accommodates the heterogeneity of experts learned with different architectures and enhances the discrimination capability of the adapted re-ID model by introducing a regularization scheme on the authority of experts. Extensive experiments on large-scale datasets (Market-1501 and DukeMTMC-reID) demonstrate the superior performance of MEB-Net over the state-of-the-art methods."



Paperid:311
Authors:Boyang Deng, JP Lewis, Timothy Jeruzalski, Gerard Pons-Moll, Geoffrey Hinton, Mohammad Norouzi, Andrea Tagliasacchi
Title: NASA Neural Articulated Shape Approximation
Abstract:
Efficient representation of articulated objects such as human bodies is an important problem in computer vision and graphics. To efficiently simulate deformation, existing approaches represent 3D objects using polygonal meshes and deform them using skinning techniques. This paper introduces neural articulated shape approximation (NASA), an alternative framework that enables efficient representation of articulated deformable objects using neural indicator functions that are conditioned on pose. Occupancy testing using NASA is straightforward, circumventing the complexity of meshes and the issue of water-tightness. We demonstrate the effectiveness of NASA for 3D tracking applications, and discuss other potential extensions."



Paperid:312
Authors:Zeyu Wang, Berthy Feng, Karthik Narasimhan, Olga Russakovsky
Title: Towards Unique and Informative Captioning of Images
Abstract:
Despite considerable progress, state-of-the-art image captioning models produce generic captions, leaving out important image details. Furthermore, these systems may even misrepresent the image in order to produce a simpler caption consisting of common concepts. In this paper, we first analyze both modern captioning systems and evaluation metrics through empirical experiments to quantify these phenomena. We find that modern captioning systems return higher likelihoods for incorrect distractor sentences compared to ground truth captions, and that evaluation metrics like SPICE can be 'topped' using simple captioning systems relying on object detectors. Inspired by these observations, we design a new metric (SPICE-U) by introducing a notion of uniqueness over the concepts generated in a caption. We show that SPICE-U is better correlated with human judgements compared to SPICE, and effectively captures notions of diversity and descriptiveness. Finally, we also demonstrate a general technique to improve any existing captioning model -- by using mutual information as a re-ranking objective during decoding. Empirically, this results in more unique and informative captions, and improves three different state-of-the-art models on SPICE-U as well as on the average score over existing metrics."
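The re-ranking idea can be sketched as scoring each beam candidate by a pointwise mutual-information criterion, log p(caption | image) - lambda * log p(caption); the two scoring functions below are placeholders for a conditional captioner and an unconditional language model, and lambda is a hypothetical weight, not a value from the paper.

from typing import Callable, List, Tuple

def mi_rerank(candidates: List[str],
              log_p_given_image: Callable[[str], float],
              log_p_marginal: Callable[[str], float],
              lam: float = 1.0) -> List[Tuple[str, float]]:
    # score = log p(c | image) - lam * log p(c); generic captions get a high marginal likelihood,
    # so subtracting it favours captions that are specific to this image
    scored = [(c, log_p_given_image(c) - lam * log_p_marginal(c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# toy usage with hand-crafted log-probabilities
caps = ["a dog on grass", "a brown dog chasing a frisbee on green grass"]
cond = {"a dog on grass": -4.0, "a brown dog chasing a frisbee on green grass": -7.0}
marg = {"a dog on grass": -3.0, "a brown dog chasing a frisbee on green grass": -9.5}
print(mi_rerank(caps, cond.__getitem__, marg.__getitem__, lam=0.5))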



Paperid:313
Authors:Jong-Chyi Su, Subhransu Maji, Bharath Hariharan
Title: When Does Self-supervision Improve Few-shot Learning?
Abstract:
We investigate the role of self-supervised learning (SSL) in the context of few-shot learning. Although recent research has shown the benefits of SSL on large unlabeled datasets, its utility on small datasets is relatively unexplored. We find that SSL reduces the relative error rate of few-shot meta-learners by 4%-27%, even when the datasets are small and only images within the datasets are utilized. The improvements are greater when the training set is smaller or the task is more challenging. Although the benefits of SSL may increase with larger training sets, we observe that SSL can hurt the performance when the distributions of images used for meta-learning and SSL are different. We conduct a systematic study by varying the degree of domain shift and analyzing the performance of several meta-learners on a multitude of domains. Based on this analysis, we present a technique that automatically selects images for SSL from a large, generic pool of unlabeled images for a given dataset, providing further improvements."



Paperid:314
Authors:Iacopo Masi, Aditya Killekar, Royston Marian Mascarenhas, Shenoy Pratik Gurudatt, Wael AbdAlmageed
Title: Two-branch Recurrent Network for Isolating Deepfakes in Videos
Abstract:
The current spike of hyper-realistic faces artificially generated using deepfakes calls for media forensics solutions that are tailored to video streams and work reliably with a low false alarm rate at the video level. We present a method for deepfake detection based on a two-branch network structure that isolates digitally manipulated faces by learning to amplify artifacts while suppressing the high-level face content. Unlike current methods that extract spatial frequencies as a preprocessing step, we propose a two-branch structure: one branch propagates the original information, while the other branch suppresses the face content yet amplifies multi-band frequencies using a Laplacian of Gaussian (LoG) as a bottleneck layer. To better isolate manipulated faces, we derive a novel cost function that, unlike regular classification, compresses the variability of natural faces and pushes away the unrealistic facial samples in the feature space. Our two novel components show promising results on the FaceForensics++, Celeb-DF, and Facebook's DFDC preview benchmarks, when compared to prior work. We then offer a full, detailed ablation study of our network architecture and cost function. Finally, although achieving strong results at a very low false alarm rate remains challenging, our study shows that we can achieve good video-level performance in terms of AUC when cross-testing."



Paperid:315
Authors:Qing Liu, Orchid Majumder, Alessandro Achille, Avinash Ravichandran, Rahul Bhotika, Stefano Soatto
Title: Incremental Few-Shot Meta-Learning via Indirect Discriminant Alignment
Abstract:
We propose a method to train a model so it can learn new classification tasks while improving with each task solved. This amounts to combining meta-learning with incremental learning. Different tasks can have disjoint classes, so one cannot directly align different classifiers as done in model distillation. On the other hand, simply aligning features shared by all classes does not allow the base model sufficient flexibility to evolve to solve new tasks. We therefore indirectly align features relative to a minimal set of anchor classes. Such indirect discriminant alignment (IDA) adapts a new model to old classes without the need to re-process old data, while leaving maximum flexibility for the model to adapt to new tasks. This process enables incrementally improving the model by processing multiple learning episodes, each representing a different learning task, even with few training examples. Experiments on few-shot learning benchmarks show that this incremental approach performs favorably compared to training the model with the entire dataset at once. "



Paperid:316
Authors:Jiahui Yu, Pengchong Jin, Hanxiao Liu, Gabriel Bender, Pieter-Jan Kindermans, Mingxing Tan, Thomas Huang, Xiaodan Song, Ruoming Pang, Quoc Le
Title: BigNAS: Scaling Up Neural Architecture Search with Big Single-Stage Models
Abstract:
Neural architecture search (NAS) methods have shown promising results discovering models that are both accurate and fast. For NAS, training a one-shot model has become a popular strategy to rank the relative quality of different architectures (child models) using a single set of shared weights. However, while one-shot model weights can effectively rank different network architectures, the absolute accuracies from these shared weights are typically far below those obtained from stand-alone training. To compensate, existing methods assume that the weights must be retrained, finetuned, or otherwise post-processed after the search is completed. These steps significantly increase the compute requirements and complexity of the architecture search and model deployment. In this work, we propose BigNAS, an approach that challenges the conventional wisdom that post-processing of the weights is necessary to get good prediction accuracies. Without extra retraining or post-processing steps, we are able to train a single set of shared weights on ImageNet and use these weights to obtain child models whose sizes range from 200 to 1000 MFLOPs. Our discovered model family, BigNASModels, achieve top-1 accuracies ranging from 76.5% to 80.9%, surpassing state-of-the-art models in this range including EfficientNets and Once-for-All networks without extra retraining or post-processing. We present an ablation study and analysis to further understand the proposed BigNASModels."



Paperid:317
Authors:Sheng Jin, Wentao Liu, Enze Xie, Wenhai Wang, Chen Qian, Wanli Ouyang, Ping Luo
Title: Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation
Abstract:
Multi-person pose estimation is challenging because it localizes body keypoints for multiple persons simultaneously. Previous methods can be divided into two streams, i.e., top-down and bottom-up methods. The top-down methods localize keypoints after human detection, while the bottom-up methods localize keypoints directly and then cluster/group them for different persons, which are generally more efficient than top-down methods. However, in existing bottom-up methods, the keypoint grouping is usually solved independently from keypoint detection, making them not end-to-end trainable and leading to sub-optimal performance. In this paper, we investigate a new perspective of human part grouping and reformulate it as a graph clustering task. In particular, we propose a novel differentiable Hierarchical Graph Grouping (HGG) method to learn graph grouping in the bottom-up multi-person pose estimation task. Moreover, HGG is easily embedded into mainstream bottom-up methods. It takes human keypoint candidates as graph nodes and clusters keypoints in a multi-layer graph neural network model. The modules of HGG can be trained end-to-end with the keypoint detection network and are able to supervise the grouping process in a hierarchical manner. To improve the discrimination of the clustering, we add a set of edge discriminators and macro-node discriminators. Extensive experiments on both the COCO and OCHuman datasets demonstrate that the proposed method improves the performance of bottom-up pose estimation methods."



Paperid:318
Authors:Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen
Title: Global Distance-distributions Separation for Unsupervised Person Re-identification
Abstract:
Supervised person re-identification (ReID) often has poor scalability and usability in real-world deployments due to domain gaps and the lack of annotations for the target domain data. Unsupervised person ReID through domain adaptation is attractive yet challenging. Existing unsupervised ReID approaches often fail in correctly identifying the positive samples and negative samples through the distance-based matching/ranking. The two distributions of distances for positive sample pairs (Pos-distr) and negative sample pairs (Neg-distr) are often not well separated, having large overlap. To address this problem, we introduce a global distance-distributions separation (GDS) constraint over the two distributions to encourage the clear separation of positive and negative samples from a global view. We model the two global distance distributions as Gaussian distributions and push apart the two distributions while encouraging their sharpness in the unsupervised training process. Particularly, to model the distributions from a global view and facilitate the timely updating of the distributions and the GDS related losses, we leverage a momentum update mechanism for building and maintaining the distribution parameters (mean and variance) and calculate the loss on the fly during the training. Distribution-based hard mining is proposed to further promote the separation of the two distributions. We validate the effectiveness of the GDS constraint in unsupervised ReID networks. Extensive experiments on multiple ReID benchmark datasets show our method leads to significant improvement over the baselines and achieves the state-of-the-art performance."
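A minimal sketch of the constraint described above maintains momentum-updated Gaussian statistics of positive- and negative-pair distances and penalizes their overlap; the specific margin formulation and the sharpness term below are our assumptions, not the paper's exact losses.

import torch

class GaussianStats:
    # momentum-updated mean / std of one distance distribution
    def __init__(self, momentum=0.9):
        self.mu, self.std, self.m = torch.tensor(0.5), torch.tensor(0.1), momentum

    def update(self, d):
        mu = self.m * self.mu + (1 - self.m) * d.mean()        # differentiable w.r.t. the current batch
        std = self.m * self.std + (1 - self.m) * d.std()
        self.mu, self.std = mu.detach(), std.detach()          # stored statistics carry no gradient
        return mu, std

def gds_loss(pos_d, neg_d, pos_stats, neg_stats, margin=1.0):
    mu_p, std_p = pos_stats.update(pos_d)
    mu_n, std_n = neg_stats.update(neg_d)
    gap = (mu_n - mu_p) / (std_p + std_n).clamp(min=1e-6)      # separation measured in units of spread
    return torch.relu(margin - gap) + 0.1 * (std_p + std_n)    # push apart and keep distributions sharp

pos_stats, neg_stats = GaussianStats(), GaussianStats()
pos_d = torch.rand(128, requires_grad=True) * 0.5               # toy positive-pair distances
neg_d = 0.5 + torch.rand(128, requires_grad=True) * 0.5         # toy negative-pair distances
gds_loss(pos_d, neg_d, pos_stats, neg_stats).backward()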



Paperid:319
Authors:Gyeongsik Moon, Kyoung Mu Lee
Title: I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image
Abstract:
Most of the previous image-based 3D human pose and mesh estimation methods estimate parameters of the human mesh model from an input image. However, directly regressing the parameters from the input image is a highly non-linear mapping because it breaks the spatial relationship between pixels in the input image. In addition, it cannot model the prediction uncertainty, which can make training harder. To resolve the above issues, we propose I2L-MeshNet, an image-to-lixel (line+pixel) prediction network. The proposed I2L-MeshNet predicts the per-lixel likelihood on 1D heatmaps for each mesh vertex coordinate instead of directly regressing the parameters. Our lixel-based 1D heatmap preserves the spatial relationship in the input image and models the prediction uncertainty. We show that the proposed I2L-MeshNet significantly outperforms previous methods while providing visually pleasant mesh estimation results. The code is publicly available at https://github.com/mks0601/I2L-MeshNet_RELEASE."
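A per-lixel likelihood can be decoded into a continuous coordinate with a 1D soft-argmax, sketched below; the heatmap resolution and vertex count are placeholders, not the exact I2L-MeshNet configuration.

import torch
import torch.nn.functional as F

def soft_argmax_1d(heatmap):
    # heatmap: (B, V, L) raw scores over L discretized positions (lixels) per vertex
    prob = F.softmax(heatmap, dim=-1)                           # per-lixel likelihood
    pos = torch.arange(heatmap.shape[-1], dtype=prob.dtype)
    return (prob * pos).sum(dim=-1)                             # (B, V) expected coordinate, differentiable

heat_x = torch.randn(2, 6890, 64)                               # e.g. one x-axis heatmap per mesh vertex
coord_x = soft_argmax_1d(heat_x)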



Paperid:320
Authors:Hongsuk Choi, Gyeongsik Moon, Kyoung Mu Lee
Title: Pose2Mesh: Graph Convolutional Network for 3D Human Pose and Mesh Recovery from a 2D Human Pose
Abstract:
Most of the recent deep learning-based 3D human pose and mesh estimation methods regress the pose and shape parameters of human mesh models, such as SMPL and MANO, from an input image. The first weakness of these methods is overfitting to image appearance, due to the domain gap between the training data captured in controlled settings such as a lab and in-the-wild data at inference time. The second weakness is that the estimation of the pose parameters is quite challenging due to the representation issues of 3D rotations. To overcome the above weaknesses, we propose Pose2Mesh, a novel graph convolutional neural network (GraphCNN)-based system that estimates the 3D coordinates of human mesh vertices directly from the 2D human pose. The 2D human pose as input provides essential human body articulation information without image appearance. Also, the proposed system avoids the representation issues, while fully exploiting the mesh topology using GraphCNN in a coarse-to-fine manner. We show that our Pose2Mesh significantly outperforms the previous 3D human pose and mesh estimation methods on various benchmark datasets. The code is publicly available at https://github.com/hongsukchoi/Pose2Mesh_RELEASE."



Paperid:321
Authors:Mingzhu Zhu, Zhang Gao, Junzhi Yu, Bingwei He, Jiantao Liu
Title: ALRe: Outlier Detection for Guided Refinement
Abstract:
Guided refinement is a popular procedure in various image post-processing applications. It produces an output image based on input and guided images. Input images are usually flawed estimates containing various kinds of noise and outliers, which undermine the edge consistency between the input and guided images. As improvements, they are refined into output images that retain intensities similar to the input images while following the consistent edges of the guided images. However, outliers are usually untraceable and simply treated as zero-mean noise, limiting the quality of such refinement. In this paper, we propose a general outlier detection method for guided refinement. We assume a local linear relationship between the output and guided images to express the expected edge consistency, based on which the outlier likelihoods of input pixels are measured. The metric is termed ALRe (anchored linear residual) since it is essentially the residual of a local linear regression with an equality constraint exerted on the measured pixel. Valuable properties of ALRe are discussed. Its effectiveness is demonstrated by applications and experiments."
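One way to read the anchored linear residual, sketched here under our own assumptions (uniform window weights, scalar guided image), is to fit p ≈ a*g + b over a local window subject to the equality constraint that the fit passes exactly through the measured centre pixel, and to report the RMS residual as that pixel's outlier likelihood:

import numpy as np

def alre(p, g, radius=3, eps=1e-8):
    # p: input (flawed) image, g: guided image, both (H, W) float arrays
    h, w = p.shape
    out = np.zeros_like(p)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            dg = g[y0:y1, x0:x1].ravel() - g[y, x]              # anchor the fit at the centre pixel
            dp = p[y0:y1, x0:x1].ravel() - p[y, x]
            a = (dg * dp).sum() / ((dg * dg).sum() + eps)       # constrained least-squares slope
            r = a * dg - dp
            out[y, x] = np.sqrt((r * r).mean())                 # anchored linear residual
    return out

# toy usage: an injected outlier receives a high ALRe score
rng = np.random.default_rng(0)
guided = rng.normal(size=(32, 32))
flawed = 0.8 * guided + 0.05 * rng.normal(size=(32, 32))
flawed[10, 10] += 3.0
score = alre(flawed, guided)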



Paperid:322
Authors:Yifan Yang, Guorong Li, Zhe Wu, Li Su, Qingming Huang, Nicu Sebe
Title: Weakly-Supervised Crowd Counting Learns from Sorting rather than Locations
Abstract:
In crowd counting datasets, the location labels are costly, yet they are not reflected in the evaluation metrics. Besides, existing multi-task approaches employ high-level tasks to improve counting accuracy. This research tendency increases the demand for more annotations. In this paper, we propose a weakly-supervised counting network, which directly regresses the crowd numbers without location supervision. Moreover, we train the network to count by exploiting the relationship among the images. We propose a soft-label sorting network along with the counting network, which sorts the given images by their crowd numbers. The sorting network explicitly drives the shared backbone CNN model to acquire density-sensitive ability. Therefore, the proposed method improves the counting accuracy by utilizing the information hidden in crowd numbers, rather than learning from extra labels, such as locations and perspectives. We evaluate our proposed method on three crowd counting datasets, and its performance compares favorably against the fully supervised state-of-the-art approaches."
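The sorting supervision can be illustrated with a pairwise ranking loss on the predicted counts of an image batch, using only the order of the ground-truth counts; this is an illustrative reading, with the margin and the L1 counting term as our assumptions rather than the paper's exact objective.

import torch
import torch.nn.functional as F

def counting_and_sorting_loss(pred_counts, gt_counts, margin=0.0):
    count_loss = F.l1_loss(pred_counts, gt_counts)                        # direct count regression
    i, j = torch.triu_indices(len(gt_counts), len(gt_counts), offset=1)   # all image pairs in the batch
    order = torch.sign(gt_counts[i] - gt_counts[j])                       # +1 / -1 ordering labels
    rank_loss = F.margin_ranking_loss(pred_counts[i], pred_counts[j], order, margin=margin)
    return count_loss + rank_loss

pred = torch.tensor([12.0, 85.0, 40.0], requires_grad=True)
gt = torch.tensor([10.0, 90.0, 35.0])
counting_and_sorting_loss(pred, gt).backward()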



Paperid:323
Authors:Wen Ji, Kelei He, Jing Huo, Zheng Gu, Yang Gao
Title: Unsupervised Domain Attention Adaptation Network for Caricature Attribute Recognition
Abstract:
Caricature attributes provide distinctive facial features to help research in Psychology and Neuroscience. However, unlike facial photo attribute datasets, which have large quantities of annotated images, annotations of caricature attributes are rare. To facilitate research on attribute learning of caricatures, we propose a caricature attribute dataset, namely WebCariA. Moreover, to utilize models trained on face attributes, we propose a novel unsupervised domain adaptation framework for cross-modality (i.e., photos to caricatures) attribute recognition, with an integrated inter- and intra-domain consistency learning scheme. Specifically, the inter-domain consistency learning scheme consists of an image-to-image translator that first fills the domain gap between photos and caricatures by generating intermediate image samples, and a label consistency learning module that aligns their semantic information. The intra-domain consistency learning scheme integrates a common feature consistency learning module with a novel attribute-aware attention-consistency learning module for a more efficient alignment. We conduct an extensive ablation study to show the effectiveness of the proposed method, which also outperforms state-of-the-art methods by a margin. The implementation of the proposed method is available at https://github.com/KeleiHe/DAAN."



Paperid:324
Authors:Carlo Biffi, Steven McDonagh, Philip Torr, Aleš Leonardis, Sarah Parisot
Title: Many-shot from Low-shot: Learning to Annotate using Mixed Supervision for Object Detection
Abstract:
Object detection has witnessed significant progress by relying on large, manually annotated datasets. Annotating such datasets is highly time consuming and expensive, which motivates the development of weakly supervised and few-shot object detection methods. However, these methods largely underperform with respect to their strongly supervised counterpart, as weak training signals often result in partial or oversized detections. Towards solving this problem we introduce, for the first time, an online annotation module (OAM) that learns to generate a many-shot set of reliable annotations from a larger volume of weakly labelled images. Our OAM can be jointly trained with any fully supervised two-stage object detection method, providing additional training annotations on the fly. This results in a fully end-to-end strategy that only requires a low-shot set of fully annotated images. The integration of the OAM with Fast(er) R-CNN improves their performance by $17\%$ mAP, $9\%$ AP50 on PASCAL VOC 2007 and MS-COCO benchmarks, and significantly outperforms competing methods using mixed supervision."



Paperid:325
Authors:Yueqi Duan, Haidong Zhu, He Wang, Li Yi Ram Nevatia, Leonidas J. Guibas
Title: Curriculum DeepSDF
Abstract:
When learning to sketch, beginners start with simple and flexible shapes, and then gradually strive for more complex and accurate ones in the subsequent training sessions. In this paper, we design a "shape curriculum" for learning a continuous Signed Distance Function (SDF) on shapes, namely Curriculum DeepSDF. Inspired by how humans learn, Curriculum DeepSDF organizes the learning task in ascending order of difficulty according to the following two criteria: surface accuracy and sample difficulty. The former considers stringency in supervising with ground truth, while the latter regards the weights of hard training samples near complex geometry and fine structure. More specifically, Curriculum DeepSDF learns to reconstruct coarse shapes at first, and then gradually increases the accuracy and focuses more on complex local details. Experimental results show that a carefully-designed curriculum leads to significantly better shape reconstructions with the same training data, training epochs and network architecture as DeepSDF. We believe that the application of shape curricula can benefit the training process of a wide variety of 3D shape representation learning methods."



Paperid:326
Authors:Minghua Liu, Xiaoshuai Zhang, Hao Su
Title: Meshing Point Clouds with Predicted Intrinsic-Extrinsic Ratio Guidance
Abstract:
We are interested in reconstructing the mesh representation of object surfaces from a point cloud. Surface reconstruction is a prerequisite for downstream applications such as rendering, collision avoidance for planning, animation, etc. However, the task is challenging if the input point cloud has a low resolution, which is common in real-world scenarios (e.g., from LiDAR or Kinect sensors). Existing learning-based mesh generative methods mostly predict the surface by first building a shape embedding at the whole-object level, a design that causes issues in generating fine-grained details and generalizing to unseen categories. Instead, we propose to leverage the input point cloud as much as possible, by only adding connectivity information to existing points. In particular, we predict which triplets of points should form faces. Our key innovation is a surrogate of local connectivity, calculated by comparing the intrinsic/extrinsic metrics. We learn to predict this surrogate using a deep point cloud network and then feed it to an efficient post-processing module for high-quality mesh generation. Experiments on synthetic and real data demonstrate that our method not only preserves details and handles ambiguous structures, but also possesses strong generalizability to unseen categories."



Paperid:327
Authors:Yuanhao Xiong, Cho-Jui Hsieh
Title: Improved Adversarial Training via Learned Optimizer
Abstract:
Adversarial attacks have recently become a tremendous threat to deep learning models. To improve model robustness, adversarial training, formulated as a minimax optimization problem, has been recognized as one of the most effective defense mechanisms. However, the non-convex and non-concave property poses a great challenge to minimax training. In this paper, we empirically demonstrate that the commonly used PGD attack may not be optimal for inner maximization, and that an improved inner optimizer can lead to a more robust model. We then leverage a learning-to-learn (L2L) framework to train an optimizer with a recurrent neural network (RNN), which provides update directions and steps adaptively for the inner problem. By co-training the optimizer's parameters and the model's weights, the proposed framework consistently improves over PGD-based adversarial training and TRADES."
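For reference, the hand-crafted PGD inner maximization that the learned optimizer replaces looks roughly as follows; this is the standard baseline formulation, with the usual CIFAR-style epsilon and step sizes rather than values from the paper.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    # inner maximization: find a perturbation delta in an L-infinity ball that maximizes the loss
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
        delta.data = (x + delta.data).clamp(0, 1) - x            # keep the perturbed image in [0, 1]
    return (x + delta).detach()

def adv_train_step(model, optimizer, x, y):
    # outer minimization on the maximized (adversarial) examples
    x_adv = pgd_attack(model, x, y)
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# toy usage with a linear classifier on random CIFAR-sized images
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
adv_train_step(model, opt, torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,)))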



Paperid:328
Authors:Pengxu Wei, Ziwei Xie, Hannan Lu, Zongyuan Zhan, Qixiang Ye, Wangmeng Zuo, Liang Lin
Title: Component Divide-and-Conquer for Real-World Image Super-Resolution
Abstract:
In this paper, we present a large-scale Diverse Real-world image Super-Resolution dataset, i.e., DRealSR, as well as a divide-and-conquer Super-Resolution (SR) network, exploring the utility of guiding the SR model with low-level image components. DRealSR establishes a new SR benchmark with diverse real-world degradation processes, mitigating the limitations of conventional simulated image degradation. In general, the targets of SR vary with image regions with different low-level image components, e.g., smoothness preserving for flat regions, sharpening for edges, and detail enhancing for textures. Learning an SR model with a conventional pixel-wise loss is usually dominated by flat regions and edges, and fails to infer realistic details of complex textures. We propose a Component Divide-and-Conquer (CDC) model and a Gradient-Weighted (GW) loss for SR. Our CDC parses an image into three components, employs three Component-Attentive Blocks (CABs) to learn attentive masks and intermediate SR predictions with an intermediate supervision learning strategy, and trains an SR model following a divide-and-conquer learning principle. Our GW loss also provides a feasible way to balance the difficulties of image components for SR. Extensive experiments validate the superior performance of our CDC and the challenging aspects of our DRealSR dataset related to diverse real-world scenarios. Our dataset and codes are publicly available at https://github.com/xiezw5/Component-Divide-and-Conquer-for-Real-World-Image-Super-Resolution"
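The idea of weighting the loss by low-level image components can be illustrated with a simple gradient-weighted L1 term, where per-pixel errors are re-weighted by the gradient magnitude of the ground-truth image so that textured regions are not drowned out by flat ones; the filter and weighting below are our simplification, not the exact CDC/GW formulation.

import torch
import torch.nn.functional as F

def gradient_weighted_l1(sr, hr, alpha=4.0):
    # sr, hr: (B, C, H, W) images in [0, 1]
    kx = torch.tensor([[[[-1.0, 0.0, 1.0]]]])                   # simple horizontal difference kernel
    ky = kx.transpose(2, 3)
    gray = hr.mean(dim=1, keepdim=True)
    gx = F.conv2d(gray, kx, padding=(0, 1))
    gy = F.conv2d(gray, ky, padding=(1, 0))
    weight = 1.0 + alpha * torch.sqrt(gx ** 2 + gy ** 2)        # emphasize edges and textures
    return (weight * (sr - hr).abs()).mean()

sr = torch.rand(2, 3, 64, 64, requires_grad=True)
hr = torch.rand(2, 3, 64, 64)
gradient_weighted_l1(sr, hr).backward()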



Paperid:329
Authors:Yunhang Shen, Rongrong Ji, Yan Wang, Zhiwei Chen, Feng Zheng, Feiyue Huang, Yunsheng Wu
Title: Enabling Deep Residual Networks for Weakly Supervised Object Detection
Abstract:
Weakly supervised object detection (WSOD) has attracted extensive research attention due to its great flexibility of exploiting large-scale image-level annotation for detector training. Whilst deep residual networks such as ResNet and DenseNet have become the standard backbones for many computer vision tasks, the cutting-edge WSOD methods still rely on plain networks, e.g., VGG, as backbones. It is indeed not trivial to employ deep residual networks for WSOD; doing so naively even leads to significant deterioration of detection accuracy and non-convergence. In this paper, we identify the intrinsic root cause through careful analysis and propose a sequence of design principles to take full advantage of deep residual learning for WSOD, from the perspectives of adding redundancy, improving robustness and aligning features. First, a redundant adaptation neck is key for effective object instance localization and discriminative feature learning. Second, small-kernel convolutions and MaxPool down-samplings help improve the robustness of information flow, which gives finer object boundaries and makes the detector more sensitive to small objects. Third, dilated convolution is essential to align the proposal features and exploit diverse local information by extracting high-resolution feature maps. Extensive experiments show that the proposed principles enable deep residual networks to establish new state-of-the-art results on PASCAL VOC and MS COCO."



Paperid:330
Authors:Hiroaki Santo, Michael Waechter, Yasuyuki Matsushita
Title: Deep near-light photometric stereo for spatially varying reflectances
Abstract:
This paper presents a near-light photometric stereo method for spatially varying reflectances. Recent studies in photometric stereo proposed learning-based approaches to handle diverse real-world reflectances and achieve high accuracy compared to conventional methods. However, they assume distant (i.e., parallel) lights, which can in practical settings only be approximately realized, and they fail in near-light conditions. Near-light photometric stereo methods address near-light conditions but previous works are limited to over-simplified reflectances, such as Lambertian reflectance. The proposed method takes a hybrid approach of distant- and near-light models, where the surface normal of a small area (corresponding to a pixel) is computed locally with a distant light assumption, and the reconstruction error is assessed based on a near-light image formation model. This paper is the first work to solve unknown, spatially varying, diverse reflectances in near-light photometric stereo. "



Paperid:331
Authors:Mert Bulent Sariyildiz, Julien Perez, Diane Larlus
Title: Learning Visual Representations with Caption Annotations
Abstract:
Pretraining general-purpose visual features has become a crucial part of tackling many computer vision tasks. While one can learn such features on the extensively-annotated ImageNet dataset, recent approaches have looked at ways to allow for noisy, fewer, or even no annotations to perform such pretraining. Starting from the observation that captioned images are easily crawlable, we argue that this overlooked source of information can be exploited to supervise the training of visual representations. To do so, motivated by recent progress in language models, we introduce image-conditioned masked language modeling (ICMLM) -- a proxy task to learn visual representations over image-caption pairs. ICMLM consists of predicting masked words in captions by relying on visual cues. To tackle this task, we propose hybrid models, with dedicated visual and textual encoders, and we show that the visual representations learned as a by-product of solving this task transfer well to a variety of target tasks. Our experiments confirm that image captions can be leveraged to inject global and localized semantic information into visual representations. Project website: https://europe.naverlabs.com/ICMLM."



Paperid:332
Authors:Tz-Ying Wu, Pedro Morgado, Pei Wang, Chih-Hui Ho, Nuno Vasconcelos
Title: Solving Long-tailed Recognition with Deep Realistic Taxonomic Classifier
Abstract:
Long-tail recognition tackles the natural non-uniformly distributed data in real-world scenarios. While modern classifiers perform well on populated classes, their performance degrades significantly on tail classes. Humans, however, are less affected by this since, when confronted with uncertain examples, they simply opt to provide coarser predictions. Motivated by this, a deep realistic taxonomic classifier (Deep-RTC) is proposed as a new solution to the long-tail problem, combining realism with hierarchical predictions. The model has the option to reject classifying samples at different levels of the taxonomy, once it cannot guarantee the desired performance. Deep-RTC is implemented with a stochastic tree sampling during training to simulate all possible classification conditions at finer or coarser levels and a rejection mechanism at inference time. Experiments on the long-tailed versions of four datasets, CIFAR100, AWA2, ImageNet, and iNaturalist, demonstrate that the proposed approach preserves more information on all classes with different popularity levels. Deep-RTC also outperforms the state-of-the-art methods in the long-tailed recognition, hierarchical classification, and learning-with-rejection literature using the proposed correctly predicted bits (CPB) metric."



Paperid:333
Authors:Yanda Meng, Wei Meng, Dongxu Gao, Yitian Zhao, Xiaoyun Yang, Xiaowei Huang, Yalin Zheng
Title: Regression of Instance Boundary by Aggregated CNN and GCN
Abstract:
This paper proposes a straightforward, intuitive deep learning approach for (biomedical) image segmentation tasks. Different from the existing dense pixel classification methods, we develop a novel multilevel aggregation network to directly regress the coordinates of the boundary of instances in an end-to-end manner. The network seamlessly combines a standard convolutional neural network (CNN) with an Attention Refinement Module (ARM) and a Graph Convolution Network (GCN). By iteratively and hierarchically fusing the features across different layers of the CNN, our approach gains sufficient semantic information from the input image and pays special attention to the local boundaries with the help of the ARM and GCN. In particular, thanks to the proposed aggregation GCN, our network benefits from direct feature learning of the instances’ boundary locations and the spatial information propagation across the image. Experiments on several challenging datasets demonstrate that our method achieves comparable results with state-of-the-art approaches but requires less inference time for the segmentation of the fetal head in ultrasound images and of the optic disc and optic cup in color fundus images."



Paperid:334
Authors:Rui Yan, Lingxi Xie, Jinhui Tang, Xiangbo Shu, Qi Tian
Title: Social Adaptive Module for Weakly-supervised Group Activity Recognition
Abstract:
This paper presents a new task named weakly-supervised group activity recognition (GAR), which differs from conventional GAR tasks in that only video-level labels are available, yet the important persons within each frame are not provided even in the training data. This makes it easy for us to collect and annotate a large-scale NBA dataset and thus raises new challenges for GAR. To mine useful information from weak supervision, we present a key insight that key instances are likely to be related to each other, and thus design a social adaptive module (SAM) to reason about key persons and frames from noisy data. Experiments show significant improvement on the NBA dataset as well as the popular volleyball dataset. In particular, our model trained on video-level annotation achieves comparable accuracy to prior algorithms that require strong labels."



Paperid:335
Authors:Chongyi Li, Runmin Cong, Yongri Piao, Qianqian Xu, Chen Change Loy
Title: RGB-D Salient Object Detection with Cross-Modality Modulation and Selection
Abstract:
We present an effective method to progressively integrate and refine the cross-modality complementarities for RGB-D salient object detection (SOD). The proposed network mainly solves two challenging issues: 1) how to effectively integrate the complementary information from the RGB image and its corresponding depth map, and 2) how to adaptively select more saliency-related features. First, we propose a cross-modality feature modulation (cmFM) module to enhance feature representations by taking the depth features as prior, which models the complementary relations of RGB-D data. Second, we propose an adaptive feature selection (AFS) module to select saliency-related features and suppress the inferior ones. The AFS module exploits multi-modality spatial feature fusion, with the self-modality and cross-modality interdependencies of channel features taken into account. Third, we employ a saliency-guided position-edge attention (sg-PEA) module to encourage our network to focus more on saliency-related regions. Together, the above modules form the cmMS block, which facilitates the refinement of saliency features in a coarse-to-fine fashion. Coupled with a bottom-up inference, the refined saliency features enable accurate and edge-preserving SOD. Extensive experiments demonstrate that our network outperforms state-of-the-art saliency detectors on six popular RGB-D SOD benchmarks."



Paperid:336
Authors:Hung-Yu Tseng, Hsin-Ying Lee, Lu Jiang, Ming-Hsuan Yang, Weilong Yang
Title: RetrieveGAN: Image Synthesis via Differentiable Patch Retrieval
Abstract:
Image generation from a scene description is a cornerstone technique for controlled generation, which is beneficial to applications such as content creation and image editing. In this work, we aim to synthesize images from a scene description with retrieved patches as references. We propose a differentiable retrieval module. With the differentiable retrieval module, we can (1) make the entire pipeline end-to-end trainable, enabling the learning of better feature embeddings for retrieval; (2) encourage the selection of mutually compatible patches with additional objective functions. We conduct extensive quantitative and qualitative experiments to demonstrate that the proposed method can generate realistic and diverse images, where the retrieved patches are reasonable and mutually compatible."



Paperid:337
Authors:Dongzhan Zhou, Xinchi Zhou, Hongwen Zhang, Shuai Yi, Wanli Ouyang
Title: Cheaper Pre-training Lunch: An Efficient Paradigm for Object Detection
Abstract:
In this paper, we propose a general and efficient pre-training paradigm, Montage pre-training, for object detection. Montage pre-training needs only the target detection dataset while requiring only 1/4 of the computational resources of the widely adopted ImageNet pre-training. To build such an efficient paradigm, we reduce the potential redundancy by carefully extracting useful samples from the original images, assembling samples in a Montage manner as input, and using an ERF-adaptive dense classification strategy for model pre-training. These designs include not only a new input pattern to improve the spatial utilization but also a novel learning objective to expand the effective receptive field of the pre-trained model. The efficiency and effectiveness of Montage pre-training are validated by extensive experiments on the MS-COCO dataset, where the results indicate that models using Montage pre-training are able to achieve on-par or even better detection performance compared with ImageNet pre-training."



Paperid:338
Authors:Guan’an Wang, Shaogang Gong, Jian Cheng, Zengguang Hou
Title: Faster Person Re-Identification
Abstract:
Fast person re-identification (ReID) aims to search person images quickly and accurately. The main idea of recent fast ReID methods is the hashing algorithm, which learns compact binary codes and performs fast Hamming distance computation and counting sort. However, a very long code is needed for high accuracy (e.g., 2048), which compromises search speed. In this work, we introduce a new solution for fast ReID by formulating a novel Coarse-to-Fine (CtF) hashing code search strategy, which complementarily uses short and long codes, achieving both faster speed and better accuracy. It uses shorter codes to coarsely rank broad matching similarities and longer codes to refine only a few top candidates for more accurate instance ReID. Specifically, we design an All-in-One (AiO) framework together with a Distance Threshold Optimization (DTO) algorithm. In AiO, we simultaneously learn and enhance multiple codes of different lengths in a single model. It learns multiple codes in a pyramid structure, and encourages shorter codes to mimic longer codes by self-distillation. DTO solves a complex threshold search problem by a simple optimization process, and the balance between accuracy and speed is easily controlled by a single parameter. It formulates the optimization target as an $F_{\beta}$ score that can be optimised by Gaussian cumulative distribution functions. Experimental results on 2 datasets show that our proposed method (CtF) is not only 8% more accurate but also 5x faster than contemporary hashing ReID methods. Compared with non-hashing ReID methods, CtF is 50x faster with comparable accuracy. Code is available at https://github.com/wangguanan/light-reid."
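The coarse-to-fine search itself can be sketched in a few lines: short codes rank the whole gallery cheaply, and long codes re-rank only the top candidates. The code lengths and the top-k cut-off below are placeholders, not values from the paper.

import numpy as np

def hamming(codes, query):
    # codes: (N, L) binary codes, query: (L,) binary code -> (N,) Hamming distances
    return (codes != query).sum(axis=1)

def coarse_to_fine_search(short_gallery, long_gallery, short_q, long_q, topk=100):
    coarse = np.argsort(hamming(short_gallery, short_q))[:topk]    # cheap broad ranking
    fine = np.argsort(hamming(long_gallery[coarse], long_q))       # accurate re-ranking of top-k
    return coarse[fine]                                            # gallery indices, best first

rng = np.random.default_rng(0)
short_gallery = rng.integers(0, 2, size=(10000, 32), dtype=np.uint8)
long_gallery = rng.integers(0, 2, size=(10000, 2048), dtype=np.uint8)
ranking = coarse_to_fine_search(short_gallery, long_gallery,
                                short_gallery[42], long_gallery[42])
assert ranking[0] == 42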



Paperid:339
Authors:Max Ehrlich, Ser-Nam Lim, Larry Davis, Abhinav Shrivastava
Title: Quantization Guided JPEG Artifact Correction
Abstract:
The JPEG image compression algorithm is the most popular method of image compression because of its ability to achieve large compression ratios. However, to achieve such high compression, information is lost. For aggressive quantization settings, this leads to a noticeable reduction in image quality. Artifact correction has been studied in the context of deep neural networks for some time, but the current state-of-the-art methods require a different model to be trained for each quality setting, greatly limiting their practical application. We solve this problem by creating a novel architecture which is parameterized by the JPEG file's quantization matrix. This allows our single model to achieve state-of-the-art performance over models trained for specific quality settings."
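A generic way to parameterize a restoration network by the quantization matrix, shown only as an illustration of the conditioning idea and not the paper's architecture, is a feature-wise modulation whose scale and shift are predicted from the flattened 8x8 table:

import torch
import torch.nn as nn

class QuantConditionedBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.film = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2 * channels))

    def forward(self, feat, qmatrix):
        # feat: (B, C, H, W) features of the decoded JPEG; qmatrix: (B, 8, 8) luma quantization table
        scale, shift = self.film(qmatrix.flatten(1) / 255.0).chunk(2, dim=1)
        feat = self.conv(feat)
        return torch.relu(feat * (1 + scale[:, :, None, None]) + shift[:, :, None, None])

# toy usage with a random quantization table
block = QuantConditionedBlock()
out = block(torch.randn(2, 64, 32, 32), torch.randint(1, 100, (2, 8, 8)).float())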



Paperid:340
Authors:Yujun Chen, Manoj Kumar Sharma, Ashutosh Sabharwal, Ashok Veeraraghavan, Aswin C. Sankaranarayanan
Title: 3PointTM: Faster Measurement of High-Dimensional Transmission Matrices
Abstract:
A transmission matrix (TM) describes the linear relationship between input and output phasor fields when a coherent wave passes through a scattering medium. Measurement of the TM enables numerous applications, but is challenging and time-intensive for an arbitrary medium. State-of-the-art methods, including phase-shifting holography and double phase retrieval, require a large number of measurements and post-capture reconstruction that is often computationally intensive. In this paper, we propose 3PointTM, an approach for sensing TMs that uses a minimal number of measurements per pixel - reducing the measurement budget by a factor of two compared to the state of the art in phase-shifting holography for measuring TMs - and has a low computational complexity as compared to phase retrieval. We validate our approach on real and simulated data, and show successful focusing of light and image reconstruction on dense scattering media."



Paperid:341
Authors:Xide Xia, Meng Zhang, Tianfan Xue, Zheng Sun, Hui Fang, Brian Kulis , Jiawen Chen
Title: Joint Bilateral Learning for Real-time Universal Photorealistic Style Transfer
Abstract:
Photorealistic style transfer is the task of transferring the artistic style of an image onto a content target, producing a result that could plausibly have been captured by a camera. Recent approaches, based on deep neural networks, produce impressive results but are either too slow to run at practical resolutions, or still contain objectionable artifacts. We propose a new end-to-end model for photorealistic style transfer that is both fast and inherently generates photorealistic results. The core of our approach is a feed-forward neural network that learns local edge-aware affine transforms that automatically obey the photorealism constraint. When trained on a diverse set of images and a variety of styles, our model can robustly apply style transfer to an arbitrary pair of input images. Compared to the state of the art, our method produces visually superior results and is three orders of magnitude faster, enabling real-time performance at 4K on a mobile phone. We validate our method with ablation and user studies."



Paperid:342
Authors:Xiangyu Zhu, Fan Yang, Di Huang, Chang Yu, Hao Wang, Jianzhu Guo, Zhen Lei, Stan Z. Li
Title: Beyond 3DMM Space: Towards Fine-grained 3D Face Reconstruction
Abstract:
Recently, deep learning based 3D face reconstruction methods have shown promising results in both quality and efficiency. However, most of their training data is constructed by the 3D Morphable Model, whose spanned space covers only a small part of the full shape space. As a result, the reconstruction results lose fine-grained geometry and look different from real faces. To alleviate this issue, we first propose a solution to construct large-scale fine-grained 3D data from RGB-D images, which are expected to be collected at scale as hand-held depth cameras become widespread. A new dataset, Fine-Grained 3D face (FG3D), with 200k samples is constructed to provide sufficient data for neural network training. Secondly, we propose a Fine-Grained reconstruction Network (FGNet) that can concentrate on shape modification by warping the network input and output to the UV space. Through FG3D and FGNet, we successfully generate reconstruction results with fine-grained geometry. The experiments on several benchmarks validate the effectiveness of our method compared to several baselines and other state-of-the-art methods."



Paperid:343
Authors:Arun Mallya, Ting-Chun Wang, Karan Sapra, Ming-Yu Liu
Title: World-Consistent Video-to-Video Synthesis
Abstract:
Video-to-video synthesis is a powerful tool for converting high-level semantic inputs to photorealistic videos. However, while existing vid2vid methods can maintain short-term temporal consistency, they fail to ensure long-term consistency in the outputs. This is because they generate each frame only based on the past few frames. They lack knowledge of the 3D world being generated. In this work, we propose a framework for utilizing all past generated frames when synthesizing each frame. This is achieved by condensing the 3D world generated so far into a physically-grounded estimate of the current frame, which we call the guidance image. A novel module is also proposed to take advantage of the information stored in the guidance images. Extensive experimental results on several challenging datasets verify the effectiveness of our method in achieving world consistency - the output video is consistent within the entire generated 3D world. Using Adobe Acrobat Reader is highly recommended for playing the videos embedded in this submission."



Paperid:344
Authors:Qi Fan, Lei Ke, Wenjie Pei, Chi-Keung Tang, Yu-Wing Tai
Title: Commonality-Parsing Network across Shape and Appearance for Partially Supervised Instance Segmentation
Abstract:
Partially supervised instance segmentation aims to perform learning on limited mask-annotated categories of data thus eliminating expensive and exhaustive mask annotation. The learned models are expected to be generalizable to novel categories. Existing methods either learn a transfer function from detection to segmentation, or cluster shape priors for segmenting novel categories. We propose to learn the underlying class-agnostic commonalities that can be generalized from mask-annotated categories to novel categories. Specifically, we parse two types of commonalities: 1) shape commonalities which are learned by performing supervised learning on instance boundary prediction; and 2) appearance commonalities which are captured by modeling pairwise affinities among pixels of feature maps to optimize the separability between instance and the background. Incorporating both the shape and appearance commonalities, our model significantly outperforms the state-of-the-art methods on both partially supervised setting and few-shot setting for instance segmentation on COCO dataset. The code is available at https://github.com/fanq15/CPMask."



Paperid:345
Authors:Umberto Michieli, Edoardo Borsato, Luca Rossi, Pietro Zanuttigh
Title: GMNet: Graph Matching Network for Large Scale Part Semantic Segmentation in the Wild
Abstract:
The semantic segmentation of parts of objects in the wild is a challenging task in which multiple instances of objects and multiple parts within those objects must be detected in the scene. This problem remains only marginally explored, despite its fundamental importance towards detailed object understanding. In this work, we propose a novel framework combining higher object-level context conditioning and part-level spatial relationships to address the task. To tackle object-level ambiguity, a class-conditioning module is introduced to retain class-level semantics when learning parts-level semantics. In this way, mid-level features also carry this information prior to the decoding stage. To tackle part-level ambiguity and localization, we propose a novel adjacency graph-based module that aims at matching the relative spatial relationships between ground truth and predicted parts. The experimental evaluation on the Pascal-Part dataset shows that we achieve state-of-the-art results on this task."



Paperid:346
Authors:Nico Messikommer, Daniel Gehrig, Antonio Loquercio, Davide Scaramuzza
Title: Event-based Asynchronous Sparse Convolutional Networks
Abstract:
Event cameras are bio-inspired sensors that respond to per-pixel brightness changes in the form of asynchronous and sparse “events”. Recently, pattern recognition algorithms, such as learning-based methods, have made significant progress with event cameras by converting events into synchronous dense, image-like representations and applying traditional machine learning methods developed for standard cameras. However, these approaches discard the spatial and temporal sparsity inherent in event data at the cost of higher computational complexity and latency. In this work, we present a general framework for converting models trained on synchronous image-like event representations into asynchronous models with identical output, thus directly leveraging the intrinsic asynchronous and sparse nature of the event data. We show both theoretically and experimentally that this drastically reduces the computational complexity and latency of high-capacity, synchronous neural networks without sacrificing accuracy. In addition, our framework has several desirable characteristics: (i) it exploits spatial and temporal sparsity in the event data explicitly, (ii) it is agnostic to the event representation, network architecture, and task, and (iii) it does not require any train-time change, since it is compatible with the standard neural networks’ training process. We thoroughly validate the proposed framework on two computer vision tasks: object detection and object recognition. In these tasks, we reduce the computational complexity up to 20 times with respect to high-latency neural networks. At the same time, we outperform state-of-the-art asynchronous approaches up to 24.5% in prediction accuracy."



Paperid:347
Authors:Giovanni Pintore, Marco Agus, Enrico Gobbetti
Title: AtlantaNet: Inferring the 3D Indoor Layout from a Single 360° Image beyond the Manhattan World Assumption
Abstract:
We introduce a novel end-to-end approach to predict a 3D room layout from a single panoramic image. Compared to recent state-of-the-art works, our method is not limited to Manhattan World environments, and can reconstruct rooms bounded by vertical walls that do not form right angles or are curved -- i.e., Atlanta World models. In our approach, we project the original gravity-aligned panoramic image on two horizontal planes, one above and one below the camera. This representation encodes all the information needed to recover the Atlanta World 3D bounding surfaces of the room in the form of a 2D room footprint on the floor plan and a room height. To predict the 3D layout, we propose an encoder-decoder neural network architecture, leveraging Recurrent Neural Networks (RNNs) to capture long-range geometric patterns, and exploiting a customized training strategy based on domain-specific knowledge. The experimental results demonstrate that our method outperforms state-of-the-art solutions in prediction accuracy, in particular in cases of complex wall layouts or curved wall footprints."



Paperid:348
Authors:Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua
Title: AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification
Abstract:
Convolutional operations have two limitations: (1) they do not explicitly model where to focus, as the same filter is applied to all positions, and (2) they are unsuitable for modeling long-range dependencies as they only operate on a small neighborhood. While both limitations can be alleviated by attention operations, many design choices remain to be determined to use attention, especially when applying attention to videos. Towards a principled way of applying attention to videos, we address the task of spatiotemporal attention cell search. We propose a novel search space for spatiotemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell. The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both Kinetics-600 and MiT datasets. The discovered attention cells outperform non-local blocks on both datasets, and demonstrate strong generalization across different modalities, backbones, and datasets. Inserting our attention cells into I3D-R50 yields state-of-the-art performance on both datasets."



Paperid:349
Authors:Tyler L. Hayes, Kushal Kafle, Robik Shrestha, Manoj Acharya, Christopher Kanan
Title: REMIND Your Neural Network to Prevent Catastrophic Forgetting
Abstract:
People learn throughout life. However, incrementally updating conventional neural networks leads to catastrophic forgetting. A common remedy is replay, which is inspired by how the brain consolidates memory. Replay involves fine-tuning a network on a mixture of new and old instances. While there is neuroscientific evidence that the brain replays compressed memories, existing methods for convolutional networks replay raw images. Here, we propose REMIND, a brain-inspired approach that enables efficient replay with compressed representations. REMIND is trained in an online manner, meaning it learns one example at a time, which is closer to how humans learn. Under the same constraints, REMIND outperforms other methods for incremental class learning on the ImageNet ILSVRC-2012 dataset. We probe REMIND's robustness to data ordering schemes known to induce catastrophic forgetting. We demonstrate REMIND's generality by pioneering online learning for Visual Question Answering (VQA)."



Paperid:350
Authors:Abhiram Gnanasambandam, Stanley H. Chan
Title: Image Classification in the Dark using Quanta Image Sensors
Abstract:
State-of-the-art image classifiers are trained and tested using well-illuminated images. These images are typically captured by CMOS image sensors with at least tens of photons per pixel. However, in dark environments when the photon flux is low, image classification becomes difficult because the measured signal is suppressed by noise. In this paper, we present a new low-light image classification solution using Quanta Image Sensors (QIS). QIS are a new type of image sensors that possess photon counting ability without compromising on pixel size and spatial resolution. Numerous studies over the past decade have demonstrated the feasibility of QIS for low-light imaging, but their usage for image classification has not been studied. This paper fills the gap by presenting a novel student-teacher learning scheme which allows us to classify the noisy QIS raw data. We show that with student-teacher learning, we are able to achieve image classification at a photon level of one photon per pixel or lower. Experimental results verify the effectiveness of the proposed method compared to existing solutions."



Paperid:351
Authors:Yan Luo, Yongkang Wong, Mohan S. Kankanhalli, Qi Zhao
Title: n-Reference Transfer Learning for Saliency Prediction
Abstract:
Benefiting from deep learning research and large-scale datasets, saliency prediction has achieved significant success in the past decade. However, it still remains challenging to predict saliency maps on images in new domains that lack sufficient data for data-hungry models. To solve this problem, we propose a few-shot transfer learning paradigm for saliency prediction, which enables efficient transfer of knowledge learned from the existing large-scale saliency datasets to a target domain with limited labeled examples. Specifically, very few target domain examples are used as the reference to train a model with a source domain dataset such that the training process can converge to a local minimum in favor of the target domain. Then, the learned model is further fine-tuned with the reference. The proposed framework is gradient-based and model-agnostic. We conduct comprehensive experiments and ablation study on various source domain and target domain pairs. The results show that the proposed framework achieves a significant performance improvement. The code is publicly available at https://github.com/luoyan407/n-reference."



Paperid:352
Authors:Shuhan Chen, Yun Fu
Title: Progressively Guided Alternate Refinement Network for RGB-D Salient Object Detection
Abstract:
In this paper, we aim to develop an efficient and compact deep network for RGB-D salient object detection, where the depth image provides complementary information to boost performance in complex scenarios. Starting from a coarse initial prediction by a multi-scale residual block, we propose a progressively guided alternate refinement network to refine it. Instead of using ImageNet pre-trained backbone network, we first construct a lightweight depth stream by learning from scratch, which can extract complementary features more efficiently with less redundancy. Then, different from the existing fusion based methods, RGB and depth features are fed into proposed guided residual (GR) blocks alternately to reduce their mutual degradation. By assigning progressive guidance in the stacked GR blocks within each side-output, the false detection and missing parts can be well remedied. Extensive experiments on seven benchmark datasets demonstrate that our model outperforms existing state-of-the-art approaches by a large margin, and also shows superiority in efficiency (71 FPS) and model size (64.9 MB)."



Paperid:353
Authors:Peisen Zhao, Lingxi Xie, Chen Ju, Ya Zhang, Yanfeng Wang, Qi Tian
Title: Bottom-Up Temporal Action Localization with Mutual Regularization
Abstract:
Recently, temporal action localization (TAL), i.e., finding specific action segments in untrimmed videos, has attracted increasing attention from the computer vision community. State-of-the-art solutions for TAL involve evaluating the frame-level probabilities of three action-indicating phases, i.e., starting, continuing, and ending, and then post-processing these predictions for the final localization. This paper delves deep into this mechanism, and argues that existing methods, by modeling these phases as individual classification tasks, ignore the potential temporal constraints between them. This can lead to incorrect and/or inconsistent predictions when some frames of the video input lack sufficient discriminative information. To alleviate this problem, we introduce two regularization terms to mutually regularize the learning procedure: the Intra-phase Consistency (IntraC) regularization is proposed to make the predictions consistent within each phase, and the Inter-phase Consistency (InterC) regularization is proposed to keep consistency between these phases. Jointly optimizing these two terms, the entire framework is aware of these potential constraints during an end-to-end optimization process. Experiments are performed on two popular TAL datasets, THUMOS14 and ActivityNet1.3. Our approach clearly outperforms the baseline both quantitatively and qualitatively. The proposed regularization also generalizes to other TAL methods (e.g., TSA-Net and PGCN). Code: https://github.com/PeisenZhao/Bottom-Up-TAL-with-MR"



Paperid:354
Authors:Christian Simon, Piotr Koniusz, Richard Nock, Mehrtash Harandi
Title: On Modulating the Gradient for Meta-Learning
Abstract:
Inspired by optimization techniques, we propose a novel meta-learning algorithm with gradient modulation to encourage fast-adaptation of neural networks in the absence of abundant data. Our method, termed ModGrad, is designed to circumvent the noisy nature of the gradients which is prevalent in low-data regimes. Furthermore and having the scalability concern in mind, we formulate ModGrad via low-rank approximations, which in turn enables us to employ ModGrad to adapt hefty neural networks. We thoroughly assess and contrast ModGrad against a large family of meta-learning techniques and observe that the proposed algorithm outperforms baselines comfortably while enjoying faster convergence."



Paperid:355
Authors:Hsin-Yu Chang, Zhixiang Wang, Yung-Yu Chuang
Title: Domain-Specific Mappings for Generative Adversarial Style Transfer
Abstract:
Style transfer generates an image whose content comes from one image and style from the other. Image-to-image translation approaches with disentangled representations have been shown effective for style transfer between two image categories. However, previous methods often assume a shared domain-invariant content space, which could compromise the content representation power. For addressing this issue, this paper leverages domain-specific mappings for remapping latent features in the shared content space to domain-specific content spaces. This way, images can be encoded more properly for style transfer. Experiments show that the proposed method outperforms previous style transfer methods, particularly on challenging scenarios that would require semantic correspondences between images. Code and results are available at https://github.com/acht7111020/DSMAP."



Paperid:356
Authors:Timo Milbich, Karsten Roth, Homanga Bharadhwaj, Samarth Sinha, Yoshua Bengio, Björn Ommer, Joseph Paul Cohen
Title: DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning
Abstract:
Visual Similarity plays an important role in many computer vision applications. Deep metric learning (DML) is a powerful framework for learning such similarities which not only generalize from training data to identically distributed test distributions, but in particular also translate to unknown test classes. However, its prevailing learning paradigm is class-discriminative supervised training, which typically results in representations specialized in separating training classes. For effective generalization, however, such an image representation needs to capture a diverse range of data characteristics. To this end, we propose and study multiple complementary learning tasks, targeting conceptually different data relationships by only resorting to the available training samples and labels of a standard DML setting. Through simultaneous optimization of our tasks we learn a single model to aggregate their training signals, resulting in strong generalization and state-of-the-art performance on multiple established DML benchmark datasets."



Paperid:357
Authors:Yawei Li, Shuhang Gu, Kai Zhang, Luc Van Gool, Radu Timofte
Title: DHP: Differentiable Meta Pruning via HyperNetworks
Abstract:
Network pruning has been the driving force for the acceleration of neural networks and the alleviation of model storage/transmission burden. With the advent of AutoML and neural architecture search (NAS), pruning has become topical with automatic mechanisms and search-based architecture optimization. Yet, current automatic designs rely on either reinforcement learning or evolutionary algorithms. Due to the non-differentiability of those algorithms, the pruning algorithm needs a long search stage before reaching convergence. To circumvent this problem, this paper introduces a differentiable pruning method via hypernetworks for automatic network pruning. The specifically designed hypernetworks take latent vectors as input and generate the weight parameters of the backbone network. The latent vectors control the output channels of the convolutional layers in the backbone network and act as a handle for the pruning of the layers. By enforcing $\ell_1$ sparsity regularization on the latent vectors and utilizing a proximal gradient solver, sparse latent vectors can be obtained. Passing the sparsified latent vectors through the hypernetworks, the corresponding slices of the generated weight parameters can be removed, achieving the effect of network pruning. The latent vectors of all the layers are pruned together, resulting in an automatic layer configuration. Extensive experiments are conducted on various networks for image classification, single image super-resolution, and denoising. The experimental results validate the proposed method. Code is available at https://github.com/ofsoundof/dhp"
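The core sparsification step can be summarized as a proximal gradient update on each layer's latent vector: a plain gradient step followed by the soft-thresholding operator of the $\ell_1$ penalty, which drives entries exactly to zero. The sketch below shows that update in isolation; the learning rate, regularization strength, and how the gradient is obtained from the hypernetwork are simplified assumptions.

    import torch

    def proximal_l1_step(latent, grad, lr=0.01, reg=1e-3):
        # Gradient step on the task loss, then the proximal operator of lr*reg*||.||_1
        # (soft-thresholding). Entries pushed to exactly zero mark prunable channels.
        z = latent - lr * grad
        return torch.sign(z) * torch.clamp(z.abs() - lr * reg, min=0.0)

    latent = torch.randn(256)            # one latent entry per output channel of a layer
    grad = torch.randn(256)              # gradient flowing back through the hypernetwork
    latent = proximal_l1_step(latent, grad)
    kept_channels = (latent != 0).nonzero().flatten()   # channels that survive pruning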



Paperid:358
Authors:Zheng Xie, Zhiquan Wen, Jing Liu, Zhiqiang Liu, Xixian Wu, Mingkui Tan
Title: Deep Transferring Quantization
Abstract:
Network quantization is an effective method for network compression. Existing methods train a low-precision network by fine-tuning from a pre-trained model. However, training a low-precision network often requires large-scale labeled data to achieve superior performance. In many real-world scenarios, only limited labeled data are available due to expensive labeling costs or privacy protection. With limited training data, fine-tuning methods may suffer from the overfitting issue and substantial accuracy loss. To alleviate these issues, we introduce transfer learning into network quantization to obtain an accurate low-precision model. Specifically, we propose a method named deep transferring quantization (DTQ) to effectively exploit the knowledge in a pre-trained full-precision model. To this end, we propose a learnable attentive transfer module to identify the informative channels for alignment. In addition, we introduce the Kullback–Leibler (KL) divergence to further help train a low-precision model. Extensive experiments on both image classification and face recognition demonstrate the effectiveness of DTQ."



Paperid:359
Authors:Guangyi Chen, Yuhao Lu, Jiwen Lu, Jie Zhou
Title: Deep Credible Metric Learning for Unsupervised Domain Adaptation Person Re-identification
Abstract:
Trained person re-identification systems fundamentally need to be deployed in different target environments. Learning a cross-domain model has great potential for the scalability of real-world applications. In this paper, we propose a deep credible metric learning (DCML) method for unsupervised domain adaptation person re-identification. Unlike existing methods that directly finetune the model in the target domain with pseudo labels generated by the source pre-trained model, our DCML method adaptively mines credible samples for training to avoid being misled by noisy labels. Specifically, we design two credibility metrics for sample mining: the k-Nearest Neighbor similarity for density evaluation and the prototype similarity for centrality evaluation. As the pseudo-label credibility increases, we progressively adjust the sampling strategy during training. In addition, we propose an instance margin spreading loss to further increase instance-wise discrimination. Experimental results demonstrate that our DCML method explores credible and valuable training data and improves the performance of unsupervised domain adaptation."
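A rough sketch of the two credibility cues named above, with cosine similarity standing in for whatever metric the paper actually uses: a density score from the k-nearest-neighbour similarities and a centrality score from the similarity to the pseudo-class prototype. Samples scoring high on both would be treated as credible; the thresholds below are illustrative.

    import numpy as np

    def credibility_scores(feats, pseudo_labels, prototypes, k=10):
        # feats: (N, D) target features, pseudo_labels: (N,) ints, prototypes: (C, D).
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        sim = feats @ feats.T
        np.fill_diagonal(sim, -np.inf)                        # exclude self-similarity
        knn_density = np.sort(sim, axis=1)[:, -k:].mean(axis=1)

        protos = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
        centrality = np.einsum("nd,nd->n", feats, protos[pseudo_labels])
        return knn_density, centrality

    rng = np.random.default_rng(0)
    density, centrality = credibility_scores(rng.normal(size=(500, 128)),
                                             rng.integers(0, 10, 500),
                                             rng.normal(size=(10, 128)))
    credible = (density > np.median(density)) & (centrality > np.median(centrality))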



Paperid:360
Authors:Guangyi Chen, Yongming Rao, Jiwen Lu, Jie Zhou
Title: Temporal Coherence or Temporal Motion: Which is More Critical for Video-based Person Re-identification?
Abstract:
Video-based person re-identification aims to match pedestrians across consecutive video sequences. While a rich line of work focuses solely on extracting motion features from pedestrian videos, we show in this paper that temporal coherence plays a more critical role. To distill the temporal-coherence part of the video representation from frame representations, we propose a simple yet effective Adversarial Feature Augmentation (AFA) method, which highlights the temporal coherence features by introducing adversarial augmented temporal motion noise. Specifically, we disentangle the video representation into temporal coherence and motion parts and randomly change the scale of the temporal motion features as the adversarial noise. The proposed AFA method is a general lightweight component that can be readily incorporated into various methods with negligible cost. We conduct extensive experiments on three challenging datasets including MARS, iLIDS-VID, and DukeMTMC-VideoReID, and the experimental results verify our argument and demonstrate the effectiveness of the proposed method."
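A minimal sketch of the augmentation idea, under the simplifying assumption that the temporal-coherence part is the temporal mean of the frame features and the motion part is the per-frame residual (in the paper this split is learned, not fixed): only the motion part is rescaled by a random factor to act as adversarial noise.

    import torch

    def adversarial_feature_augmentation(frame_feats, max_scale=2.0):
        # frame_feats: (B, T, D) per-frame features of a tracklet.
        coherence = frame_feats.mean(dim=1, keepdim=True)    # (B, 1, D) temporal-coherence part
        motion = frame_feats - coherence                     # (B, T, D) temporal-motion part
        scale = torch.empty(frame_feats.size(0), 1, 1).uniform_(0.0, max_scale)
        return coherence + scale * motion                    # coherence kept, motion perturbed

    augmented = adversarial_feature_augmentation(torch.randn(8, 16, 2048))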



Paperid:361
Authors:Xue Yang, Junchi Yan
Title: Arbitrary-Oriented Object Detection with Circular Smooth Label
Abstract:
Arbitrary-oriented object detection has recently attracted increasing attention in computer vision due to its importance in aerial imagery, scene text, face detection, and other applications. In this paper, we show that existing regression-based rotation detectors suffer from the problem of discontinuous boundaries, which is directly caused by angular periodicity or corner ordering. By a careful study, we find the root cause is that the ideal predictions are beyond the defined range. We design a new rotation detection baseline to address the boundary problem by transforming angular prediction from a regression problem to a classification task with little accuracy loss, whereby high-precision angle classification is devised, in contrast to previous works using coarse granularity in rotation detection. We also propose a circular smooth label (CSL) technique to handle the periodicity of the angle and increase the error tolerance to adjacent angles. We further introduce four window functions in CSL and explore the effect of different window radius sizes on detection performance. Extensive experiments and visual analysis on two large-scale public datasets for aerial images, i.e. DOTA and HRSC2016, as well as the scene text datasets ICDAR2015 and MLT, show the effectiveness of our approach. The code is publicly available at https://github.com/Thinklab-SJTU/CSL_RetinaNet_Tensorflow."
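To make the circular smooth label concrete, here is a small NumPy sketch with a Gaussian window (one of several window functions mentioned above): the ground-truth angle bin receives the peak value, neighbouring bins decay smoothly, and the window wraps around so that the two ends of the angle range are treated as adjacent. The bin count and window radius are illustrative choices.

    import numpy as np

    def circular_smooth_label(angle_bin, num_bins=180, radius=6):
        # Circular distance from every bin to the ground-truth bin.
        bins = np.arange(num_bins)
        d = np.abs(bins - angle_bin % num_bins)
        d = np.minimum(d, num_bins - d)                      # wrap-around handles periodicity
        label = np.exp(-(d ** 2) / (2.0 * radius ** 2))      # Gaussian window
        label[d > radius] = 0.0                              # limit the window support
        return label

    csl = circular_smooth_label(angle_bin=3)                 # peaks at bin 3, spills into bins 177-179 as well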



Paperid:362
Authors:Songnan Lin, Jiawei Zhang, Jinshan Pan, Zhe Jiang, Dongqing Zou, Yongtian Wang, Jing Chen, Jimmy Ren
Title: Learning Event-Driven Video Deblurring and Interpolation
Abstract:
Event-based sensors, which have a response if the change of pixel intensity exceeds a triggering threshold, can capture high-speed motion with microsecond accuracy. Assisted by an event camera, we can generate high frame-rate sharp videos from low frame-rate blurry ones captured by an intensity camera. In this paper, we propose an effective event-driven video deblurring and interpolation algorithm based on deep convolutional neural networks (CNNs). Motivated by the physical model that the residuals between a blurry image and sharp frames are the integrals of events, the proposed network uses events to estimate the residuals for the sharp frame restoration. As the triggering threshold varies spatially, we develop an effective method to estimate dynamic filters to solve this problem. To utilize the temporal information, the sharp frames restored from the previous blurry frame are also considered. The proposed algorithm achieves superior performance against state-of-the-art methods on both synthetic and real datasets."



Paperid:363
Authors:Nelson Nauata, Yasutaka Furukawa
Title: Vectorizing World Buildings: Planar Graph Reconstruction by Primitive Detection and Relationship Inference
Abstract:
This paper tackles a 2D architecture vectorization problem, whose task is to infer an outdoor building architecture as a 2D planar graph from a single RGB image. We provide a new benchmark with ground-truth annotations for 2,001 complex buildings across the cities of Atlanta, Paris, and Las Vegas. We also propose a novel algorithm utilizing 1) convolutional neural networks (CNNs) that detects geometric primitives and infers their relationships and 2) an integer programming (IP) that assembles the information into a 2D planar graph. While being a trivial task for human vision, the inference of a graph structure with an arbitrary topology is still an open problem for computer vision. Qualitative and quantitative evaluations demonstrate that our algorithm makes significant improvements over the current state-of-the-art, towards an intelligent system at the level of human perception. We will share code and data."



Paperid:364
Authors:Hang Wang, Minghao Xu, Bingbing Ni, Wenjun Zhang
Title: Learning to Combine: Knowledge Aggregation for Multi-Source Domain Adaptation
Abstract:
Transferring knowledge learned from multiple source domains to a target domain is a more practical and challenging task than conventional single-source domain adaptation. Furthermore, the increase of modalities brings more difficulty in aligning feature distributions among multiple domains. To mitigate these problems, we propose a Learning to Combine for Multi-Source Domain Adaptation (LtC-MSDA) framework that explores interactions among domains. In a nutshell, a knowledge graph is constructed on the prototypes of various domains to realize information propagation among semantically adjacent representations. On this basis, a graph model is learned to predict query samples under the guidance of correlated prototypes. In addition, we design a Relation Alignment Loss (RAL) to facilitate the consistency of categories' relational interdependency and the compactness of features, which boosts features' intra-class invariance and inter-class separability. Comprehensive results on public benchmark datasets demonstrate that our approach outperforms existing methods by a remarkable margin."



Paperid:365
Authors:Jiahua Dong, Yang Cong, Gan Sun, Yuyang Liu, Xiaowei Xu
Title: CSCL: Critical Semantic-Consistent Learning for Unsupervised Domain Adaptation
Abstract:
Unsupervised domain adaptation, which requires no annotation of the unlabeled target data, has attracted considerable interest in semantic segmentation. However, 1) existing methods neglect that not all semantic representations across domains are transferable, which cripples domain-wise transfer with untransferable knowledge; 2) they fail to narrow the category-wise distribution shift due to category-agnostic feature alignment. To address the above challenges, we develop a new Critical Semantic-Consistent Learning (CSCL) model, which mitigates the discrepancy of both domain-wise and category-wise distributions. Specifically, a critical-transfer-based adversarial framework is designed to highlight transferable domain-wise knowledge while neglecting untransferable knowledge. A transferability critic guides a transferability quantizer to maximize the positive transfer gain in a reinforcement learning manner, even when negative transfer of untransferable knowledge occurs. Meanwhile, with the help of a confidence-guided pseudo-label generator for target samples, a symmetric soft divergence loss is presented to explore inter-class relationships and facilitate category-wise distribution alignment. Experiments on several datasets demonstrate the superiority of our model."



Paperid:366
Authors:Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, Qixiang Ye
Title: Prototype Mixture Models for Few-shot Semantic Segmentation
Abstract:
Few-shot segmentation is challenging because objects within the support and query images could significantly differ in appearance and pose. Using a single prototype acquired directly from the support image to segment the query image causes semantic ambiguity. In this paper, we propose prototype mixture models (PMMs), which correlate diverse image regions with multiple prototypes to enforce the prototype-based semantic representation. Estimated by an Expectation-Maximization algorithm, PMMs incorporate rich channel-wise and spatial semantics from limited support images. Utilized as representations as well as classifiers, PMMs fully leverage the semantics to activate objects in the query image while suppressing background regions in a duplex manner. Extensive experiments on the Pascal VOC and MS-COCO datasets show that PMMs significantly improve upon the state of the art. In particular, PMMs improve 5-shot segmentation performance on MS-COCO by up to 5.82% with only a moderate cost in model size and inference speed."
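The prototype estimation can be pictured as a small EM loop over the masked support features, sketched below with cosine similarity and soft assignments; the number of prototypes, the temperature, and the exact E/M updates in the paper may differ from this toy version.

    import numpy as np

    def prototype_mixture_em(feats, num_prototypes=4, iters=10, tau=0.1, seed=0):
        # feats: (N, D) foreground support features; returns (K, D) prototypes.
        rng = np.random.default_rng(seed)
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        protos = feats[rng.choice(len(feats), num_prototypes, replace=False)]
        for _ in range(iters):
            logits = feats @ protos.T / tau                          # (N, K) similarities
            resp = np.exp(logits - logits.max(axis=1, keepdims=True))
            resp /= resp.sum(axis=1, keepdims=True)                  # E-step: soft assignments
            protos = resp.T @ feats / resp.sum(axis=0)[:, None]      # M-step: weighted means
            protos /= np.linalg.norm(protos, axis=1, keepdims=True)
        return protos

    prototypes = prototype_mixture_em(np.random.default_rng(1).normal(size=(900, 256)))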



Paperid:367
Authors:Jingkang Yang, Litong Feng, Weirong Chen, Xiaopeng Yan, Huabin Zheng , Ping Luo, Wayne Zhang
Title: Webly Supervised Image Classification with Self-Contained Confidence
Abstract:
This paper focuses on webly supervised learning (WSL), where datasets are built by crawling samples from the Internet and adopting search queries directly as their web labels. Although WSL benefits from fast and low-cost data expansion, noisy web labels prevent models from making reliable predictions. To mitigate this problem, a self-labeled supervised loss $\mathcal{L}_s$ is utilized together with the webly supervised loss $\mathcal{L}_w$ in recent works. $\mathcal{L}_s$ relies on machine labels predicted by the model itself. Since whether the web label or the machine label is correct is usually a case-by-case matter for web samples, it is desirable to adjust the balance between $\mathcal{L}_s$ and $\mathcal{L}_w$ at the sample level. Inspired by DNNs' ability on confidence prediction, we introduce self-contained confidence (SCC) by adapting model uncertainty to the WSL setting and use it to balance $\mathcal{L}_s$ and $\mathcal{L}_w$ sample-wise. Thus, a simple yet effective WSL framework is proposed. A series of SCC-friendly approaches are investigated, among which our proposed graph-enhanced mixup stands out as the most effective approach to provide high-quality confidence to boost our framework. The proposed WSL framework achieved state-of-the-art results on two large-scale WSL datasets, WebVision-1000 and Food101-N."
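Stripped down to its loss, the sample-wise balancing amounts to the weighted combination below, where the per-sample confidence would come from the SCC estimate (not shown here, since obtaining it is the paper's contribution); the function and variable names are illustrative.

    import torch
    import torch.nn.functional as F

    def scc_weighted_loss(logits, web_labels, machine_labels, confidence):
        # confidence: (B,) in [0, 1]; high confidence trusts the machine label more.
        l_w = F.cross_entropy(logits, web_labels, reduction="none")       # webly supervised loss
        l_s = F.cross_entropy(logits, machine_labels, reduction="none")   # self-labeled loss
        return (confidence * l_s + (1.0 - confidence) * l_w).mean()

    loss = scc_weighted_loss(torch.randn(16, 1000),
                             torch.randint(0, 1000, (16,)),
                             torch.randint(0, 1000, (16,)),
                             torch.rand(16))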



Paperid:368
Authors:Haibao Yu, Qi Han, Jianbo Li, Jianping Shi, Guangliang Cheng, Bin Fan
Title: Search What You Want: Barrier Penalty NAS for Mixed Precision Quantization
Abstract:
Emerging hardware can support mixed-precision CNN model inference that assigns different bitwidths to different layers. Learning to find an optimal mixed-precision model that preserves accuracy and satisfies the specific constraints on model size and computation is extremely challenging due to the difficulty of training a mixed-precision model and the huge space of all possible bit quantizations. In this paper, we propose a novel soft Barrier Penalty based NAS (BP-NAS) for mixed-precision quantization, which ensures all the searched models are inside the valid domain defined by the complexity constraint, and thus can return an optimal model under the given constraint by conducting the search only once. The proposed soft barrier penalty is differentiable and can impose very large losses on those models outside the valid domain while giving almost no punishment to models inside the valid domain, thus constraining the search to the feasible domain. In addition, a differentiable Prob-1 regularizer is proposed to ensure that learning with NAS is reasonable. A distribution reshaping training strategy is also used to make training more stable. BP-NAS sets a new state of the art on both classification (Cifar-10, ImageNet) and detection (COCO), surpassing all the efficient mixed-precision methods designed manually and automatically. In particular, BP-NAS achieves higher mAP (up to 2.7% mAP improvement) together with lower bit computation cost compared with the existing best mixed-precision model on COCO detection."
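The abstract does not give the exact penalty, so the sketch below only conveys the qualitative shape of a soft barrier: a differentiable term that stays close to zero while the (differentiable) complexity estimate of the sampled architecture is under budget and grows steeply once the budget is exceeded. The softplus form, sharpness, and scale are assumptions, not BP-NAS's formula.

    import torch
    import torch.nn.functional as F

    def soft_barrier_penalty(cost, budget, sharpness=20.0, scale=100.0):
        # ~0 for cost < budget, roughly linear growth (slope ~ scale/budget) beyond it.
        return scale * F.softplus(sharpness * (cost / budget - 1.0)) / sharpness

    bitops = torch.tensor(5.2, requires_grad=True)   # toy expected BitOps of a sampled model
    penalty = soft_barrier_penalty(bitops, budget=torch.tensor(4.0))
    penalty.backward()                               # gradient steers the search back into the valid domain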



Paperid:369
Authors:Xiaoqing Ye, Liang Du, Yifeng Shi, Yingying Li, Xiao Tan, Jianfeng Feng, Errui Ding, Shilei Wen
Title: Monocular 3D Object Detection via Feature Domain Adaptation
Abstract:
Monocular 3D object detection is a challenging task due to unreliable depth, resulting in a distinct performance gap between monocular and LiDAR-based approaches. In this paper, we propose a novel domain adaptation based monocular 3D object detection framework named DA-3Ddet, which adapts the feature from unsound image-based pseudo-LiDAR domain to the accurate real LiDAR domain for performance boosting. In order to solve the overlooked problem of inconsistency between the foreground mask of pseudo and real LiDAR caused by inaccurately estimated depth, we also introduce a context-aware foreground segmentation module which helps to involve relevant points for foreground masking. Extensive experiments on KITTI dataset demonstrate that our simple yet effective framework outperforms other state-of-the-arts by a large margin."



Paperid:370
Authors:Lele Chen, Guofeng Cui, Celong Liu, Zhong Li, Ziyi Kou, Yi Xu, Chenliang Xu
Title: Talking-head Generation with Rhythmic Head Motion
Abstract:
When people deliver a speech, they naturally move their heads, and this rhythmic head motion conveys linguistic information. However, generating a lip-synced video while moving the head naturally is challenging. While remarkably successful, existing works either generate still talking-face videos or rely on landmark/video frames as sparse/dense mapping guidance to generate head movements, which leads to unrealistic or uncontrollable video synthesis. To overcome these limitations, we propose a 3D-aware generative network along with a hybrid embedding module and a non-linear composition module. By modeling the head motion and facial expressions explicitly, manipulating the 3D animation carefully, and embedding reference images dynamically, our approach achieves controllable, photorealistic, and temporally coherent talking-head videos with natural head movements. Thorough experiments on several standard benchmarks demonstrate that our method achieves significantly better results than the state-of-the-art methods in both quantitative and qualitative comparisons. The code is available at https://github.com/lelechen63/Talking-head-Generation-with-Rhythmic-Head-Motion"



Paperid:371
Authors:Xiaofeng Liu, Tong Che, Yiqun Lu, Chao Yang, Site Li, Jane You
Title: AUTO3D: Novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation
Abstract:
This paper targets learning-based novel view synthesis from a single or a limited number of 2D images without pose supervision. In the viewer-centered coordinates, we construct an end-to-end trainable conditional variational framework to disentangle the unsupervisedly learned relative pose/rotation and an implicit global 3D representation (shape, texture, the origin of viewer-centered coordinates, etc.). The global appearance of the 3D object is given by several "appearance-describing" images taken from any number of viewpoints. Our spatial correlation module extracts a global 3D representation from the appearance-describing images in a permutation invariant manner. Our system can achieve implicit 3D understanding without explicit 3D reconstruction. With an unsupervisedly learned viewer-centered relative pose/rotation code, the decoder can hallucinate novel views continuously by sampling the relative pose from a prior distribution. In various applications, we demonstrate that our model can achieve comparable or even better results than pose/3D-model-supervised learning-based novel view synthesis (NVS) methods with any number of input views."



Paperid:372
Authors:Srijan Das, Saurav Sharma, Rui Dai, François Brémond, Monique Thonnat
Title: VPN: Learning Video-Pose Embedding for Activities of Daily Living
Abstract:
In this paper, we focus on the spatio-temporal aspect of recognizing Activities of Daily Living (ADL). ADL have two specific properties: (i) subtle spatio-temporal patterns and (ii) similar visual patterns varying with time. Therefore, ADL may look very similar and often necessitate looking at their fine-grained details to distinguish them. Because the recent spatio-temporal 3D ConvNets are too rigid to capture the subtle visual patterns across an action, we propose a novel Video-Pose Network: VPN. The two key components of this VPN are a spatial embedding and an attention network. The spatial embedding projects the 3D poses and RGB cues into a common semantic space. This enables the action recognition framework to learn better spatio-temporal features exploiting both modalities. In order to discriminate similar actions, the attention network provides two functionalities: (i) an end-to-end learnable pose backbone exploiting the topology of the human body, and (ii) a coupler to provide joint spatio-temporal attention weights across a video. Experiments show that VPN outperforms the state-of-the-art results for action classification on a large-scale human activity dataset: NTU-RGB+D 120, its subset NTU-RGB+D 60, a real-world challenging human activity dataset: Toyota Smarthome, and a small-scale human-object interaction dataset, Northwestern UCLA."



Paperid:373
Authors:Chenchen Zhu, Fangyi Chen, Zhiqiang Shen, Marios Savvides
Title: Soft Anchor-Point Object Detection
Abstract:
Recently, anchor-free detection methods have made great progress. The two major families, anchor-point detection and key-point detection, sit at opposite ends of the speed-accuracy trade-off, with anchor-point detectors having the speed advantage. In this work, we boost the performance of the anchor-point detector over its key-point counterparts while maintaining the speed advantage. To achieve this, we formulate the detection problem from the anchor point's perspective and identify ineffective training as the main problem. Our key insight is that anchor points should be optimized jointly as a group both within and across feature pyramid levels. We propose a simple yet effective training strategy with soft-weighted anchor points and soft-selected pyramid levels to address the false attention issue within each pyramid level and the feature selection issue across all the pyramid levels, respectively. To evaluate the effectiveness, we train a single-stage anchor-free detector called Soft Anchor-Point Detector (SAPD). Experiments show that our concise SAPD pushes the envelope of the speed/accuracy trade-off to a new level, outperforming recent state-of-the-art anchor-free and anchor-based detectors. Without bells and whistles, our best model achieves a single-model, single-scale AP of 47.4% on COCO."



Paperid:374
Authors:Jun Gao, Zian Wang, Jinchen Xuan, Sanja Fidler
Title: Beyond Fixed Grid: Learning Geometric Image Representation with a Deformable Grid
Abstract:
In modern computer vision, images are typically represented as a fixed uniform grid with some stride and processed via a deep convolutional neural network. We argue that deforming the grid to better align with the high-frequency image content is a more effective strategy. We introduce Deformable Grid (DefGrid), a learnable neural network module that predicts location offsets of the vertices of a 2-dimensional triangular grid such that the edges of the deformed grid align with image boundaries. We showcase our DefGrid in a variety of use cases, i.e., by inserting it as a module at various levels of processing. We utilize DefGrid as an end-to-end learnable geometric downsampling layer that replaces standard pooling methods for reducing feature resolution when feeding images into a deep CNN. We show significantly improved results at the same grid resolution compared to using CNNs on uniform grids for the task of semantic segmentation. We also utilize DefGrid at the output layers for the task of object mask annotation, and show that reasoning about object boundaries on our predicted polygonal grid leads to more accurate results over existing pixel-wise and curve-based approaches. We finally showcase DefGrid as a standalone module for unsupervised image segmentation, showing superior performance over existing superpixel-based approaches. Project website: http://www.cs.toronto.edu/~jungao/def-grid ."



Paperid:375
Authors:Hu Wang, Qi Wu, Chunhua Shen
Title: Soft Expert Reward Learning for Vision-and-Language Navigation
Abstract:
Vision-and-Language Navigation (VLN) requires an agent to find a specified spot in an unseen environment by following natural language instructions. Dominant methods based on supervised learning clone the expert's behaviours and thus perform better on seen environments, while showing restricted performance on unseen ones. Reinforcement Learning (RL) based models show better generalisation ability but also have issues, one of which is the need for a large amount of manual reward engineering. In this paper, we introduce a Soft Expert Reward Learning (SERL) model to overcome the reward engineering and generalisation problems of the VLN task. Our proposed method consists of two complementary components: a Soft Expert Distillation (SED) module encourages agents to behave like an expert as much as possible, but in a soft fashion; a Self Perceiving (SP) module aims at pushing the agent towards the final destination as fast as possible. Empirically, we evaluate our model on the VLN seen, unseen and test splits, and the model outperforms the state-of-the-art methods on most of the evaluation metrics."



Paperid:376
Authors:Yongfei Liu, Xiangyi Zhang, Songyang Zhang, Xuming He
Title: Part-aware Prototype Network for Few-shot Semantic Segmentation
Abstract:
Few-shot semantic segmentation aims to learn to segment new object classes with only a few annotated examples, which has a wide range of real-world applications. Most existing methods either focus on the restrictive setting of one-way few-shot segmentation or suffer from incomplete coverage of object regions. In this paper, we propose a novel few-shot semantic segmentation framework based on the prototype representation. Our key idea is to decompose the holistic class representation into a set of part-aware prototypes, capable of capturing diverse and fine-grained object features. In addition, we propose to leverage unlabeled data to enrich our part-aware prototypes, resulting in better modeling of intra-class variations of semantic objects. We develop a novel graph neural network model to generate and enhance the proposed part-aware prototypes based on labeled and unlabeled images. Extensive experimental evaluations on two benchmarks show that our method outperforms the prior art with a sizable margin."



Paperid:377
Authors:Shujun Wang, Lequan Yu, Caizi Li, Chi-Wing Fu, Pheng-Ann Heng
Title: Learning from Extrinsic and Intrinsic Supervisions for Domain Generalization
Abstract:
The generalization capability of neural networks across domains is crucial for real-world applications. We argue that a generalized object recognition system should understand both the relationships among different images and the images themselves at the same time. To this end, we present a new domain generalization framework that learns how to generalize across domains simultaneously from extrinsic relationship supervision and intrinsic self-supervision for images from multi-source domains. To be specific, we formulate our framework with feature embedding using a multi-task learning paradigm. Besides conducting the common supervised recognition task, we seamlessly integrate a momentum metric learning task and a self-supervised auxiliary task to collectively utilize the extrinsic supervision and intrinsic supervision. Also, we develop an effective momentum metric learning scheme with K-hard negative mining to boost the network to capture image relationships for domain generalization. We demonstrate the effectiveness of our approach on two standard object recognition benchmarks, VLCS and PACS, and show that our methods achieve state-of-the-art performance."



Paperid:378
Authors:Mahsa Ehsanpour, Alireza Abedin, Fatemeh Saleh, Javen Shi, Ian Reid , Hamid Rezatofighi
Title: Joint Learning of Social Groups, Individuals Action and Sub-group Activities in Videos
Abstract:
The state-of-the-art solutions for human activity understanding from a video stream formulate the task as a spatio-temporal problem which requires joint localization of all individuals in the scene and classification of their actions or group activity over time. Who is interacting with whom, e.g. not everyone in a queue is interacting with each other, is often not predicted. There are scenarios where people are best split into sub-groups, which we call social groups, and each social group may be engaged in a different social activity. In this paper, we solve the problem of simultaneously grouping people by their social interactions, predicting their individual actions and the social activity of each social group, which we call the social task. Our main contributions are: i) we propose an end-to-end trainable framework for the social task; ii) our proposed method also sets the state-of-the-art results on two widely adopted benchmarks for the traditional group activity recognition task (assuming individuals of the scene form a single group and predicting a single group activity label for the scene); iii) we introduce new annotations on an existing group activity dataset, re-purposing it for the social task."



Paperid:379
Authors:Sheng Jin, Lumin Xu, Jin Xu, Can Wang, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo
Title: Whole-Body Human Pose Estimation in the Wild
Abstract:
This paper investigates the task of 2D human whole-body pose estimation, which aims to localize dense landmarks on the entire human body including face, hands, body, and feet. As existing datasets do not have whole-body annotations, previous methods have to assemble different deep models trained independently on different datasets of the human face, hand, and body, struggling with dataset biases and large model complexity. To fill in this blank, we introduce COCO-WholeBody which extends COCO dataset with whole-body annotations. To our best knowledge, it is the first benchmark that has manual annotations on the entire human body, including 133 dense landmarks with 68 on the face, 42 on hands and 23 on the body and feet. A single-network model, named ZoomNet, is devised to take into account the hierarchical structure of the full human body to solve the scale variation of different body parts of the same person. ZoomNet is able to significantly outperform existing methods on the proposed COCO-WholeBody dataset. Extensive experiments show that COCO-WholeBody not only can be used to train deep models from scratch for whole-body pose estimation but also can serve as a powerful pre-training dataset for many different tasks such as facial landmark detection and hand keypoint estimation. The dataset is publicly available at \url{https://github.com/jin-s13/COCO-WholeBody}."



Paperid:380
Authors:Bo Li, Evgeniy Martyushev, Gim Hee Lee
Title: Relative Pose Estimation of Calibrated Cameras with Known SE(3) Invariants
Abstract:
The $\mathrm{SE}(3)$ invariants of a pose include its rotation angle and screw translation. In this paper, we present a comprehensive study of the relative pose estimation problem for a calibrated camera constrained by a known $\mathrm{SE}(3)$ invariant, which involves 5 minimal problems in total. These problems reduce the minimal number of point pairs for relative pose estimation and improve the estimation efficiency and robustness. The $\mathrm{SE}(3)$ invariant constraints can come from extra sensor measurements or motion assumptions. Different from conventional relative pose estimation with extra constraints, no extrinsic calibration is required to transform the constraints to the camera frame. This advantage comes from the invariance of $\mathrm{SE}(3)$ invariants across different coordinate systems on a rigid body and makes the solvers more convenient and flexible in practical applications. Besides proposing the concept of relative pose estimation constrained by $\mathrm{SE}(3)$ invariants, we present a comprehensive study of existing polynomial formulations for relative pose estimation and discover their relationships. Different formulations are carefully chosen for each proposed problem to achieve the best efficiency. Experiments on synthetic and real data show performance improvements compared to conventional relative pose estimation methods."
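For reference, the two invariants can be read off a relative pose (R, t) as in the NumPy sketch below, using the standard axis-angle relations (this is not taken from the paper's solvers). The rotation angle comes from the trace of R, and the screw translation is the component of t along the rotation axis; both are preserved under a change of rigid-body coordinate frame, which is why no extrinsic calibration is needed.

    import numpy as np

    def se3_invariants(R, t):
        # Rotation angle from the trace of R.
        cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        # Rotation axis (ill-conditioned near theta = 0 or pi; fine for a sketch).
        axis = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
        axis /= np.linalg.norm(axis) + 1e-12
        screw_translation = float(axis @ np.asarray(t))      # translation along the rotation axis
        return theta, screw_translation

    # Toy check: rotation of 0.3 rad about z, translation with 0.5 along z.
    c, s = np.cos(0.3), np.sin(0.3)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    theta, d = se3_invariants(R, np.array([1.0, -2.0, 0.5]))  # -> theta ~= 0.3, d ~= 0.5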



Paperid:381
Authors:Runkai Zheng, Yinqi Zhang, Daolang Huang, Qingliang Chen
Title: Sequential Convolution and Runge-Kutta Residual Architecture for Image Compressed Sensing
Abstract:
In recent years, Deep Neural Networks (DNNs) have empowered Compressed Sensing (CS) substantially and have achieved reconstruction quality and speed far exceeding traditional CS methods. However, several issues remain to be explored before it is practical enough. There are two main challenges in CS: one is to achieve efficient data sampling, and the other is to reconstruct images with high quality. To address these two challenges, this paper proposes a novel Runge-Kutta Convolutional Compressed Sensing Network (RK-CCSNet). In the sensing stage, RK-CCSNet applies a Sequential Convolutional Module (SCM) to gradually compact measurements through a series of convolution filters. In the reconstruction stage, RK-CCSNet establishes a novel Learned Runge-Kutta Block (LRKB) based on the famous Runge-Kutta methods, reformulating the process of image reconstruction as a discrete dynamical system. Finally, the implementation of RK-CCSNet achieves state-of-the-art performance on influential benchmarks with respect to strong baselines, and all the code is available at https://github.com/rkteddy/RK-CCSNet."
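As a hedged illustration of viewing a residual block as one step of a discrete dynamical system, the sketch below wires a small convolutional network f into the classic 4-stage Runge-Kutta update with a learnable step size; the actual LRKB may use different stages, coefficients, and parameterization.

    import torch
    import torch.nn as nn

    class LearnedRKBlock(nn.Module):
        """Hypothetical Runge-Kutta-style residual block (RK4 update with a learnable step)."""
        def __init__(self, channels=64):
            super().__init__()
            self.f = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.ReLU(),
                                   nn.Conv2d(channels, channels, 3, padding=1))
            self.h = nn.Parameter(torch.tensor(0.1))   # learnable step size

        def forward(self, x):
            k1 = self.f(x)
            k2 = self.f(x + 0.5 * self.h * k1)
            k3 = self.f(x + 0.5 * self.h * k2)
            k4 = self.f(x + self.h * k3)
            return x + self.h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

    out = LearnedRKBlock(64)(torch.randn(1, 64, 33, 33))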



Paperid:382
Authors:Qi Han, Kai Zhao, Jun Xu, Ming-Ming Cheng
Title: Deep Hough Transform for Semantic Line Detection
Abstract:
In this paper, we put forward a simple yet effective method to detect meaningful straight lines, a.k.a. semantic lines, in given scenes. Prior methods take line detection as a special case of object detection, while neglecting the inherent characteristics of lines, leading to less efficient and suboptimal results. We propose a one-shot end-to-end framework by incorporating the classical Hough transform into deeply learned representations. By parameterizing lines with slopes and biases, we perform the Hough transform to translate deep representations into the parametric space and then directly detect lines in the parametric space. More concretely, we aggregate features along candidate lines on the feature map plane and then assign the aggregated features to corresponding locations in the parametric domain. Consequently, the problem of detecting semantic lines in the spatial domain is transformed into spotting individual points in the parametric domain, making the post-processing steps, i.e. non-maximal suppression, more efficient. Furthermore, our method makes it easy to extract contextual line features that are critical to accurate line detection. Experimental results on a public dataset demonstrate the advantages of our method over state-of-the-art methods. Code and models will be publicly released."
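The feature aggregation step can be pictured as a Hough-style vote of every spatial location into an (angle, offset) accumulator, as in the plain NumPy sketch below; the paper's operator is a learned, differentiable, and far more efficient version of this loop, and the bin counts here are arbitrary.

    import numpy as np

    def hough_feature_aggregation(feat, num_angles=60, num_rhos=60):
        # feat: (C, H, W) feature map -> (C, num_angles, num_rhos) parametric accumulator.
        C, H, W = feat.shape
        ys, xs = np.mgrid[0:H, 0:W]
        cx, cy = (W - 1) / 2.0, (H - 1) / 2.0
        max_rho = np.hypot(cx, cy) + 1e-6
        accum = np.zeros((C, num_angles, num_rhos))
        for ai, th in enumerate(np.linspace(0.0, np.pi, num_angles, endpoint=False)):
            rho = (xs - cx) * np.cos(th) + (ys - cy) * np.sin(th)
            ri = np.round((rho + max_rho) / (2 * max_rho) * (num_rhos - 1)).astype(int)
            for r in range(num_rhos):
                mask = ri == r
                if mask.any():
                    accum[:, ai, r] = feat[:, mask].sum(axis=1)   # sum features along this candidate line
        return accum

    accum = hough_feature_aggregation(np.random.rand(8, 32, 32))
    # A semantic line then corresponds to a peak at some (angle, offset) location in `accum`.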



Paperid:383
Authors:Weijian Li, Yuhang Lu, Kang Zheng, Haofu Liao, Chihung Lin, Jiebo Luo, Chi-Tung Cheng, Jing Xiao, Le Lu, Chang-Fu Kuo, Shun Miao
Title: Structured Landmark Detection via Topology-Adapting Deep Graph Learning
Abstract:
Image landmark detection aims to automatically identify the locations of predefined fiducial points. Despite recent success in this field, higher-order structural modeling to capture implicit or explicit relationships among anatomical landmarks has not been adequately exploited. In this work, we present a new topology-adapting deep graph learning approach for accurate anatomical facial and medical (e.g., hand, pelvis) landmark detection. The proposed method constructs graph signals leveraging both local image features and global shape features. The adaptive graph topology naturally explores and lands on task-specific structures which are learned end-to-end with two Graph Convolutional Networks (GCNs). Extensive experiments are conducted on three public facial image datasets (WFLW, 300W, and COFW-68) as well as three real-world X-ray medical datasets (Cephalometric (public), Hand and Pelvis). Quantitative comparisons with previous state-of-the-art approaches across all studied datasets indicate superior performance in both robustness and accuracy. Qualitative visualizations of the learned graph topologies demonstrate a physically plausible connectivity lying behind the landmarks."



Paperid:384
Authors:Xiangyu Xu, Hao Chen, Francesc Moreno-Noguer, László A. Jeni, Fernando De la Torre
Title: 3D Human Shape and Pose from a Single Low-Resolution Image with Self-Supervised Learning
Abstract:
3D human shape and pose estimation from monocular images has been an active area of research in computer vision, having a substantial impact on the development of new applications, from activity recognition to creating virtual avatars. Existing deep learning methods for 3D human shape and pose estimation rely on relatively high-resolution input images; however, high-resolution visual content is not always available in several practical scenarios such as video surveillance and sports broadcasting. Low-resolution images in real scenarios can vary in a wide range of sizes, and a model trained in one resolution does not typically degrade gracefully across resolutions. Two common approaches to solve the problem of low-resolution input are applying super-resolution techniques to the input images which may result in visual artifacts, or simply training one model for each resolution, which is impractical in many realistic applications. To address the above issues, this paper proposes a novel algorithm called RSC-Net, which consists of a Resolution-aware network, a Self-supervision loss, and a Contrastive learning scheme. The proposed network is able to learn the 3D body shape and pose across different resolutions with a single model. The self-supervision loss encourages scale-consistency of the output, and the contrastive learning scheme enforces scale-consistency of the deep features. We show that both these new training losses provide robustness when learning 3D shape and pose in a weakly-supervised manner. Extensive experiments demonstrate that the RSC-Net can achieve consistently better results than the state-of-the-art methods for challenging low-resolution images."



Paperid:385
Authors:Prithvijit Chattopadhyay, Yogesh Balaji, Judy Hoffman
Title: Learning to Balance Specificity and Invariance for In and Out of Domain Generalization
Abstract:
We introduce Domain-specific Masks for Generalization, a model for improving both in-domain and out-of-domain generalization performance. For domain generalization, the goal is to learn from a set of source domains to produce a single model that will best generalize to an unseen target domain. As such, many prior approaches focus on learning representations which persist across all source domains with the assumption that these domain agnostic representations will generalize well. However, often individual domains contain characteristics which are unique and when leveraged can significantly aid in-domain recognition performance. To produce a model which best generalizes to both seen and unseen domains, we propose learning domain specific masks. The masks are encouraged to learn a balance of domain-invariant and domain-specific features, thus enabling a model which can benefit from the predictive power of specialized features while retaining the universal applicability of domain-invariant features. We demonstrate competitive performance compared to naive baselines and state-of-the-art methods on both PACS and DomainNet."



Paperid:386
Authors:Taesung Park, Alexei A. Efros, Richard Zhang, Jun-Yan Zhu
Title: Contrastive Learning for Unpaired Image-to-Image Translation
Abstract:
In image-to-image translation, each patch in the output should reflect the content of the corresponding patch in the input, independent of domain. We propose a straightforward method for doing so -- maximizing mutual information between the two, using a framework based on contrastive learning. The method encourages two elements (corresponding patches) to map to a similar point in a learned feature space, relative to other elements (other patches) in the dataset, referred to as negatives. We explore several critical design choices for making contrastive learning effective in the image synthesis setting. Notably, we use a multilayer, patch-based approach, rather than operate on entire images. Furthermore, we draw negatives from within the input image itself, rather than from the rest of the dataset. We demonstrate that our framework enables one-sided translation in the unpaired image-to-image translation setting, while improving quality and reducing training time. In addition, our method can even be extended to the training setting where each ``domain'' is only a single image."
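A minimal numpy sketch of the patchwise contrastive (InfoNCE) objective described above, assuming N patch embeddings from the output image (queries) and the N corresponding input-image patch embeddings (positives), with the remaining patches of the same image serving as negatives; the temperature and shapes are illustrative.

```python
import numpy as np

def patch_nce_loss(q, k, tau=0.07):
    """Patchwise InfoNCE loss.

    q: (N, D) embeddings of N output patches.
    k: (N, D) embeddings of the N corresponding input patches;
       k[i] is the positive for q[i], the other rows act as negatives.
    """
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / tau                       # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (matching patch) as the target class.
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
q = rng.standard_normal((256, 128))
k = q + 0.01 * rng.standard_normal((256, 128))   # near-identical positives
print(patch_nce_loss(q, k))                      # small loss, as expected
```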



Paperid:387
Authors:Ye Yuan, Kris Kitani
Title: DLow: Diversifying Latent Flows for Diverse Human Motion Prediction
Abstract:
Deep generative models are often used for human motion prediction as they are able to model multi-modal data distributions and characterize diverse human behavior. While much care has been taken into designing and learning deep generative models, how to efficiently produce diverse samples from a deep generative model after it has been trained is still an under-explored problem. To obtain samples from a pretrained generative model, most existing generative human motion prediction methods draw a set of independent Gaussian latent codes and convert them to motion samples. Clearly, this random sampling strategy is not guaranteed to produce diverse samples for two reasons: (1) The independent sampling cannot force the samples to be diverse; (2) The sampling is based solely on likelihood which may only produce samples that correspond to the major modes of the data distribution. To address these problems, we propose a novel sampling method, Diversifying Latent Flows (DLow), to produce a diverse set of samples from a pretrained deep generative model. Unlike random (independent) sampling, the proposed DLow sampling method samples a single random variable and then maps it with a set of learnable mapping functions to a set of correlated latent codes. The correlated latent codes are then decoded into a set of correlated samples. During training, DLow uses a diversity-promoting prior over samples as an objective to optimize the latent mappings to improve sample diversity. The design of the prior is highly flexible and can be customized to generate diverse motions with common features (e.g., similar leg motion but diverse upper-body motion). Our experiments demonstrate that DLow outperforms state-of-the-art baseline methods in terms of sample diversity and accuracy."
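A minimal numpy sketch of the DLow-style sampling idea under stated assumptions: a single shared noise draw is pushed through a set of affine mappings to give correlated latent codes, and a simple mean pairwise-distance term stands in for the diversity-promoting prior; the decoder, the affine parameterization, and the prior here are illustrative stand-ins, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 5, 16                      # number of samples, latent dimension

# Learnable affine mappings z_k = A_k eps + b_k (random init here).
A = 0.5 * rng.standard_normal((K, D, D))
b = rng.standard_normal((K, D))

def dlow_sample(eps):
    """Map one shared noise draw to K correlated latent codes."""
    return np.einsum('kij,j->ki', A, eps) + b      # (K, D)

def diversity_prior(samples):
    """Diversity-promoting energy: mean pairwise squared distance
    between decoded samples (to be maximized during training)."""
    diff = samples[:, None, :] - samples[None, :, :]
    return np.mean(np.sum(diff ** 2, axis=-1))

decoder = lambda z: np.tanh(z)     # stand-in for the pretrained generator

eps = rng.standard_normal(D)       # a single shared random variable
z = dlow_sample(eps)
motions = decoder(z)
print(z.shape, diversity_prior(motions))
```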



Paperid:388
Authors:Haozhe Xie, Hongxun Yao, Shangchen Zhou, Jiageng Mao, Shengping Zhang, Wenxiu Sun
Title: GRNet: Gridding Residual Network for Dense Point Cloud Completion
Abstract:
Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications. Mainstream methods (e.g., PCN and TopNet) use Multi-layer Perceptrons (MLPs) to directly process point clouds, which may cause the loss of details because the structure and context of point clouds are not fully considered. To solve this problem, we introduce 3D grids as intermediate representations to regularize unordered point clouds. We therefore propose a novel Gridding Residual Network (GRNet) for point cloud completion. In particular, we devise two novel differentiable layers, named Gridding and Gridding Reverse, to convert between point clouds and 3D grids without losing structural information. We also present the differentiable Cubic Feature Sampling layer to extract features of neighboring points, which preserves context information. In addition, we design a new loss function, namely Gridding Loss, to calculate the L1 distance between the 3D grids of the predicted and ground truth point clouds, which is helpful to recover details. Experimental results indicate that the proposed GRNet performs favorably against state-of-the-art methods on the ShapeNet, Completion3D, and KITTI benchmarks."
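A minimal numpy sketch of a Gridding-style layer: each point spreads a unit of weight over the eight corners of its grid cell using trilinear weights, turning the unordered point cloud into a regular 3D grid. The resolution and the exact weighting function used in GRNet may differ; this only illustrates the scatter operation.

```python
import numpy as np

def gridding(points, resolution=32):
    """Scatter points in [0, 1)^3 onto a regular grid via trilinear weights."""
    grid = np.zeros((resolution,) * 3)
    scaled = np.clip(points, 0.0, 1.0 - 1e-6) * (resolution - 1)
    base = np.floor(scaled).astype(int)          # lower corner of each cell
    frac = scaled - base                         # fractional position inside

    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                # Trilinear weight of this corner for every point.
                w = (np.where(dx, frac[:, 0], 1 - frac[:, 0]) *
                     np.where(dy, frac[:, 1], 1 - frac[:, 1]) *
                     np.where(dz, frac[:, 2], 1 - frac[:, 2]))
                idx = base + np.array([dx, dy, dz])
                np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), w)
    return grid

pts = np.random.default_rng(0).random((2048, 3))
g = gridding(pts)
print(g.shape, g.sum())   # ~2048: each point's corner weights sum to 1
```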



Paperid:389
Authors:Saihui Hou, Chunshui Cao, Xu Liu, Yongzhen Huang
Title: Gait Lateral Network: Learning Discriminative and Compact Representations for Gait Recognition
Abstract:
Gait recognition aims at identifying different people by the walking patterns, which can be conducted at a long distance without the cooperation of subjects. A key challenge for gait recognition is to learn representations from the silhouettes that are invariant to the factors such as clothing, carrying conditions and camera viewpoints. Besides being discriminative for identification, the gait representations should also be compact for storage to keep millions of subjects registered in the gallery. In this work, we propose a novel network named Gait Lateral Network (GLN) which can learn both discriminative and compact representations from the silhouettes for gait recognition. Specifically, GLN leverages the inherent feature pyramid in deep convolutional neural networks to enhance the gait representations. The silhouette-level and set-level features extracted by different stages are merged with the lateral connections in a top-down manner. Besides, GLN is equipped with a Compact Block which can significantly reduce the dimension of the gait representations without hindering the accuracy. Extensive experiments on CASIA-B and OUMVLP show that GLN can achieve state-of-the-art performance using the $256$-dimensional representations. Under the most challenging condition of walking in different clothes on CASIA-B, our method improves the rank-1 accuracy by $6.45\%$."



Paperid:390
Authors:Xiaoming Li, Chaofeng Chen, Shangchen Zhou, Xianhui Lin, Wangmeng Zuo, Lei Zhang
Title: Blind Face Restoration via Deep Multi-scale Component Dictionaries
Abstract:
Recent reference-based face restoration methods have received considerable attention due to their great capability in recovering high-frequency details on real low-quality images. However, most of these methods require a high-quality reference image of the same identity, making them only applicable in limited scenes. To address this issue, this paper suggests a deep face dictionary network (termed as DFDNet) to guide the restoration process of degraded observations. To begin with, we use K-means to generate deep dictionaries for perceptually significant face components (i.e., left/right eyes, nose and mouth) from high-quality images. Next, with the degraded input, we match and select the most similar component features from their corresponding dictionaries and transfer the high-quality details to the input via the proposed dictionary feature transfer (DFT) block. In particular, component AdaIN is leveraged to eliminate the style diversity between the input and dictionary features (e.g., illumination), and a confidence score is proposed to adaptively fuse the dictionary feature to the input. Finally, multi-scale dictionaries are adopted in a progressive manner to enable the coarse-to-fine restoration. Experiments show that our proposed method can achieve plausible performance in both quantitative and qualitative evaluation, and more importantly, can generate realistic and promising results on real degraded images without requiring an identity-belonging reference. The source code and models are available at https://github.com/csxmli2016/DFDNet."



Paperid:391
Authors:Byungjoo Kim, Bryce Chudomelka, Jinyoung Park, Jaewoo Kang, Youngjoon Hong, Hyunwoo J. Kim
Title: Robust Neural Networks inspired by Strong Stability Preserving Runge-Kutta methods
Abstract:
Deep neural networks have achieved state-of-the-art performance in a variety of fields. Recent works observe that a class of widely used neural networks can be viewed as the Euler method of numerical discretization. From the numerical discretization perspective, Strong Stability Preserving (SSP) methods are more advanced techniques than the explicit Euler method and produce both accurate and stable solutions. Motivated by the SSP property and a generalized Runge-Kutta method, we propose Strong Stability Preserving networks (SSP networks), which improve robustness against adversarial attacks. We empirically demonstrate that the proposed networks improve robustness against adversarial examples without any defensive methods. Further, the SSP networks are complementary to a state-of-the-art adversarial training scheme. Lastly, our experiments show that SSP networks suppress the blow-up of adversarial perturbations. Our results open up a way to study robust architectures of neural networks leveraging rich knowledge from the numerical discretization literature."
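As a concrete example of the SSP family referenced above, the sketch below writes the standard third-order SSP Runge-Kutta scheme (Shu-Osher form) as a residual-style block, with a toy function in place of the learned layer; how the paper arranges these stages inside a network is not reproduced here.

```python
import numpy as np

def ssp_rk3_block(x, f, h=1.0):
    """One SSP-RK3 update, written as convex combinations of Euler steps:

    u1    = u + h f(u)
    u2    = 3/4 u + 1/4 (u1 + h f(u1))
    u_new = 1/3 u + 2/3 (u2 + h f(u2))
    """
    u1 = x + h * f(x)
    u2 = 0.75 * x + 0.25 * (u1 + h * f(u1))
    return (1.0 / 3.0) * x + (2.0 / 3.0) * (u2 + h * f(u2))

rng = np.random.default_rng(0)
W = 0.05 * rng.standard_normal((16, 16))
f = lambda x: np.maximum(W @ x, 0.0)   # stand-in for a learned layer

x = rng.standard_normal(16)
print(ssp_rk3_block(x, f).shape)       # (16,)
```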



Paperid:392
Authors:Evangelos Sariyanidi, Casey J. Zampella, Robert T. Schultz, Birkan Tunc
Title: Inequality-Constrained and Robust 3D Face Model Fitting
Abstract:
Fitting 3D morphable models (3DMMs) on faces is a well-studied problem, motivated by various industrial and research applications. 3DMMs express a 3D facial shape as a linear sum of basis functions. The resulting shape, however, is a plausible face only when the basis coefficients take values within limited intervals. Methods based on unconstrained optimization address this issue with a weighted $\ell_2$ penalty on coefficients; however, determining the weight of this penalty is difficult, and the existence of a unique weight that works universally is questionable. We propose a new formulation that does not require the tuning of any weight parameter. Specifically, we formulate 3DMM fitting as an inequality-constrained optimization problem, where the primary constraint is that basis coefficients should not exceed the intervals that are learned when the 3DMM is constructed. We employ additional constraints to exploit sparse landmark detectors by enforcing the facial shape to be within the error bounds of a reliable detector. To enable operation ``in-the-wild'', we use a robust objective function, namely gradient correlation. Our approach performs comparably with deep learning (DL) methods on ``in-the-wild'' data, which have inexact ground truth, and better than DL methods on more controlled data with exact ground truth. Since our formulation does not require any learning, it enjoys a versatility that allows it to operate with multiple frames of arbitrary sizes. This study's results encourage further research on 3DMM fitting with inequality-constrained optimization methods, which have remained unexplored compared to unconstrained methods."
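A minimal sketch of the inequality-constrained formulation, assuming a synthetic linear 3DMM and a plain least-squares landmark objective: projected gradient descent keeps the basis coefficients inside per-coefficient intervals after every step. The paper's robust gradient-correlation objective and its actual solver are not reproduced; everything below is an illustrative stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pts, n_basis = 68, 20                      # landmarks, basis functions
mean = rng.standard_normal((n_pts, 3))
basis = 0.1 * rng.standard_normal((n_pts, 3, n_basis))
bounds = 3.0 * np.ones(n_basis)              # constraint: |c_i| <= bound_i

# Synthetic target landmarks generated from in-range coefficients.
c_true = rng.uniform(-1.0, 1.0, n_basis)
target = mean + basis @ c_true

def fit_constrained(target, steps=500, lr=0.05):
    """Projected gradient descent on 0.5 * ||mean + B c - target||^2
    subject to the box constraints -bound <= c <= bound."""
    c = np.zeros(n_basis)
    for _ in range(steps):
        residual = mean + basis @ c - target            # (n_pts, 3)
        grad = np.einsum('pdk,pd->k', basis, residual)  # gradient w.r.t. c
        c -= lr * grad
        c = np.clip(c, -bounds, bounds)                 # projection step
    return c

c_hat = fit_constrained(target)
print(np.abs(c_hat - c_true).max())   # small reconstruction error
```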



Paperid:393
Authors:Juan C. Pérez, Motasem Alfarra, Guillaume Jeanneret, Adel Bibi, Ali Thabet, Bernard Ghanem, Pablo Arbeláez
Title: Gabor Layers Enhance Network Robustness
Abstract:
We revisit the benefits of merging classical vision concepts with deep learning models. In particular, we explore the effect of replacing the first layers of various deep architectures with Gabor layers (i.e. convolutional layers with filters that are based on learnable Gabor parameters) on robustness against adversarial attacks. We observe that architectures with Gabor layers gain a consistent boost in robustness over regular models and maintain high generalizing test performance. We then exploit the analytical expression of Gabor filters to derive a compact expression for a Lipschitz constant of such filters, and harness this theoretical result to develop a regularizer we use during training to further enhance network robustness. We conduct extensive experiments with various architectures (LeNet, AlexNet, VGG16, and WideResNet) on several datasets (MNIST, SVHN, CIFAR10 and CIFAR100) and demonstrate large empirical robustness gains. Furthermore, we experimentally show how our regularizer provides consistent robustness improvements."
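A minimal numpy sketch of what a Gabor layer parameterizes: a convolution filter generated from a handful of Gabor parameters (envelope width, orientation, wavelength, phase, aspect ratio) rather than free-form weights; the filter size and parameter values are arbitrary.

```python
import numpy as np

def gabor_filter(size, sigma, theta, lam, psi, gamma):
    """Real-valued Gabor filter on a size x size grid.

    sigma: Gaussian envelope width, theta: orientation,
    lam: sinusoid wavelength, psi: phase offset, gamma: aspect ratio.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_t / lam + psi)
    return envelope * carrier

# A small bank of 4 orientations, as the first layer of a network might use.
bank = np.stack([gabor_filter(11, sigma=3.0, theta=t, lam=6.0, psi=0.0, gamma=0.5)
                 for t in np.linspace(0.0, np.pi, 4, endpoint=False)])
print(bank.shape)   # (4, 11, 11)
```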



Paperid:394
Authors:Shuchen Weng, Wenbo Li, Dawei Li, Hongxia Jin, Boxin Shi
Title: Conditional Image Repainting via Semantic Bridge and Piecewise Value Function
Abstract:
We study conditional image repainting, where a model is trained to generate visual content conditioned on user inputs and to composite the generated content seamlessly onto a user-provided image while preserving the semantics of the users' inputs. The content generation community has long pursued lowering the skill barriers of such tools. Using human language for this purpose is a double-edged sword: language is friendly to users but poses great difficulties for the model in associating relevant words with semantically ambiguous regions. To resolve this issue, we propose a delicate mechanism which bridges the semantic chasm between the language input and the generated visual content. The state-of-the-art image compositing techniques impose a latent ceiling on the fidelity of the composited content during the adversarial training process. In this work, we improve compositing by breaking through this latent ceiling using a novel piecewise value function. We demonstrate on two datasets that the proposed techniques better assist conditional image repainting compared to existing ones."



Paperid:395
Authors:Taihong Xiao, Jinwei Yuan, Deqing Sun, Qifei Wang, Xin-Yu Zhang, Kehan Xu, Ming-Hsuan Yang
Title: Learnable Cost Volume Using the Cayley Representation
Abstract:
Cost volume is an essential component of recent deep models for optical flow estimation and is usually constructed by calculating the inner product between two feature vectors. However, the standard inner product in the commonly-used cost volume may limit the representation capacity of flow models because it neglects the correlation among different channel dimensions and weighs each dimension equally. To address this issue, we propose a learnable cost volume (LCV) using an elliptical inner product, which generalizes the standard inner product by a positive definite kernel matrix. To guarantee its positive definiteness, we perform spectral decomposition on the kernel matrix and re-parameterize it via the Cayley representation. The proposed LCV is a lightweight module and can be easily plugged into existing models to replace the vanilla cost volume. Experimental results show that the LCV module not only improves the accuracy of state-of-the-art models on standard benchmarks, but also promotes their robustness against illumination change, noises, and adversarial perturbations of the input signals."
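A minimal numpy sketch of the construction described above: the Cayley transform maps a skew-symmetric parameter matrix to an orthogonal matrix, exponentiated parameters supply positive eigenvalues, and together they give a positive-definite kernel for the elliptical inner product; the dimensionality and initialization are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8                                   # feature (channel) dimension

# Learnable parameters: a free matrix for the skew part, log-eigenvalues.
S = rng.standard_normal((D, D))
log_lam = 0.1 * rng.standard_normal(D)

def elliptical_kernel(S, log_lam):
    """Positive-definite kernel M = Q diag(exp(log_lam)) Q^T, with Q the
    Cayley transform of the skew-symmetric part of S."""
    A = 0.5 * (S - S.T)                          # skew-symmetric matrix
    I = np.eye(D)
    Q = (I - A) @ np.linalg.inv(I + A)           # orthogonal by construction
    return Q @ np.diag(np.exp(log_lam)) @ Q.T

M = elliptical_kernel(S, log_lam)
f1, f2 = rng.standard_normal(D), rng.standard_normal(D)
cost = f1 @ M @ f2                               # learnable cost-volume entry
print(cost, np.allclose(M, M.T), np.all(np.linalg.eigvalsh(M) > 0))
```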



Paperid:396
Authors:Chaojian Li, Tianlong Chen, Haoran You, Zhangyang Wang, Yingyan Lin
Title: HALO: Hardware-Aware Learning to Optimize
Abstract:
There has been an explosive demand for bringing machine learning (ML) powered intelligence into numerous Internet-of-Things (IoT) devices. However, the effectiveness of such intelligent functionality requires in-situ continuous model adaptation for adapting to new data and environments, while the on-device computing and energy resources are usually extremely constrained. Neither traditional hand-crafted (e.g., SGD, Adagrad, and Adam) nor existing meta optimizers are specifically designed to meet those challenges, as the former requires tedious hyper-parameter tuning while the latter are often costly due to the meta algorithms’ own overhead. To this end, we propose hardware-aware learning to optimize (HALO), a practical meta optimizer dedicated to resource-efficient on-device adaptation. Our HALO optimizer features the following highlights: (1) faster adaptation speed (i.e., taking fewer data or iterations to reach a specified accuracy) by introducing a new regularizer to promote empirical generalization; and (2) lower per-iteration complexity, thanks to a stochastic structural sparsity regularizer being enforced. Furthermore, the optimizer itself is designed as a very light-weight RNN and thus incurs negligible overhead. Ablation studies and experiments on five datasets, six optimizees, and two state-of-the-art (SOTA) edge AI devices validate that, while always achieving a better accuracy (↑0.46% - ↑20.28%), HALO can greatly trim down the energy cost (up to ↓60%) in adaptation, quantified using an IoT device or SOTA simulator. Codes and pre-trained models are at https://github.com/RICE-EIC/HALO ."



Paperid:397
Authors:Jia Zheng, Junfei Zhang, Jing Li, Rui Tang, Shenghua Gao, Zihan Zhou
Title: Structured3D: A Large Photo-realistic Dataset for Structured 3D Modeling
Abstract:
Recently, there has been growing interest in developing learning-based methods to detect and utilize salient semi-global or global structures, such as junctions, lines, planes, cuboids, smooth surfaces, and all types of symmetries, for 3D scene modeling and understanding. However, the ground truth annotations are often obtained via human labor, which is particularly challenging and inefficient for such tasks due to the large number of 3D structure instances (e.g., line segments) and other factors such as viewpoints and occlusions. In this paper, we present a new synthetic dataset, Structured3D, with the aim of providing large-scale photo-realistic images with rich 3D structure annotations for a wide spectrum of structured 3D modeling tasks. We take advantage of the availability of professional interior designs and automatically extract 3D structures from them. We generate high-quality images with an industry-leading rendering engine. We use our synthetic dataset in combination with real images to train deep networks for room layout estimation and demonstrate improved performance on benchmark datasets."



Paperid:398
Authors:Yonghyun Kim, Wonpyo Park, Jongju Shin
Title: BroadFace: Looking at Tens of Thousands of People at Once for Face Recognition
Abstract:
The datasets of face recognition contain an enormous number of identities and instances. However, conventional methods have difficulty in reflecting the entire distribution of the datasets because a mini-batch of small size contains only a small portion of all identities. To overcome this difficulty, we propose a novel method called BroadFace, which is a learning process that comprehensively considers a massive set of identities. In BroadFace, a linear classifier learns optimal decision boundaries among identities from a large number of embedding vectors accumulated over past iterations. By referring to more instances at once, the optimality of the classifier is naturally increased on the entire dataset. Thus, the encoder is also globally optimized by referring to the weight matrix of the classifier. Moreover, we propose a novel compensation method to increase the number of referenced instances in the training stage. BroadFace can be easily applied to many existing methods to accelerate the learning process and obtain a significant improvement in accuracy without extra computational burden at the inference stage. We perform extensive ablation studies and experiments on various datasets to show the effectiveness of BroadFace, and also empirically validate our compensation method. BroadFace achieves state-of-the-art results with significant improvements on nine datasets in 1:1 face verification and 1:N face identification tasks, and is also effective in image retrieval."



Paperid:399
Authors:Xinzhe Han, Shuhui Wang, Chi Su, Weigang Zhang, Qingming Huang, Qi Tian
Title: Interpretable Visual Reasoning via Probabilistic Formulation under Natural Supervision
Abstract:
Visual reasoning is crucial for visual question answering (VQA). However, without labelled programs, implicit reasoning under natural supervision is still quite challenging, and previous models are hard to interpret. In this paper, we rethink the implicit reasoning process in VQA and propose a new formulation which maximizes the log-likelihood of the joint distribution over the observed question and predicted answer. Accordingly, we derive a Temporal Reasoning Network (TRN) framework which models the implicit reasoning process as sequential planning in latent space. Our model is interpretable both in its probabilistic model design and in its reasoning process via visualization. We experimentally demonstrate that TRN can support implicit reasoning across various datasets. Our model is competitive with existing implicit reasoning models and surpasses the baseline by a large margin on complicated reasoning tasks, without extra computation cost in the forward stage."



Paperid:400
Authors:Sujoy Paul, Yi-Hsuan Tsai, Samuel Schulter, Amit K. Roy-Chowdhury, Manmohan Chandraker
Title: Domain Adaptive Semantic Segmentation Using Weak Labels
Abstract:
We propose a novel framework for domain adaptation in semantic segmentation with image-level weak labels in the target domain. The weak labels may be obtained based on a model prediction for unsupervised domain adaptation (UDA), or from a human oracle in a new weakly-supervised domain adaptation (WDA) paradigm for semantic segmentation. Using weak labels is both practical and useful, since (i) collecting image-level target annotations is comparably cheap in WDA and incurs no cost in UDA, and (ii) it opens the opportunity for category-wise domain alignment. Our framework uses weak labels to enable the interplay between feature alignment and pseudo-labeling, improving both in the process of domain adaptation. Specifically, we develop a weak-label classification module to enforce the network to attend to certain categories, and then use such training signals to guide the proposed category-wise alignment method. In experiments, we show considerable improvements with respect to the existing state-of-the-arts in UDA and present a new benchmark in the WDA setting."



Paperid:401
Authors:Guodong Xu, Ziwei Liu, Xiaoxiao Li, Chen Change Loy
Title: Knowledge Distillation Meets Self-Supervision
Abstract:
Knowledge distillation, which involves extracting the ""dark knowledge"" from a teacher network to guide the learning of a student network, has emerged as an important technique for model compression and transfer learning. Unlike previous works that exploit architecture-specific cues such as activation and attention for distillation, here we wish to explore a more general and model-agnostic approach for extracting ""richer dark knowledge"" from the pre-trained teacher model. We show that the seemingly different self-supervision task can serve as a simple yet powerful solution. For example, when performing contrastive learning between transformed entities, the noisy predictions of the teacher network reflect its intrinsic composition of semantic and pose information. By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student. In this paper, we discuss practical ways to exploit those noisy self-supervision signals with selective transfer for distillation. We further show that self-supervision signals improve conventional distillation with substantial gains under few-shot and noisy-label scenarios. Given the richer knowledge mined from self-supervision, our knowledge distillation approach achieves state-of-the-art performance on standard benchmarks, i.e., CIFAR100 and ImageNet, under both similar-architecture and cross-architecture settings. The advantage is even more pronounced under the cross-architecture setting, where our method outperforms the state of the art CRD by an average of 2.3% in accuracy rate on CIFAR100 across six different teacher-student pairs."



Paperid:402
Authors:Ignacio Rocco, Relja Arandjelović, Josef Sivic
Title: Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions
Abstract:
In this work we target the problem of estimating accurately localised correspondences between a pair of images. We adopt the recent Neighbourhood Consensus Networks, which have demonstrated promising performance for difficult correspondence problems, and propose modifications to overcome their main limitations: large memory consumption, large inference time and poorly localised correspondences. Our proposed modifications can reduce the memory footprint and execution time by more than $10\times$, with equivalent results. This is achieved by sparsifying the correlation tensor containing tentative matches, and by its subsequent processing with a 4D CNN using submanifold sparse convolutions. Localisation accuracy is significantly improved by processing the input images in higher resolution, which is possible due to the reduced memory footprint, and by a novel two-stage correspondence relocalisation module. The proposed Sparse-NCNet method obtains state-of-the-art results on the HPatches Sequences and InLoc visual localisation benchmarks, and competitive results on the Aachen Day-Night benchmark."



Paperid:403
Authors:Ioannis Marras, Grigorios G. Chrysos, Ioannis Alexiou, Gregory Slabaugh, Stefanos Zafeiriou
Title: Reconstructing the Noise Variance Manifold for Image Denoising
Abstract:
Deep Convolutional Neural Networks (CNNs) have been successfully used in many low-level vision problems like image denoising. Although the conditional image generation techniques have led to large improvements in this task, there has been little effort in providing conditional generative adversarial networks (cGANs) with an explicit way of understanding the image noise for object-independent denoising reliable for real-world applications. The task of leveraging structures in the target space is unstable due to the complexity of patterns in natural scenes, so the presence of unnatural artifacts or over-smoothed image areas cannot be avoided. To fill the gap, in this work we introduce the idea of a cGAN which explicitly leverages structure in the image noise variance space. By learning directly a low dimensional manifold of the image noise variance, the generator promotes the removal from the noisy image only that information which spans this manifold. This idea brings many advantages while it can be appended at the end of any denoiser to significantly improve its performance. Based on our experiments, our model substantially outperforms existing state-of-the-art architectures, resulting in denoised images with less oversmoothing and better detail."



Paperid:404
Authors:Xiaoxiao Long, Lingjie Liu, Christian Theobalt, Wenping Wang
Title: Occlusion-Aware Depth Estimation with Adaptive Normal Constraints
Abstract:
We present a new learning-based method for multi-frame depth estimation from a color video, which is a fundamental problem in scene understanding, robot navigation or handheld 3D reconstruction. While recent learning-based methods estimate depth at high accuracy, 3D point clouds exported from their depth maps often fail to preserve important geometric feature (e.g., corners, edges, planes) of man-made scenes. Widely-used pixel-wise depth errors do not specifically penalize inconsistency on these features. These inaccuracies are particularly severe when subsequent depth reconstructions are accumulated in an attempt to scan a full environment with man-made objects with this kind of features. Our depth estimation algorithm therefore introduces a Combined Normal Map (CNM) constraint, which is designed to better preserve high-curvature features and global planar regions. In order to further improve the depth estimation accuracy, we introduce a new occlusion-aware strategy that aggregates initial depth predictions from multiple adjacent views into one final depth map and occlusion probability map for the current reference view. With the CNM constraint and the occlusion-aware loss, our method outperforms the state-of-the-art in terms of depth estimation accuracy, and preserves essential geometric features of man-made indoor scenes much better than other algorithms."



Paperid:405
Authors:Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman
Title: VisualEchoes: Spatial Image Representation Learning through Echolocation
Abstract:
Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First we capture echo responses in photo-realistic 3D indoor scene environments. Then we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation. We show that the learned image features are useful for multiple downstream vision tasks requiring spatial reasoning---monocular depth estimation, surface normal estimation, and visual navigation---with results comparable or even better than heavily supervised pre-training. Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world."



Paperid:406
Authors:Andrew Brown, Weidi Xie, Vicky Kalogeiton, Andrew Zisserman
Title: Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval
Abstract:
Optimising a ranking-based metric, such as Average Precision (AP), is notoriously challenging due to the fact that it is non-differentiable, and hence cannot be optimised directly using gradient-descent methods. To this end, we introduce an objective that optimises instead a smoothed approximation of AP, coined Smooth-AP. Smooth-AP is a plug-and-play objective function that allows for end-to-end training of deep networks with a simple and elegant implementation. We also present an analysis for why directly optimising the ranking based metric of AP offers benefits over other deep metric learning losses. We apply Smooth-AP to standard retrieval benchmarks: Stanford Online products and VehicleID, and also evaluate on larger-scale datasets: INaturalist for fine-grained category retrieval, and VGGFace2 and IJB-C for face retrieval. In all cases, we improve the performance over the state-of-the-art, especially for larger-scale datasets, thus demonstrating the effectiveness and scalability of Smooth-AP to real-world scenarios. "
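A minimal numpy sketch of the Smooth-AP idea for a single query: the hard ranking indicator inside Average Precision is replaced by a temperature-scaled sigmoid of score differences, yielding a differentiable surrogate; the temperature and the toy example are illustrative.

```python
import numpy as np

def smooth_ap(scores, labels, tau=0.01):
    """Differentiable approximation of Average Precision for one query.

    scores: (N,) similarity of the query to N gallery items.
    labels: (N,) 1 for relevant (positive) items, 0 otherwise.
    The hard indicator 1[s_j > s_i] in the ranking is replaced by a
    sigmoid of the score difference with temperature tau.
    """
    sig = lambda x: 1.0 / (1.0 + np.exp(-x / tau))
    diff = scores[None, :] - scores[:, None]      # diff[i, j] = s_j - s_i
    pos = labels.astype(bool)

    # Smoothed rank of each positive among all items and among positives
    # (subtracting sig(0) removes the self-comparison term).
    rank_all = 1.0 + np.sum(sig(diff[pos]), axis=1) - sig(0.0)
    rank_pos = 1.0 + np.sum(sig(diff[pos][:, pos]), axis=1) - sig(0.0)
    return float(np.mean(rank_pos / rank_all))

scores = np.array([0.9, 0.8, 0.3, 0.2, 0.1])
labels = np.array([1, 0, 1, 0, 0])
print(smooth_ap(scores, labels))   # close to the exact AP of 0.8333
```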



Paperid:407
Authors:Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon Shlens
Title: Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation
Abstract:
Supervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human-annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences and extra images to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks, reaching a performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences and extra images to surpass state-of-the-art performance on core computer vision tasks."



Paperid:408
Authors:Yash Kant, Dhruv Batra, Peter Anderson, Alexander Schwing, Devi Parikh, Jiasen Lu, Harsh Agrawal
Title: Spatially Aware Multimodal Transformers for TextVQA
Abstract:
Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer such that each visual entity only looks at neighboring entities defined by a spatial graph. Further, each head in our multi-head self-attention layer focuses on a different subset of relations. Our approach has two advantages: (1) each head considers local context instead of dispersing the attention amongst all visual entities; (2) we avoid learning redundant features. We show that our model improves the absolute accuracy of current state-of-the-art methods on TextVQA by 2.2% overall over an improved baseline, and 4.62% on questions that involve spatial reasoning and can be answered correctly using OCR tokens. Similarly on ST-VQA, we improve the absolute accuracy by 4.2%. We further show that spatially aware self-attention improves visual grounding."
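A minimal numpy sketch of the restriction described above: self-attention whose logits are masked by a spatial adjacency matrix, so each entity attends only to its graph neighbours (a per-head variant would simply use a different neighbour subset per head); the adjacency matrix and shapes here are illustrative.

```python
import numpy as np

def masked_self_attention(x, adj, Wq, Wk, Wv):
    """Single-head self-attention restricted to a spatial graph.

    x:   (N, D) entity features.
    adj: (N, N) boolean adjacency; adj[i, j] == True means entity i
         may attend to entity j.
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = np.where(adj, logits, -1e9)          # mask out non-neighbours
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
N, D = 6, 16
x = rng.standard_normal((N, D))
adj = np.eye(N, dtype=bool) | (rng.random((N, N)) < 0.3)   # keep self-edges
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
print(masked_self_attention(x, adj, Wq, Wk, Wv).shape)     # (6, 16)
```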



Paperid:409
Authors:Cheng-Chun Hsu, Yi-Hsuan Tsai, Yen-Yu Lin, Ming-Hsuan Yang
Title: Every Pixel Matters: Center-aware Feature Alignment for Domain Adaptive Object Detector
Abstract:
A domain adaptive object detector aims to adapt itself to unseen domains that may contain variations of object appearance, viewpoints or backgrounds. Most existing solutions adopt feature alignment either on the image level or instance level. However, image-level alignment on global features may tangle foreground/background pixels at the same time, while instance-level alignment using proposals may suffer from the background noise. Different from existing solutions, we propose a domain adaptation framework that accounts for each pixel, especially via predicting pixel-wise objectness and centerness. Specifically, the proposed method carries out center-aware alignment by paying more attention to foreground pixels, hence achieving better adaptation across domains. We demonstrate our method on numerous adaptation settings with extensive experimental results and show favorable performance against existing state-of-the-art algorithms."



Paperid:410
Authors:Taeyoung Son, Juwon Kang, Namyup Kim, Sunghyun Cho, Suha Kwak
Title: URIE: Universal Image Enhancement for Visual Recognition in the Wild
Abstract:
Despite the great advances in visual recognition, it has been witnessed that recognition models trained on clean images of common datasets are not robust against distorted images in the real world. To tackle this issue, we present a Universal and Recognition-friendly Image Enhancement network, dubbed URIE, which is attached in front of existing recognition models and enhances distorted input to improve their performance without retraining them. URIE is universal in that it aims to handle various factors of image degradation and to be incorporated with any arbitrary recognition models. Also, it is recognition-friendly since it is optimized to improve the robustness of following recognition models, instead of perceptual quality of output image. Our experiments demonstrate that URIE can handle various and latent image distortions and improve the performance of existing models for five diverse recognition tasks when input images are degraded. "



Paperid:411
Authors:Hongwei Yi, Zizhuang Wei, Mingyu Ding, Runze Zhang, Yisong Chen, Guoping Wang, Yu-Wing Tai
Title: Pyramid Multi-view Stereo Net with Self-adaptive View Aggregation
Abstract:
In this paper, we propose an effective and efficient pyramid multi-view stereo (MVS) net with self-adaptive view aggregation for accurate and complete dense point cloud reconstruction. Different from using mean square variance to generate cost volume in previous deep-learning based MVS methods, our VA-MVSNet incorporates the cost variances in different views with small extra memory consumption by introducing two novel self-adaptive view aggregation: pixel-wise view aggregation and voxel-wise view aggregation. To further boost the robustness and completeness of 3D point cloud reconstruction, we extend VA-MVSNet with pyramid multi-scale images input as PVA-MVSNet, where multi-metric constraints are leveraged to aggregate the reliable depth estimation at the coarser scale to fill in the mismatched regions at the finer scale. Experimental results show that our approach establishes a new state-of-the-art on the DTU dataset with significant improvements in the completeness and overall quality, and has strong generalization by achieving a comparable performance as the state-of-the-art methods on the Tanks and Temples benchmark. The codebase is available at \hyperlink{https://github.com/yhw-yhw/D2HC-RMVSNet}{https://github.com/yhw-yhw/D2HC-RMVSNet}."



Paperid:412
Authors:Junbing Li, Changqing Zhang, Pengfei Zhu, Baoyuan Wu, Lei Chen, Qinghua Hu
Title: SPL-MLL: Selecting Predictable Landmarks for Multi-Label Learning
Abstract:
Although significant progress has been achieved, multi-label classification is still challenging due to the complexity of correlations among different labels. Furthermore, modeling the relationships between the input and some (dull) classes further increases the difficulty of accurately predicting all possible labels. In this work, we propose to select a small subset of labels as landmarks which are easy to predict from the input (predictable) and can well recover the other possible labels (representative). Different from existing methods which separate landmark selection and landmark prediction in a 2-step manner, the proposed algorithm, termed Selecting Predictable Landmarks for Multi-Label Learning (SPL-MLL), jointly conducts landmark selection, landmark prediction, and label recovery in a unified framework, to ensure both the representativeness and predictability of the selected landmarks. We employ the Alternating Direction Method (ADM) to solve our problem. Empirical studies on real-world datasets show that our method achieves superior classification performance over other state-of-the-art methods."



Paperid:413
Authors:Yihao Zhao, Ruihai Wu, Hao Dong
Title: Unpaired Image-to-Image Translation using Adversarial Consistency Loss
Abstract:
Unpaired image-to-image translation is a class of vision problems whose goal is to find the mapping between different image domains using unpaired training data. Cycle-consistency loss is a widely used constraint for such problems. However, due to the strict pixel-level constraint, it cannot perform geometric changes, remove large objects, or ignore irrelevant texture. In this paper, we propose a novel adversarial-consistency loss for image-to-image translation. This loss does not require the translated image to be translated back to be a specific source image but can encourage the translated images to retain important features of the source images and overcome the drawbacks of cycle-consistency loss noted above. Our method achieves state-of-the-art results on three challenging tasks: glasses removal, male-to-female translation, and selfie-to-anime translation."



Paperid:414
Authors:Manyuan Zhang, Guanglu Song, Hang Zhou, Yu Liu
Title: Discriminability Distillation in Group Representation Learning
Abstract:
Learning group representations is a common concern in tasks where the basic unit is a group, set, or sequence. Previously, the research community has tried to tackle this by aggregating the elements in a group based on an indicator either defined by humans, such as quality or saliency, or generated by a black box, such as an attention score. This article provides a more essential and explicable view. We claim that the most significant indicator of whether a group representation can benefit from one of its elements is not the quality or an inexplicable score, but the discriminability with respect to the model. We explicitly design the discriminability using embedded class centroids on a proxy set. We show that the discriminability knowledge has good properties: it can be distilled by a light-weight distillation network and generalizes to the unseen target set. The whole procedure is denoted as discriminability distillation learning (DDL). The proposed DDL can be flexibly plugged into many group-based recognition tasks without influencing the original training procedures. Comprehensive experiments on various tasks have proven the effectiveness of DDL in terms of both accuracy and efficiency. Moreover, it pushes forward the state-of-the-art results on these tasks by an impressive margin."



Paperid:415
Authors:Vasileios Choutas, Georgios Pavlakos, Timo Bolkart, Dimitrios Tzionas , Michael J. Black
Title: Monocular Expressive Body Regression through Body-Driven Attention
Abstract:
To understand how people look, interact, or perform tasks, we need to quickly and accurately capture their 3D body, face, and hands together from an RGB image. Most existing methods focus only on parts of the body. A few recent approaches reconstruct full expressive 3D humans from images using 3D body models that include the face and hands. These methods are optimization-based and thus slow, prone to local optima, and require 2D keypoints as input. We address these limitations by introducing ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands, in SMPL-X format, from an RGB image. This is a hard problem due to the high dimensionality of the body and the lack of expressive training data. Additionally, hands and faces are much smaller than the body, occupying very few image pixels. This makes hand and face estimation hard when body images are downscaled for neural networks. We make three main contributions. First, we account for the lack of training data by curating a dataset of SMPL-X fits on in-the-wild images. Second, we observe that body estimation localizes the face and hands reasonably well. We introduce body-driven attention for face and hand regions in the original image to extract higher-resolution crops that are fed to dedicated refinement modules. Third, these modules exploit part-specific knowledge from existing face and hand-only datasets. ExPose estimates expressive 3D humans more accurately than existing optimization methods at a small fraction of the computational cost. Our data, model and code are available for research at https://expose.is.tue.mpg.de."



Paperid:416
Authors:Zongsheng Yue, Qian Zhao, Lei Zhang, Deyu Meng
Title: Dual Adversarial Network: Toward Real-world Noise Removal and Noise Generation
Abstract:
Real-world image noise removal is a long-standing yet very challenging task in computer vision. The success of deep neural networks in denoising has stimulated research on noise generation, which aims at synthesizing more pairs of clean-noisy images to facilitate the training of deep denoisers. In this work, we propose a novel unified framework to simultaneously deal with the noise removal and noise generation tasks. Instead of only inferring the posterior distribution of the latent clean image conditioned on the observed noisy image, as in the traditional MAP framework, it learns the joint distribution of the clean-noisy image pairs. Specifically, we approximate the joint distribution with two different factorized forms, which can be formulated as a denoiser mapping the noisy image to the clean one and a generator mapping the clean image to the noisy one. The learned joint distribution implicitly contains all the information between the noisy and clean images, avoiding the need to manually design image priors and noise assumptions as in traditional methods. Besides, the performance of our denoiser can be further improved by augmenting the original training dataset with the learned generator. Moreover, we propose two metrics to assess the quality of the generated noisy image; to the best of our knowledge, such metrics are proposed for the first time along this research line. Extensive experiments have been conducted to demonstrate the superiority of our method over the state-of-the-art in terms of both real noise removal and noise generation."



Paperid:417
Authors:Tianrui Hui, Si Liu, Shaofei Huang, Guanbin Li, Sansi Yu, Faxi Zhang, Jizhong Han
Title: Linguistic Structure Guided Context Modeling for Referring Image Segmentation
Abstract:
Referring image segmentation aims to predict the foreground mask of the object referred by a natural language sentence. Multimodal context of the sentence is crucial to distinguish the referent from the background. Existing methods either insufficiently or redundantly model the multimodal context. To tackle this problem, we propose a “gather-propagate-distribute” scheme to model multimodal context by crossmodal interaction and implement this scheme as a novel Linguistic Structure guided Context Modeling (LSCM) module. Our LSCM module builds a Dependency Parsing Tree suppressed Word Graph (DPT-WG) which guides all the words to include valid multimodal context of the sentence while excluding disturbing ones through three steps over the multimodal feature, i.e., gathering, constrained propagation and distributing. Extensive experiments on four benchmarks demonstrate that our method outperforms all the previous state-of-the-arts."



Paperid:418
Authors:Tzu-Ming Harry Hsu, Hang Qi, Matthew Brown
Title: Federated Visual Classification with Real-World Data Distribution
Abstract:
Federated Learning enables visual models to be trained on-device, bringing advantages for user privacy (data need never leave the device), but challenges in terms of data diversity and quality. Whilst typical models in the datacenter are trained using data that are independent and identically distributed (IID), data at source are typically far from IID. Furthermore, differing quantities of data are typically available at each device (imbalance). In this work, we characterize the effect these real-world data distributions have on distributed learning, using as a benchmark the standard Federated Averaging (FedAvg) algorithm. To do so, we introduce two new large-scale datasets for species and landmark classification, with realistic per-user data splits that simulate real-world edge learning scenarios. We also develop two new algorithms (FedVC, FedIR) that intelligently resample and reweight over the client pool, bringing large improvements in accuracy and stability in training. The datasets are made available online."



Paperid:419
Authors:Angelo Porrello, Luca Bergamini, Simone Calderara
Title: Robust Re-Identification by Multiple Views Knowledge Distillation
Abstract:
To achieve robustness in Re-Identification, standard methods leverage tracking information in a Video-To-Video fashion. However, these solutions face a large drop in performance for single image queries (e.g., the Image-To-Video setting). Recent works address this severe degradation by transferring temporal information from a Video-based network to an Image-based one. In this work, we devise a training strategy that allows the transfer of superior knowledge arising from a set of views depicting the target object. Our proposal - Views Knowledge Distillation (VKD) - pins this visual variety as a supervision signal within a teacher-student framework, where the teacher educates a student who observes fewer views. As a result, the student outperforms not only its teacher but also the current state-of-the-art in Image-To-Video by a wide margin (6.3% mAP on MARS, 8.6% on Duke-Video-ReId and 5% on VeRi-776). A thorough analysis - on Person, Vehicle and Animal Re-ID - investigates the properties of VKD from both a qualitative and a quantitative perspective. Code is available at https://github.com/aimagelab/VKD."



Paperid:420
Authors:Abdullah Abuolaim, Michael S. Brown
Title: Defocus Deblurring Using Dual-Pixel Data
Abstract:
Defocus blur arises in images that are captured with a shallow depth of field due to the use of a wide aperture. Correcting defocus blur is challenging because the blur is spatially varying and difficult to estimate. We propose an effective defocus deblurring method that exploits data available on dual-pixel (DP) sensors found on most modern cameras. DP sensors are used to assist a camera's auto-focus by capturing two sub-aperture views of the scene in a single image shot. The two sub-aperture images are used to calculate the appropriate lens position to focus on a particular scene region and are discarded afterwards. We introduce a deep neural network (DNN) architecture that uses these discarded sub-aperture images to reduce defocus blur. A key contribution of our effort is a carefully captured dataset of 500 scenes (2000 images) where each scene has: (i) an image with defocus blur captured at a large aperture; (ii) the two associated DP sub-aperture views; and (iii) the corresponding all-in-focus image captured with a small aperture. Our proposed DNN produces results that are significantly better than conventional single image methods in terms of both quantitative and perceptual metrics -- all from data that is already available on the camera but ignored."



Paperid:421
Authors:Tianshu Yu, Yikang Li, Baoxin Li
Title: RhyRNN: Rhythmic RNN for Recognizing Events in Long and Complex Videos
Abstract:
Though many successful approaches have been proposed for recognizing events in short and homogeneous videos, doing so with long and complex videos remains a challenge. One particular reason is that events in long and complex videos can consist of multiple heterogeneous sub-activities (in terms of rhythms, activity variants, composition order, etc.) within quite a long period. This fact brings about two main difficulties: excessive/varying length and complex video dynamic/rhythm. To address this, we propose Rhythmic RNN (RhyRNN) which is capable of handling long video sequences (up to 3,000 frames) as well as capturing rhythms at different scales. We also propose two novel modules: diversity-driven pooling (DivPool) and bilinear reweighting (BR), which consistently and hierarchically abstract higher-level information. We study the behavior of RhyRNN and empirically show that our method works well even when only event-level labels are available in the training stage (compared to algorithms requiring sub-activity labels for recognition), and thus is more practical when the sub-activity labels are missing or difficult to obtain. Extensive experiments on several public datasets demonstrate that, even without fine-tuning the feature backbones, our method can achieve promising performance for long and complex videos that contain multiple sub-activities."



Paperid:422
Authors:Uttaran Bhattacharya, Christian Roncal, Trisha Mittal, Rohan Chandra , Kyra Kapsaskis, Kurt Gray, Aniket Bera, Dinesh Manocha
Title: Take an Emotion Walk: Perceiving Emotions from Gaits Using Hierarchical Attention Pooling and Affective Mapping
Abstract:
We present an autoencoder-based semi-supervised approach to classify perceived human emotions from walking styles obtained from videos or motion-captured data and represented as sequences of 3D poses. Given the motion of each joint in the pose at each time step extracted from 3D pose sequences, we hierarchically pool these joint motions in a bottom-up manner in the encoder, following the kinematic chains in the human body. We also constrain the latent embeddings of the encoder to contain the space of psychologically-motivated affective features underlying the gaits. We train the decoder to reconstruct the motions per joint per time step in a top-down manner from the latent embeddings. For the annotated data, we also train a classifier to map the latent embeddings to emotion labels. Our semi-supervised approach achieves a mean average precision of 0.84 on the Emotion-Gait benchmark dataset, which contains both labeled and unlabeled gaits collected from multiple sources. We outperform current state-of-the-art algorithms for both emotion recognition and action recognition from 3D gaits by 7%-23% in absolute terms. More importantly, we improve the average precision by 10%-50% in absolute terms on classes that each make up less than 25% of the labeled part of the Emotion-Gait benchmark dataset."



Paperid:423
Authors:Liang Liu, Hao Lu, Hongwei Zou, Haipeng Xiong, Zhiguo Cao, Chunhua Shen
Title: Weighing Counts: Sequential Crowd Counting by Reinforcement Learning
Abstract:
We formulate counting as a sequential decision problem and present a novel crowd counting model solvable by deep reinforcement learning. In contrast to existing counting models that directly output count values, we divide one-step estimation into a sequence of much easier and more tractable sub-decision problems. Such a sequential decision process corresponds exactly to a physical process in reality: scale weighing. Inspired by scale weighing, we propose a novel ‘counting scale’ termed LibraNet, where the count value is analogized to a weight. By virtually placing a crowd image on one side of a scale, LibraNet (agent) sequentially learns to place appropriate weights on the other side to match the crowd count. At each step, LibraNet chooses one weight (action) from the weight box (the pre-defined action pool) according to the current crowd image features and weights placed on the scale pan (state). LibraNet is required to learn to balance the scale according to the feedback of the needle (Q values). We show that LibraNet exactly implements scale weighing by visualizing the decision process of how LibraNet chooses actions. Extensive experiments demonstrate the effectiveness of our design choices and report state-of-the-art results on a few crowd counting benchmarks, including ShanghaiTech, UCF_CC_50 and UCF-QNRF. We also demonstrate good cross-dataset generalization of LibraNet. Code and models are made available at https://git.io/libranet"



Paperid:424
Authors:Yunfei Liu, Xingjun Ma, James Bailey, Feng Lu
Title: Reflection Backdoor: A Natural Backdoor Attack on Deep Neural Networks
Abstract:
Recent studies have shown that DNNs can be compromised by backdoor attacks crafted at training time. A backdoor attack installs a backdoor into the victim model by injecting a backdoor pattern into a small set of training examples. The victim model behaves normally on clean test data, yet consistently predicts a specific (likely incorrect) target class whenever the backdoor pattern is present in an input example. While existing backdoor attacks are effective, they are not stealthy. The modifications made to training data or labels are often suspicious and can be easily detected by simple data filtering or human inspection. In this paper, we present a new type of backdoor attack inspired by an important natural phenomenon: reflection. Using mathematical modeling of physical reflection models, we propose reflection backdoor (Refool) to plant reflections as a backdoor in a victim model. We demonstrate on 3 computer vision tasks and 5 datasets that Refool can attack state-of-the-art DNNs with a high success rate, and is more resistant to state-of-the-art backdoor defenses than existing attacks."
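
As a rough illustration of the reflection-as-trigger idea, a minimal sketch is given below: a reflection image is blended into a small random subset of training images. The plain alpha blend, the poison rate, and all function names are illustrative assumptions, not the paper's physically motivated reflection model or its exact poisoning protocol.

    import numpy as np

    def blend_reflection(image, reflection, alpha=0.7):
        # Alpha-blend a reflection layer into a clean image (both float arrays in [0, 1]).
        # The paper uses physically motivated reflection models; a plain blend stands in here.
        return np.clip(alpha * image + (1.0 - alpha) * reflection, 0.0, 1.0)

    def poison_training_set(images, reflection, poison_rate=0.1, seed=0):
        # Blend the reflection trigger into a small random subset of training images.
        # Which samples are chosen and how labels are handled follow the paper's protocol
        # and are not reproduced here.
        rng = np.random.default_rng(seed)
        poisoned = images.copy()
        idx = rng.choice(len(images), size=max(1, int(poison_rate * len(images))), replace=False)
        for i in idx:
            poisoned[i] = blend_reflection(poisoned[i], reflection)
        return poisoned, idx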



Paperid:425
Authors:Yingjun Du, Jun Xu, Huan Xiong, Qiang Qiu, Xiantong Zhen, Cees G. M. Snoek, Ling Shao
Title: Learning to Learn with Variational Information Bottleneck for Domain Generalization
Abstract:
Domain generalization models learn to generalize to previously unseen domains, but suffer from prediction uncertainty and domain shift. In this paper, we address both problems. We introduce a probabilistic meta-learning model for domain generalization, in which classifier parameters shared across domains are modeled as distributions. This enables better handling of prediction uncertainty on unseen domains. To deal with domain shift, we learn domain-invariant representations by the proposed principle of meta variational information bottleneck, which we call MetaVIB. MetaVIB is derived from novel variational bounds of mutual information, by leveraging the meta-learning setting of domain generalization. Through episodic training, MetaVIB learns to gradually narrow domain gaps to establish domain-invariant representations, while simultaneously maximizing prediction accuracy. We conduct experiments on three benchmarks for cross-domain visual recognition. Comprehensive ablation studies validate the benefits of MetaVIB for domain generalization. The comparison results demonstrate that our method consistently outperforms previous approaches."



Paperid:426
Authors:Ruixuan Yu, Xin Wei, Federico Tombari, Jian Sun
Title: Deep Positional and Relational Feature Learning for Rotation-Invariant Point Cloud Analysis
Abstract:
In this paper we propose a rotation-invariant deep network for point cloud analysis. Point-based deep networks are commonly designed to recognize roughly aligned 3D shapes based on point coordinates, but suffer from performance drops with shape rotations. Some geometric features used as network inputs, e.g., distances and angles between points, are rotation-invariant but lose the positional information of points. In this work, we propose a novel deep network for point clouds by incorporating positional information of points as inputs while yielding rotation-invariance. The network is hierarchical and relies on two modules: a positional feature embedding block and a relational feature embedding block. Both modules and the whole network are proven to be rotation-invariant when processing point clouds as input. Experiments show state-of-the-art classification and segmentation performance on benchmark datasets, and ablation studies demonstrate the effectiveness of the network design."



Paperid:427
Authors:Gil Shomron, Ron Banner, Moran Shkolnik, Uri Weiser
Title: Thanks for Nothing: Predicting Zero-Valued Activations with Lightweight Convolutional Neural Networks
Abstract:
Convolutional neural networks (CNNs) achieve state-of-the-art results for various tasks at the price of high computational demands. Inspired by the observation that spatial correlation exists in CNN output feature maps (ofms), we propose a method to dynamically predict whether ofm activations are zero-valued or not according to their neighboring activation values, thereby avoiding zero-valued activations and reducing the number of convolution operations. We implement the zero activation predictor (ZAP) with a lightweight CNN, which imposes negligible overheads and is easy to deploy on existing models. ZAPs are trained by mimicking hidden layer outputs, thereby enabling parallel and label-free training. Furthermore, without retraining, each ZAP can be tuned to a different operating point that trades accuracy for MAC reduction."
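
A minimal sketch of the zero-activation-prediction idea, assuming the predictor sees some cheap estimate of the output feature map and is trained against the layer's own zero pattern; the single-conv topology and all names are ours, not the paper's exact ZAP design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ZeroActivationPredictor(nn.Module):
        # Lightweight CNN guessing which output activations will be zero,
        # from neighbouring activation values in a cheap estimate of the feature map.
        def __init__(self, channels):
            super().__init__()
            self.pred = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, ofm_estimate):
            return self.pred(ofm_estimate)  # logits; > 0 means "predicted zero"

    def zap_training_step(zap, ofm_estimate, true_ofm, optimizer):
        # Label-free training: the target mask is derived from the hidden layer's own output.
        target = (true_ofm == 0).float()
        loss = F.binary_cross_entropy_with_logits(zap(ofm_estimate), target)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        return loss.item()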



Paperid:428
Authors:Zixuan Chen, Zhihui Xie, Junchi Yan Yinqiang Zheng, Xiaokang Yang
Title: Layered Neighborhood Expansion for Incremental Multiple Graph Matching
Abstract:
Graph matching has been a fundamental problem in computer vision and pattern recognition, for its practical flexibility as well as its NP-hardness. Though matching between two graphs and among multiple graphs has been intensively studied in the literature, the online setting of incrementally matching a stream of graphs has rarely been considered. In this paper, we organize the graphs on a super-graph, and propose a novel breadth-first-search-based method for expanding the neighborhood on the super-graph for a newly arriving graph, such that the matching with the new graph can be efficiently performed within the constructed neighborhood. Then a depth-first search is performed to update the overall pairwise matchings. Moreover, we show our approach can also be readily used in the batch-mode setting, by adaptively determining the matching order of the incoming graph batch, still under the neighborhood expansion based incremental matching framework. Extensive experiments on both online and offline matching of graph collections show our approach’s state-of-the-art accuracy and efficiency. The source code will be made publicly available."



Paperid:429
Authors:Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Marc Proesmans, Luc Van Gool
Title: SCAN: Learning to Classify Images without Labels
Abstract:
Can we automatically group images into semantically meaningful clusters when ground-truth annotations are absent? The task of unsupervised image classification remains an important and open challenge in computer vision. Several recent approaches have tried to tackle this problem in an end-to-end fashion. In this paper, we deviate from recent works and advocate a two-step approach where feature learning and clustering are decoupled. First, a self-supervised task from representation learning is employed to obtain semantically meaningful features. Second, we use the obtained features as a prior in a learnable clustering approach. In doing so, we remove the ability for cluster learning to depend on low-level features, which is present in current end-to-end learning approaches. Experimental evaluation shows that we outperform state-of-the-art methods by large margins, in particular +26.6% on CIFAR10, +25.0% on CIFAR100-20 and +21.3% on STL10 in terms of classification accuracy. Furthermore, our method is the first to perform well on a large-scale dataset for image classification. In particular, we obtain promising results on ImageNet, and outperform several semi-supervised learning methods in the low-data regime without the use of any ground-truth annotations."



Paperid:430
Authors:Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, Ondřej Chum, Cordelia Schmid
Title: Graph convolutional networks for learning with few clean and many noisy labels
Abstract:
In this work we consider the problem of learning a classifier from noisy labels when a few clean labeled examples are given. The structure of clean and noisy data is modeled by a graph per class and Graph Convolutional Networks (GCN) are used to predict class relevance of noisy examples. For each class, the GCN is treated as a binary classifier, which learns to discriminate clean from noisy examples using a weighted binary cross-entropy loss function. The GCN-inferred "clean" probability is then exploited as a relevance measure. Each noisy example is weighted by its relevance when learning a classifier for the end task. We evaluate our method on an extended version of a few-shot learning problem, where the few clean examples of novel classes are supplemented with additional noisy data. Experimental results show that our GCN-based cleaning process significantly improves the classification accuracy over not cleaning the noisy data, as well as standard few-shot classification where only the few clean examples are used."
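
The two loss ingredients described above can be sketched as follows. Treating the noisy examples as the negatives of the per-class binary problem and using a mean relevance-weighted cross-entropy are our own assumptions for illustration; the function names are hypothetical.

    import torch
    import torch.nn.functional as F

    def weighted_binary_loss(gcn_logits, is_clean):
        # Balance the few clean examples (positives) against the many noisy ones (negatives).
        num_clean = is_clean.sum().clamp(min=1).float()
        num_noisy = (1 - is_clean).sum().clamp(min=1).float()
        weights = torch.ones_like(gcn_logits)
        weights[is_clean.bool()] = num_noisy / num_clean
        return F.binary_cross_entropy_with_logits(gcn_logits, is_clean.float(), weight=weights)

    def relevance_weighted_ce(class_logits, labels, clean_prob):
        # Down-weight likely-noisy examples by the GCN-inferred "clean" probability
        # when training the classifier for the end task.
        per_example = F.cross_entropy(class_logits, labels, reduction="none")
        return (clean_prob * per_example).mean()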



Paperid:431
Authors:Yuankai Qi, Zizheng Pan, Shengping Zhang, Anton van den Hengel, Qi Wu
Title: Object-and-Action Aware Model for Visual Language Navigation
Abstract:
Vision-and-Language Navigation (VLN) is unique in that it requires turning relatively general natural-language instructions into robot agent actions, on the basis of visible environments. This requires extracting value from two very different types of natural-language information. The first is object descriptions (e.g., `table', `door'), each serving as a cue for the agent to determine the next action by finding the item visible in the environment, and the second is action specifications (e.g., `go straight', `turn left'), which allow the robot to directly predict the next movements without relying on visual perception. However, most existing methods pay little attention to distinguishing these two types of information from each other during instruction encoding, and mix together the matching between textual object/action encodings and the visual perception/orientation features of candidate viewpoints. In this paper, we propose an Object-and-Action Aware Model (OAAM) that processes these two different forms of natural-language-based instruction separately. This enables each process to match object-centered/action-centered instructions to their visual perception/action-orientation counterparts flexibly. However, one side issue caused by the above solution is that an object mentioned in the instructions may be observed in the direction of two or more candidate viewpoints, so the OAAM may not predict the viewpoint on the shortest path as the next action. To handle this problem, we design a simple but effective path loss to penalize trajectories deviating from the ground truth path. Experimental results demonstrate the effectiveness of the proposed model and path loss, and the superiority of their combination with a 50% SPL score on the R2R dataset and a 40% CLS score on the R4R dataset in unseen environments, outperforming the previous state-of-the-art."



Paperid:432
Authors:Kenkun Liu, Rongqi Ding, Zhiming Zou, Le Wang, Wei Tang
Title: A Comprehensive Study of Weight Sharing in Graph Networks for 3D Human Pose Estimation
Abstract:
Graph convolutional networks (GCNs) have been applied to 3D human pose estimation (HPE) from 2D body joint detections and have shown encouraging performance. One limitation of the vanilla graph convolution is that it models the relationships between neighboring nodes via a shared weight matrix. This is suboptimal for articulated body modeling as the relations between different body joints are different. The objective of this paper is to have a comprehensive and systematic study of weight sharing in GCNs for 3D HPE. We first show there are two different ways to interpret a GCN depending on whether feature transformation occurs before or after feature aggregation. These two interpretations lead to five different weight sharing methods, and three more variants can be derived by decoupling the self-connections from other edges. We conduct extensive ablation studies on these weight sharing methods under controlled settings and obtain new conclusions that will benefit the community."



Paperid:433
Authors:Wenbo Li, Xin Tao, Taian Guo, Lu Qi, Jiangbo Lu, Jiaya Jia
Title: MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution
Abstract:
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame. In this process, inter- and intra-frame correlations are the key sources of temporal and spatial information. However, existing VSR methods have a couple of limitations. First, optical flow is often used to establish one-to-one temporal correspondences. But flow estimation itself is error-prone and hence largely affects the ultimate recovery result. Second, similar patterns existing in natural images are rarely exploited for the VSR task. Motivated by these findings, we propose a temporal multi-correspondence aggregation strategy to leverage the most similar patches across frames, and also a cross-scale nonlocal-correspondence aggregation scheme to explore the self-similarity of images across scales. Based on these two novel modules, we build an effective multi-correspondence aggregation network (MuCAN) for VSR. Our method achieves state-of-the-art results on multiple benchmark datasets. Extensive experiments justify the effectiveness of our method."



Paperid:434
Authors:Yifan Liu, Chunhua Shen, Changqian Yu, Jingdong Wang
Title: Efficient Semantic Video Segmentation with Per-frame Inference
Abstract:
For semantic segmentation, most existing real-time deep models trained with each frame independently may produce inconsistent results when tested on a video sequence. A few methods take the correlations in the video sequence into account, e.g., by propagating the results to the neighboring frames using optical flow or extracting frame representations using multi-frame information, which may lead to inaccurate results or unbalanced latency. In contrast, here we explicitly consider the temporal consistency among frames as extra constraints during training and process each frame independently in the inference phase. Thus no computational overhead is introduced at inference. Compact models are employed for real-time execution. To narrow the performance gap between compact models and large models, new temporal knowledge distillation methods are designed. Weighing accuracy, temporal smoothness, and efficiency, our proposed method outperforms previous keyframe-based methods and corresponding baselines which are trained with each frame independently on benchmark datasets including Cityscapes and CamVid."



Paperid:435
Authors:Christoph Kamann, Carsten Rother
Title: Increasing the Robustness of Semantic Segmentation Models with Painting-by-Numbers
Abstract:
For safety-critical applications such as autonomous driving, CNNs have to be robust with respect to unavoidable image corruptions, such as image noise. While previous works addressed the task of robust prediction in the context of full-image classification, we consider it for dense semantic segmentation. We build upon an insight from image classification that output robustness can be improved by increasing the network-bias towards object shapes. We present a new training schema that increases this shape bias. Our basic idea is to alpha-blend a portion of the RGB training images with faked images, where each class-label is given a fixed, randomly chosen color that is not likely to appear in real imagery. This forces the network to rely more strongly on shape cues. We call this data augmentation technique "Painting-by-Numbers". We demonstrate the effectiveness of our training schema for DeepLabv3$+$ with various network backbones, MobileNet-V2, ResNets, and Xception, and evaluate it on the Cityscapes dataset. With respect to our 16 different types of image corruptions and 5 different network backbones, we are better than training with clean data in 74% of the cases. For cases where we are worse than a model trained without our training schema, it is mostly only marginally worse. However, for some image corruptions such as images with noise, we see a considerable performance gain of up to 25%."
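
The augmentation itself is concrete enough to sketch; a minimal version is shown below, assuming the full label map and a fixed random palette are available. Parameter names and the exact blending convention are illustrative.

    import numpy as np

    def painting_by_numbers(image, label_map, palette, alpha):
        # image: HxWx3 floats in [0, 1]; label_map: HxW integer class ids.
        # palette: num_classes x 3 fixed, randomly chosen colors unlikely to appear in real imagery.
        painted = palette[label_map]                      # HxWx3 colored label image
        return (1.0 - alpha) * image + alpha * painted    # alpha-blend the "painted" image in

    # Example: one random palette shared across the whole training run.
    rng = np.random.default_rng(0)
    palette = rng.uniform(0.0, 1.0, size=(19, 3))         # e.g. one color per Cityscapes class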



Paperid:436
Authors:Bing Han, Kaushik Roy
Title: Deep Spiking Neural Network: Energy Efficiency Through Time based Coding
Abstract:
Spiking Neural Networks (SNNs) are promising for enabling low-power event-driven data analytics. The best performing SNNs for image recognition tasks are obtained by converting a trained deep learning Analog Neural Network (ANN) composed of Rectified Linear Unit (ReLU) activations to an SNN consisting of Integrate-and-Fire (IF) neurons with "proper" firing thresholds. However, this has come at the cost of accuracy loss and higher inference latency due to the lack of a notion of time. In this work, we propose an ANN to SNN conversion methodology that uses a time-based coding scheme, named Temporal-Switch-Coding (TSC), and a corresponding TSC spiking neuron model. Each input image pixel is presented using two spikes and the timing between the two spiking instants is proportional to the pixel intensity. The real-valued ReLU activations in the ANN are encoded using the spike-times of the TSC neurons in the converted TSC-SNN. At most two memory accesses and two addition operations are performed for each synapse during the whole inference, which significantly improves the SNN energy efficiency. We demonstrate the proposed TSC-SNN for VGG-16, ResNet-20 and ResNet-34 SNNs on datasets including CIFAR-10 (93.63% top-1), CIFAR-100 (70.97% top-1) and ImageNet (73.46% top-1 accuracy). It surpasses the best inference accuracy of the converted rate-encoded SNN with 7-14.5 times lower inference latency, and 30-60 times fewer addition operations and memory accesses per inference across datasets."
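
The two-spike encoding can be pictured with the short sketch below; the exact onset time and the scaling of the inter-spike gap are assumptions for illustration, not the paper's precise TSC definition.

    import numpy as np

    def temporal_switch_encode(pixels, t_window):
        # pixels: array of intensities in [0, 1]; t_window: number of time steps in the coding window.
        # Each pixel emits two spikes; the gap between them grows with pixel intensity.
        first_spike = np.zeros_like(pixels)
        second_spike = np.round(pixels * (t_window - 1))
        return np.stack([first_spike, second_spike], axis=-1)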



Paperid:437
Authors:Jun Wang, Shiyi Lan, Mingfei Gao, Larry S. Davis
Title: InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic Information Modeling
Abstract:
Real-time 3D object detection is crucial for autonomous cars. Achieving promising performance with high efficiency, voxel-based approaches have received considerable attention. However, previous methods model the input space with features extracted from equally divided sub-regions without considering that point clouds are generally non-uniformly distributed over space. To address this issue, we propose a novel 3D object detection framework with dynamic information modeling. The proposed framework is designed in a coarse-to-fine manner. Coarse predictions are generated in the first stage via a voxel-based region proposal network. We introduce InfoFocus, which improves the coarse detections by adaptively refining features guided by the information of point cloud density. Experiments are conducted on the large-scale nuScenes 3D detection benchmark. Results show that our framework achieves state-of-the-art performance at 31 FPS and improves our baseline significantly by 9.0% mAP on the nuScenes test set."



Paperid:438
Authors:Poojan Oza, Vishal M. Patel
Title: Utilizing Patch-level Category Activation Patterns for Multiple Class Novelty Detection
Abstract:
For any recognition system, the ability to identify novel class samples during inference is an important aspect of the system’s robustness. This problem of detecting novel class samples during inference is commonly referred to as Multiple Class Novelty Detection. In this paper, we propose a novel method that makes deep convolutional neural networks robust to novel classes. Specifically, during training one branch performs traditional classification (referred to as global inference), and the other branch provides patch-level information to keep track of the class-specific activation patterns (referred to as local inference). Both global and local branch information are combined to train a novelty detection network, which is used during inference to identify novel classes. We evaluate the proposed method on four datasets (Caltech256, CUB-200, Stanford Dogs and FounderType-200) and show that the proposed method is able to identify novel class samples better compared to the other deep convolutional neural network based methods."



Paperid:439
Authors:Yifan Wang, Brian L. Curless, Steven M. Seitz
Title: People as Scene Probes
Abstract:
By analyzing the motion of people and other objects in a scene, we demonstrate how to infer depth, occlusion, lighting, and shadow information from video taken from a single camera viewpoint. This information is then used to composite new objects into the same scene with a high degree of automation and realism. In particular, when a user places a new object (2D cut-out) in the image, it is automatically rescaled, relit, occluded properly, and casts realistic shadows in the correct direction relative to the sun that conform properly to scene geometry. We demonstrate results (best viewed in the supplementary video) on a range of scenes and compare to alternative methods for depth estimation and shadow compositing."



Paperid:440
Authors:Lei Yang, Wenxi Liu, Zhiming Cui, Nenglun Chen, Wenping Wang
Title: Mapping in a Cycle: Sinkhorn Regularized Unsupervised Learning for Point Cloud Shapes
Abstract:
We propose an unsupervised learning framework with the pretext task of finding dense correspondences between point cloud shapes from the same category based on the cycle-consistency formulation. In order to learn discriminative pointwise features from point cloud data, we incorporate in the formulation a regularization term based on Sinkhorn normalization to encourage the learned pointwise mappings to be as bijective as possible. In addition, a random rigid transform of the source shape is introduced to form a triplet cycle to improve the model's robustness against perturbations. Comprehensive experiments demonstrate that the pointwise features learned through our framework benefit various point cloud analysis tasks, e.g. partial shape registration and keypoint transfer. We also show that the learned pointwise features can be leveraged by supervised methods to improve the part segmentation performance with either the full training dataset or just a small portion of it."
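
The Sinkhorn-style regularization can be read as repeatedly normalizing the pointwise correspondence matrix over rows and columns, pushing it towards a nearly bijective soft mapping. The log-space form, iteration count, and function name below are illustrative, not the paper's exact formulation.

    import torch

    def sinkhorn_correspondence(log_scores, n_iters=20):
        # log_scores: N x M pointwise feature similarities between two shapes.
        # Alternating row/column normalization pushes the soft mapping towards
        # a doubly stochastic, i.e. nearly bijective, correspondence.
        log_p = log_scores
        for _ in range(n_iters):
            log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)   # rows sum to 1
            log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)   # columns sum to 1
        return log_p.exp()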



Paperid:441
Authors:Matheus Gadelha, Aruni RoyChowdhury, Gopal Sharma, Evangelos Kalogerakis, Liangliang Cao, Erik Learned-Miller, Rui Wang, Subhransu Maji
Title: Label-Efficient Learning on Point Clouds using Approximate Convex Decompositions
Abstract:
The problems of shape classification and part segmentation from 3D point clouds have garnered increasing attention in the last few years. Both of these problems, however, suffer from relatively small training sets, creating the need for statistically efficient methods to learn 3D shape representations. In this paper, we investigate the use of Approximate Convex Decompositions (ACD) as a self-supervisory signal for label-efficient learning of point cloud representations. We show that using ACD to approximate ground truth segmentation provides excellent self-supervision for learning 3D point cloud representations that are highly effective on downstream tasks. We report improvements over the state-of-the-art for unsupervised representation learning on the ModelNet40 shape classification dataset and significant gains in few-shot part segmentation on the ShapeNetPart dataset. Our source code is publicly available."



Paperid:442
Authors:Tiancheng Zhi, Christoph Lassner, Tony Tung, Carsten Stoll, Srinivasa G. Narasimhan, Minh Vo
Title: TexMesh: Reconstructing Detailed Human Texture and Geometry from RGB-D Video
Abstract:
We present TexMesh, a novel approach to reconstruct detailed human meshes with high-resolution full-body texture from RGB-D video. TexMesh enables high quality free-viewpoint rendering of humans. Given the RGB frames, the captured environment map, and the coarse per-frame human mesh from RGB-D tracking, our method reconstructs spatiotemporally consistent and detailed per-frame meshes along with a high-resolution albedo texture. By using the incident illumination we are able to accurately estimate local surface geometry and albedo, which allows us to further use photometric constraints to adapt a synthetically trained model to real-world sequences in a self-supervised manner for detailed surface geometry and high-resolution texture estimation. In practice, we train our models on a short example sequence for self-adaptation and the model runs at interactive framerate afterwards. We validate TexMesh on synthetic and real-world data, and show it outperforms the state of art quantitatively and qualitatively."



Paperid:443
Authors:Mingfei Gao, Zizhao Zhang, Guo Yu, Sercan . Arık, Larry S. Davis, Tomas Pfister
Title: Consistency-based Semi-supervised Active Learning: Towards Minimizing Labeling Cost
Abstract:
Active learning (AL) combines data labeling and model training to minimize the labeling cost by prioritizing the selection of high-value data that can best improve model performance. In pool-based active learning, accessible unlabeled data are not used for model training in most conventional methods. Here, we propose to unify unlabeled sample selection and model training towards minimizing labeling cost, and make two contributions towards that end. First, we exploit both labeled and unlabeled data using semi-supervised learning (SSL) to distill information from unlabeled data during the training stage. Second, we propose a consistency-based sample selection metric that is coherent with the training objective such that the selected samples are effective at improving model performance. We conduct extensive experiments on image classification tasks. The experimental results on CIFAR-10, CIFAR-100 and ImageNet demonstrate the superior performance of our proposed method with limited labeled data, compared to the existing methods and the alternative AL and SSL combinations. Additionally, we study an important yet under-explored problem -- "When can we start learning-based AL selection?". We propose a measure that is empirically correlated with the AL target loss and is potentially useful for determining the proper starting point of learning-based AL methods."
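
One plausible instantiation of a consistency-based selection metric is to score each unlabeled image by how much the model's predictions disagree across augmentations, then query labels for the least consistent samples. The variance-of-softmax measure and the function name below are assumptions, not necessarily the paper's exact metric.

    import torch

    def inconsistency_score(model, augmented_views):
        # augmented_views: K x C x H x W tensor holding K augmentations of one unlabeled image.
        model.eval()
        with torch.no_grad():
            probs = torch.softmax(model(augmented_views), dim=1)   # K x num_classes
        return probs.var(dim=0).sum().item()                       # higher = less consistent

    # Active learning step (usage sketch): query labels for the most inconsistent samples.
    # scores = [inconsistency_score(model, views) for views in unlabeled_pool]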



Paperid:444
Authors:Fangyun Wei, Xiao Sun, Hongyang Li, Jingdong Wang, Stephen Lin
Title: Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation
Abstract:
A recent approach for object detection and human pose estimation is to regress bounding boxes or human keypoints from a central point on the object or person. While this center-point regression is simple and efficient, we argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries, due to object deformation and scale/orientation variation. To facilitate inference, we propose to instead perform regression from a set of points placed at more advantageous positions. This point set is arranged to reflect a good initialization for the given task, such as modes in the training data for pose estimation, which lie closer to the ground truth than the central point and provide more informative features for regression. As the utility of a point set depends on how well its scale, aspect ratio and rotation matches the target, we adopt the anchor box technique of sampling these transformations to generate additional point-set candidates. We apply this proposed framework, called Point-Set Anchors, to object detection, instance segmentation, and human pose estimation. Our results show that this general-purpose approach can achieve performance competitive with state-of-the-art methods for each of these tasks."



Paperid:445
Authors:Cheng Lin, Tingxiang Fan, Wenping Wang, Matthias Nießner
Title: Modeling 3D Shapes by Reinforcement Learning
Abstract:
We explore how to enable machines to model 3D shapes like human modelers using deep reinforcement learning (RL). In 3D modeling software like Maya, a modeler usually creates a mesh model in two steps: (1) approximating the shape using a set of primitives; (2) editing the meshes of the primitives to create detailed geometry. Inspired by such artist-based modeling, we propose a two-step neural framework based on RL to learn 3D modeling policies. By taking actions and collecting rewards in an interactive environment, the agents first learn to parse a target shape into primitives and then to edit the geometry. To effectively train the modeling agents, we introduce a novel training algorithm that combines heuristic policy, imitation learning and reinforcement learning. Our experiments show that the agents can learn good policies to produce regular and structure-aware mesh models, which demonstrates the feasibility and effectiveness of the proposed RL framework."



Paperid:446
Authors:Lida Li, Kun Wang, Shuai Li, Xiangchu Feng, Lei Zhang
Title: LST-Net: Learning a Convolutional Neural Network with a Learnable Sparse Transform
Abstract:
The 2D convolutional (Conv2d) layer is the fundamental element of a deep convolutional neural network (CNN). Despite the great success of CNNs, the conventional Conv2d is still limited in effectively reducing the spatial and channel-wise redundancy of features. In this paper, we propose to mitigate this issue by learning a CNN with a learnable sparse transform (LST), which converts the input features into a more compact and sparser domain so that the spatial and channel-wise redundancy can be more effectively reduced. The proposed LST can be efficiently implemented with existing CNN modules, such as point-wise and depth-wise separable convolutions, and it is portable to existing CNN architectures for seamless training and inference. We further present a hybrid soft thresholding and ReLU (ST-ReLU) activation scheme, making the trained network, namely LST-Net, more robust to image corruptions at the inference stage. Extensive experiments on CIFAR-10/100, ImageNet, ImageNet-C and Places365-Standard datasets validate that the proposed LST-Net can obtain even higher accuracy than its counterpart networks with fewer parameters and less overhead."
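
One possible reading of the hybrid soft thresholding and ReLU (ST-ReLU) activation is sketched below; the threshold value and the exact composition of the two operations are assumptions for illustration, not the paper's definition.

    import torch

    def st_relu(x, tau=0.1):
        # Soft-threshold small responses (promoting sparsity in the transformed domain),
        # then clip negatives as a standard ReLU would.
        soft = torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)
        return torch.clamp(soft, min=0.0)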



Paperid:447
Authors:Damien Teney, Ehsan Abbasnedjad, Anton van den Hengel
Title: Learning What Makes a Difference from Counterfactual Examples and Gradient Supervision
Abstract:
One of the primary challenges limiting the practical application of deep learning is its susceptibility to learning spurious correlations in the data, rather than capturing the data-generating mechanisms of the task of interest. The resulting failure to generalise cannot be addressed by simply using more data from the same distribution. We propose an auxiliary training objective that improves the generalization capabilities of neural networks by leveraging an overlooked supervisory signal found in existing datasets. We demonstrate that pairs of minimally-different examples with different labels, a.k.a. counterfactual examples, provide a signal indicative of the underlying causal structure of the task. We show that such pairs can be identified in a number of existing datasets for a range of tasks in vision (visual question answering, multi-label image classification) and natural language processing (sentiment analysis, natural language inference). We propose a training objective that orients the gradient on a model's decision boundary to align with pairwise relations in the input domain. Models trained with this technique demonstrate improved performance on out-of-distribution test sets."



Paperid:448
Authors:Zetong Yang, Yanan Sun, Shu Liu, Xiaojuan Qi, Jiaya Jia
Title: CN: Channel Normalization For Point Cloud Recognition
Abstract:
In 3D recognition, to fuse multi-scale structure information, existing methods apply hierarchical frameworks stacked by multiple fusion layers for integrating current relative locations with structure information from the previous level. In this paper, we deeply analyze these point recognition frameworks and present a factor, called difference ratio, to measure the influence of structure information among different levels on the final representation. We discover that structure information in deeper layers is overwhelmed by information in shallower layers in generating the final features, which prevents the model from understanding the point cloud in a global view. Inspired by this observation, we propose a novel channel normalization scheme to balance structure information among different layers and avoid excessive accumulation of shallow information, which benefits the model in exploiting and integrating multilayer structure information. We evaluate our channel normalization in several core 3D recognition tasks including classification, segmentation and detection. Experimental results show that our channel normalization further boosts the performance of state-of-the-art methods effectively."



Paperid:449
Authors:Ning Zhang, Junchi Yan
Title: Rethinking the Defocus Blur Detection Problem and A Real-Time Deep DBD Model
Abstract:
Defocus blur detection (DBD) is a classical low-level vision task. It has recently attracted attention, with work focusing on designing complex convolutional neural networks (CNNs) that make full use of both low-level features and high-level semantic information. The heavy networks used in these methods lead to low processing speed, resulting in difficulty applying them to real-time applications. In this work, we propose novel perspectives on the DBD problem and design a convenient approach to build a real-time, cost-effective DBD model. First, we observe that the semantic information does not always relate to, and sometimes misleads, blur detection. We start from the essential characteristics of the DBD problem and propose a data augmentation method accordingly to destroy the semantic information and force the model to learn image-blur-related features rather than semantic features. A novel self-supervision training objective is proposed to enhance the model training consistency and stability. Second, by rethinking the relationship between defocus blur detection and saliency detection, we identify two previously ignored but common scenarios, based on which we design a hard mining strategy to enhance the DBD model. By using the proposed techniques, our model, which uses a slightly modified U-Net as backbone, improves the processing speed by more than 3 times and performs competitively against state-of-the-art methods. An ablation study is also conducted to verify the effectiveness of each part of our proposed methods."



Paperid:450
Authors:Jianchao Zhu, Liangliang Shi, Junchi Yan, Hongyuan Zha
Title: AutoMix: Mixup Networks for Sample Interpolation via Cooperative Barycenter Learning
Abstract:
This paper proposes new ways of sample mixing by thinking of the process as the generation of a barycenter in a metric space for data augmentation. First, we present an optimal-transport-based mixup technique to generate a Wasserstein barycenter, which works well on images with clean backgrounds and is empirically shown to be complementary to existing mixup methods. Then we generalize mixup to an AutoMix technique by using a learnable network to fit the barycenter in a cooperative way between the classifier (a.k.a. discriminator) and generator networks. Experimental results on both multi-class and multi-label prediction tasks show the efficacy of our approach, which is also verified in the presence of unseen categories (open set) and noise."



Paperid:451
Authors:Wenjia Wang, Enze Xie, Xuebo Liu, Wenhai Wang, Ding Liang, Chunhua Shen, Xiang Bai
Title: Scene Text Image Super-resolution in the wild
Abstract:
Low-resolution text images are often seen in natural scenes such as documents captured by mobile phones. Recognizing low-resolution text images is challenging because they lose detailed content information, leading to poor recognition accuracy. An intuitive solution is to introduce super-resolution (SR) techniques as pre-processing. However, previous single image super-resolution (SISR) methods are trained on synthetic low-resolution images (e.g. bicubic down-sampling), which is simple and not suitable for real low-resolution text recognition. To this end, we propose a real scene text SR dataset, termed TextZoom. It contains paired real low-resolution and high-resolution images which are captured by cameras with different focal lengths in the wild. It is more authentic and challenging than synthetic data, as shown in Fig. 1. We argue that improving recognition accuracy is the ultimate goal of scene text SR. To this end, a new Text Super-Resolution Network, termed TSRN, with three novel modules is developed. (1) A sequential residual block is proposed to extract the sequential information of the text images. (2) A boundary-aware loss is designed to sharpen the character boundaries. (3) A central alignment module is proposed to relieve the misalignment problem in TextZoom. Extensive experiments on TextZoom demonstrate that our TSRN largely improves the recognition accuracy of CRNN by over 13%, and of ASTER and MORAN by nearly 9.0%, compared to synthetic SR data. Furthermore, our TSRN clearly outperforms 7 state-of-the-art SR methods in boosting the recognition accuracy of LR images in TextZoom. For example, it outperforms LapSRN by over 5% and 8% on the recognition accuracy of ASTER and CRNN, respectively. Our results suggest that low-resolution text recognition in the wild is far from being solved, and thus more research effort is needed."



Paperid:452
Authors:Omid Poursaeed, Matthew Fisher, Noam Aigerman, Vladimir G. Kim
Title: Coupling Explicit and Implicit Surface Representations for Generative 3D Modeling
Abstract:
We propose a novel neural architecture for representing 3D surfaces, which harnesses two complementary shape representations: (i) an explicit representation via an atlas, i.e., embeddings of 2D domains into 3D; (ii) an implicit-function representation, i.e., a scalar function over the 3D volume, with its level sets denoting surfaces. We make these two representations synergistic by introducing novel consistency losses that ensure that the surface created from the atlas aligns with the level-set of the implicit function. Our hybrid architecture outputs results that are superior to those of the two equivalent single-representation networks, yielding smoother explicit surfaces with more accurate normals, and a more accurate implicit occupancy function. Additionally, our surface reconstruction step can directly leverage the explicit atlas-based representation. This process is computationally efficient, and can be directly used by differentiable rasterizers, enabling training our hybrid representation with image-based losses."



Paperid:453
Authors:Xinqi Zhu, Chang Xu, Dacheng Tao
Title: Learning Disentangled Representations with Latent Variation Predictability
Abstract:
Latent traversal is a popular approach to visualize the disentangled latent representations. Given a bunch of variations in a single unit of the latent representation, it is expected that there is a change in a single factor of variation of the data while others are fixed. However, this impressive experimental observation is rarely explicitly encoded in the objective function of learning disentangled representations. This paper defines the variation predictability of latent disentangled representations. Given image pairs generated by latent codes varying in a single dimension, this varied dimension could be closely correlated with these image pairs if the representation is well disentangled. Within an adversarial generation process, we encourage variation predictability by maximizing the mutual information between latent variations and corresponding image pairs. We further develop an evaluation metric that does not rely on the ground-truth generative factors to measure the disentanglement of latent representations. The proposed variation predictability is a general constraint that is applicable to the VAE and GAN frameworks for boosting disentanglement of latent representations. Experiments show that the proposed variation predictability correlates well with existing ground-truth-required metrics and the proposed algorithm is effective for disentanglement learning."
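
The variation-predictability objective can be pictured as training a small classifier to recover which latent unit was perturbed between two generated images. The perturbation size, channel-wise concatenation, and module names below are illustrative assumptions rather than the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def variation_predictability_loss(generator, dim_classifier, batch_size, latent_dim, delta=1.0):
        # generator: z -> image; dim_classifier: image pair -> logits over latent dimensions.
        z = torch.randn(batch_size, latent_dim)
        dims = torch.randint(0, latent_dim, (batch_size,))
        z_shift = z.clone()
        z_shift[torch.arange(batch_size), dims] += delta      # vary one latent unit per sample
        x1, x2 = generator(z), generator(z_shift)
        logits = dim_classifier(torch.cat([x1, x2], dim=1))   # predict which unit changed
        return F.cross_entropy(logits, dims)                  # low loss = high predictability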



Paperid:454
Authors:Jaeyeon Kang, Younghyun Jo, Seoung Wug Oh, Peter Vajda, Seon Joo Kim
Title: Deep Space-Time Video Upsampling Networks
Abstract:
Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems, and their performance has recently been improving through the incorporation of deep learning. In this paper, we investigate the problem of jointly upsampling videos both in space and time, which is becoming more important with advances in display systems. One solution for this is to run VSR and FI, one by one, independently. This is highly inefficient as heavy deep neural networks (DNN) are involved in each solution. To this end, we propose an end-to-end DNN framework for space-time video upsampling by efficiently merging VSR and FI into a joint framework. In our framework, a novel weighting scheme is proposed to fuse input frames effectively without explicit motion compensation for efficient processing of videos. Our framework produces better results both quantitatively and qualitatively, while reducing the computation time (x7 faster) and the number of parameters (30%) compared to baselines. Our source code is available at https://github.com/JaeYeonKang/STVUN-Pytorch."



Paperid:455
Authors:Shuo Wang, Jun Yue, Jianzhuang Liu, Qi Tian, Meng Wang
Title: Large-Scale Few-Shot Learning via Multi-Modal Knowledge Discovery
Abstract:
Large-scale few-shot learning aims at identifying hundreds of novel object categories where each category has only a few samples. It is a challenging problem since (1) the identification process is susceptible to over-fitting with limited samples per object, and (2) the sample imbalance between a base (known knowledge) category and a novel category easily biases the recognition results. To solve these problems, we propose a method based on multi-modal knowledge discovery. First, we use visual knowledge to help the feature extractors focus on different visual parts. Second, we design a classifier to learn the distribution over all categories. In the second stage, we develop three schemes to minimize the prediction error and balance the training procedure: (1) Hard labels are used to provide precise supervision. (2) Semantic textual knowledge is utilized as weak supervision to find the potential relations between the novel and the base categories. (3) An imbalance control based on the data distribution is presented to alleviate the recognition bias towards the base categories. We apply our method to three benchmark datasets, and it achieves state-of-the-art performance in all the experiments."



Paperid:456
Authors:Yu Li, Zhuoran Shen, Ying Shan
Title: Fast Video Object Segmentation using the Global Context Module
Abstract:
We developed a real-time, high-quality semi-supervised video object segmentation algorithm. Its accuracy is on par with the most accurate, time-consuming online-learning model, while its speed is similar to the fastest template-matching method with sub-optimal accuracy. The core component of the model is a novel global context module that effectively summarizes and propagates information through the entire video. Compared to previous approaches that only use one frame or a few frames to guide the segmentation of the current frame, the global context module uses all past frames. Unlike the previous state-of-the-art space-time memory network that caches a memory at each spatio-temporal position, the global context module uses a fixed-size feature representation. Therefore, it uses constant memory regardless of the video length and costs substantially less memory and computation. With the novel module, our model achieves top performance on standard benchmarks at a real-time speed."
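
One way to realize a fixed-size summary of all past frames is a linear-attention-style accumulator, sketched below under our own assumptions; the exact key/value construction and normalization of the paper's global context module may differ.

    import torch

    class GlobalContext:
        # Accumulates sum_i value_i^T key_i, a C_v x C_k matrix whose size is
        # independent of the video length; the current frame reads it with one matmul.
        def __init__(self, c_value, c_key):
            self.state = torch.zeros(c_value, c_key)

        def update(self, keys, values):
            # keys: N x C_k, values: N x C_v flattened features of a past frame (and its mask).
            self.state = self.state + values.t() @ keys

        def read(self, queries):
            # queries: M x C_k features of the current frame -> M x C_v context features.
            return queries @ self.state.t()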



Paperid:457
Authors:Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid
Title: Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
Abstract:
Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind. A major contributing factor has been the prohibitive cost of annotating videos frame-by-frame. In this paper, we present a spatio-temporal action recognition model that is trained with only video-level labels, which are significantly easier to annotate. Our method leverages per-frame person detectors which have been trained on large image datasets within a Multiple Instance Learning framework. We show how we can apply our method in cases where the standard Multiple Instance Learning assumption, that each bag contains at least one instance with the specified label, is invalid using a novel probabilistic variant of MIL where we estimate the uncertainty of each prediction. Furthermore, we report the first weakly-supervised results on the AVA dataset and state-of-the-art results among weakly-supervised methods on UCF101-24."



Paperid:458
Authors:Nikita Dvornik, Cordelia Schmid, Julien Mairal
Title: Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification
Abstract:
Popular approaches for few-shot classification consist of first learning a generic data representation based on a large annotated dataset, before adapting the representation to new classes given only a few labeled samples. In this work, we propose a new strategy based on feature selection, which is both simpler and more effective than previous feature adaptation approaches. First, we obtain a multi-domain representation by training a set of semantically different feature extractors. Then, given a few-shot learning task, we use our multi-domain feature bank to automatically select the most relevant representations. We show that a simple non-parametric classifier built on top of such features produces high accuracy and generalizes to domains never seen during training, which leads to state-of-the-art results on MetaDataset and improved accuracy on mini-ImageNet."



Paperid:459
Authors:Zhongang Cai, Junzhe Zhang, Daxuan Ren, Cunjun Yu, Haiyu Zhao, Shuai Yi, Chai Kiat Yeo, Chen Change Loy
Title: MessyTable: Instance Association in Multiple Camera Views
Abstract:
We present an interesting and challenging dataset that features a large number of scenes with messy tables captured from multiple camera views. Each scene in this dataset is highly complex, containing multiple object instances that could be identical, stacked and occluded by other instances. The key challenge is to associate all instances given the RGB images of all views. The seemingly simple task surprisingly fails many popular methods or heuristics that we would assume to perform well at object association. The dataset challenges existing methods in mining subtle appearance differences, reasoning based on contexts, and fusing appearance with geometric cues for establishing an association. We report interesting findings with some popular baselines, and discuss how this dataset could help inspire new problems and catalyse more robust formulations to tackle real-world instance association problems. (Project page: https://caizhongang.github.io/projects/MessyTable/)"



Paperid:460
Authors:Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei Zhou, Dahua Lin
Title: A Unified Framework for Shot Type Classification Based on Subject Centric Lens
Abstract:
In film making, the shot has a profound influence on how the story is delivered and how it resonates with the audience. As different scale and movement types of shots can express different emotions and contents, recognizing shots and their attributes is important to the understanding of movies as well as general videos. Classifying shot type is challenging due to the additional information required beyond the video content, such as the spatial composition of a frame and the video camera movement. To address these issues, we propose a learning framework, Subject Guidance Network (SGNet), for shot type recognition. SGNet separates the subject and background of a shot into two streams, serving as maps to guide scale and movement type classification, respectively. To facilitate shot type analysis and model evaluations, we build a large-scale dataset, MovieShots, which contains 46K shots from 7K movie trailers with annotations of their scale and movement types. Experiments show that our framework is able to recognize these two shot attributes accurately, outperforming all the previous methods."



Paperid:461
Authors:Samuel Albanie, Gül Varol, Liliane Momeni, Triantafyllos Afouras, Joon Son Chung, Neil Fox, Andrew Zisserman
Title: BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues
Abstract:
Recent progress in fine-grained gesture and action classification, and machine translation, point to the possibility of automated sign language recognition becoming a reality. A key stumbling block in making progress towards this goal is a lack of appropriate training data, stemming from the high complexity of sign annotation and a limited supply of qualified annotators. In this work, we introduce a new scalable approach to data collection for sign recognition in continuous videos. We make use of weakly-aligned subtitles for broadcast footage together with a keyword spotting method to automatically localise sign-instances for a vocabulary of 1,000 signs in 1,000 hours of video. We make the following contributions: (1) We show how to use mouthing cues from signers to obtain high-quality annotations from video data - the result is the BSL-1K dataset, a collection of British Sign Language (BSL) signs of unprecedented scale; (2) We show that we can use BSL-1K to train strong sign recognition models for co-articulated signs in BSL and that these models additionally form excellent pretraining for other sign languages and benchmarks---we exceed the state of the art on both the MSASL and WLASL benchmarks. Finally, (3) we propose new large-scale evaluation sets for the tasks of sign recognition and sign spotting and provide baselines which we hope will serve to stimulate research in this area."



Paperid:462
Authors:Neng Qian, Jiayi Wang, Franziska Mueller, Florian Bernard, Vladislav Golyanik, Christian Theobalt
Title: HTML: A Parametric Hand Texture Model for 3D Hand Reconstruction and Personalization
Abstract:
3D hand reconstruction from images is a widely-studied problem in computer vision and graphics, and has a particularly high relevance for virtual and augmented reality. Although several 3D hand reconstruction approaches leverage hand models as a strong prior to resolve ambiguities and achieve more robust results, most existing models account only for the hand shape and poses and do not model the texture. To fill this gap, in this work we present HTML, the first parametric texture model of human hands. Our model spans several dimensions of hand appearance variability (e.g. related to gender, ethnicity, or age) and only requires a commodity camera for data acquisition. Experimentally, we demonstrate that our appearance model can be used to tackle a range of challenging problems such as 3D hand reconstruction from a single monocular image. Furthermore, our appearance model can be used to define a neural rendering layer that enables training with a self-supervised photometric loss. We make our model publicly available."



Paperid:463
Authors:Zhongdao Wang, Jingwei Zhang, Liang Zheng, Yixuan Liu, Yifan Sun, Yali Li, Shengjin Wang
Title: CycAs: Self-supervised Cycle Association for Learning Re-identifiable Descriptions
Abstract:
This paper proposes a self-supervised learning method for the person re-identification (re-ID) problem, where existing unsupervised methods usually rely on pseudo labels, such as those from video tracklets or clustering. A potential drawback of using pseudo labels is that errors may accumulate and it is challenging to estimate the number of pseudo IDs. We introduce a different unsupervised method that allows us to learn pedestrian embeddings from raw videos, without resorting to pseudo labels. The goal is to construct a self-supervised pretext task that matches the person re-ID objective. Inspired by the data association concept in multi-object tracking, we propose the Cycle Association (CycAs) task: after performing data association between a pair of video frames forward and then backward, a pedestrian instance is supposed to be associated to itself. To fulfill this goal, the model must learn a meaningful representation that can well describe correspondences between instances in frame pairs. We adapt the discrete association process to a differentiable form, such that end-to-end training becomes feasible. Experiments are conducted in two aspects: We first compare our method with existing unsupervised re-ID methods on seven benchmarks and demonstrate CycAs' superiority. Then, to further validate the practical value of CycAs in real-world applications, we perform training on self-collected videos and report promising performance on the standard test sets."
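
A compact sketch of the cycle-association idea: associate instances forward and then backward with soft assignment matrices and require the round trip to be close to the identity. The softmax relaxation, temperature, and squared-error penalty below are illustrative; the paper's differentiable relaxation may differ in detail.

    import torch

    def cycle_association_loss(emb1, emb2, temperature=0.1):
        # emb1: N x D, emb2: M x D L2-normalized instance embeddings from a frame pair.
        sim = emb1 @ emb2.t() / temperature
        forward = torch.softmax(sim, dim=1)        # frame 1 -> frame 2 soft assignment
        backward = torch.softmax(sim.t(), dim=1)   # frame 2 -> frame 1 soft assignment
        round_trip = forward @ backward            # N x N; should be close to identity
        identity = torch.eye(emb1.size(0), device=emb1.device)
        return ((round_trip - identity) ** 2).mean()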



Paperid:464
Authors:Xihui Liu, Zhe Lin, Jianming Zhang, Handong Zhao, Quan Tran, Xiaogang Wang, Hongsheng Li
Title: Open-Edit: Open-Domain Image Manipulation with Open-Vocabulary Instructions
Abstract:
We propose a novel algorithm, named Open-Edit, which is the first attempt on open-domain image manipulation with open-vocabulary instructions. It is a challenging task considering the large variation of image domains and the lack of training supervision. Our approach takes advantage of the unified visual-semantic embedding space pretrained on a general image-caption dataset, and manipulates the embedded visual features by applying text-guided vector arithmetic on the image feature maps. A structure-preserving image decoder then generates the manipulated images from the manipulated feature maps. We further propose an on-the-fly sample-specific optimization approach with cycle-consistency constraints to regularize the manipulated images and force them to preserve details of the source images. Our approach shows promising results in manipulating open-vocabulary color, texture, and high-level attributes for various scenarios of open-domain images."



Paperid:465
Authors:Zhongdao Wang, Liang Zheng, Yixuan Liu, Yali Li, Shengjin Wang
Title: Towards Real-Time Multi-Object Tracking
Abstract:
Modern multiple object tracking (MOT) systems usually follow the tracking-by-detection paradigm. It has 1) a detection model for target localization and 2) an appearance embedding model for data association. Having the two models separately executed might lead to efficiency problems, as the running time is simply a sum of the two steps without investigating potential structures that can be shared between them. Existing research efforts on real-time MOT usually focus on the association step, so they are essentially real-time association methods but not real-time MOT systems. In this paper, we propose an MOT system that allows target detection and appearance embedding to be learned in a shared model. Specifically, we incorporate the appearance embedding model into a single-shot detector, such that the model can simultaneously output detections and the corresponding embeddings. We further propose a simple and fast association method that works in conjunction with the joint model. In both components the computation cost is significantly reduced compared with former MOT systems, resulting in a neat and fast baseline for future follow-ups on real-time MOT algorithm design. To our knowledge, this work reports the first (near) real-time MOT system, with a running speed of 22 to 40 FPS depending on the input resolution. Meanwhile, its tracking accuracy is comparable to the state-of-the-art trackers embodying separate detection and embedding (SDE) learning ($64.4\%$ MOTA vs. $66.1\%$ MOTA on the MOT-16 challenge). Code and models will be available at \url{https://github.com/Zhongdao/Towards-Realtime-MOT}."



Paperid:466
Authors:Jian Liang, Yunbo Wang, Dapeng Hu, Ran He, Jiashi Feng
Title: A Balanced and Uncertainty-aware Approach for Partial Domain Adaptation
Abstract:
This work addresses the unsupervised domain adaptation problem, especially in the case of class labels in the target domain being only a subset of those in the source domain. Such a partial transfer setting is realistic but challenging and existing methods often suffer from two key problems: negative transfer and uncertainty propagation. In this paper, we build on domain adversarial learning and propose a novel domain adaptation method BA$^3$US with two new techniques termed Balanced Adversarial Alignment (BAA) and Adaptive Uncertainty Suppression (AUS), respectively. On one hand, negative transfer results in misclassification of target samples to the classes only present in the source domain. To address this issue, BAA pursues the balance between label distributions across domains in a fairly simple manner. Specifically, it randomly leverages a few source samples to augment the smaller target domain during domain alignment so that classes in different domains are symmetric. On the other hand, a source sample would be denoted as uncertain if there is an incorrect class that has a relatively high prediction score, and such uncertainty easily propagates to unlabeled target data around it during alignment, which severely deteriorates adaptation performance. Thus we present AUS that emphasizes uncertain samples and exploits an adaptive weighted complement entropy objective to encourage incorrect classes to have uniform and low prediction scores. Experimental results on multiple benchmarks demonstrate that our BA$^3$US surpasses state-of-the-art methods on partial domain adaptation tasks. Code is available at \url{https://github.com/tim-learn/BA3US}."
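
The abstract does not spell out the adaptive weighted complement entropy objective; purely as a hedged illustration of the general idea, the sketch below renormalizes the probabilities of the incorrect classes and rewards a uniform, low-confidence distribution over them, weighted per sample. The function name, weighting scheme, and normalization are placeholders, not the authors' exact formulation.

import math
import torch
import torch.nn.functional as F

def weighted_complement_entropy(logits, labels, sample_weights):
    # logits: (B, C) predictions for labeled source samples; labels: (B,); sample_weights: (B,)
    probs = F.softmax(logits, dim=1)
    p_true = probs.gather(1, labels.unsqueeze(1))                    # probability of the true class
    mask = torch.ones_like(probs).scatter_(1, labels.unsqueeze(1), 0.0)
    comp = probs * mask / (1.0 - p_true + 1e-8)                      # renormalized wrong-class distribution
    entropy = -(comp * torch.log(comp + 1e-8)).sum(dim=1)            # high when wrong classes are uniform
    entropy = entropy / math.log(probs.size(1) - 1)                  # normalize entropy to [0, 1]
    return -(sample_weights * entropy).mean()                        # minimizing this flattens wrong-class scores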



Paperid:467
Authors:Yang Li, Shichao Kan, Zhihai He
Title: Unsupervised Deep Metric Learning with Transformed Attention Consistency and Contrastive Clustering Loss
Abstract:
Existing approaches for unsupervised metric learning focus on exploring self-supervision information within the input image itself. We observe that, when analyzing images, human eyes often compare images against each other instead of examining images individually. In addition, they often pay attention to certain keypoints, image regions, or objects which are discriminative between image classes but highly consistent within classes. Even if the image is being transformed, the attention pattern will be consistent. Motivated by this observation, we develop a new approach to unsupervised deep metric learning where the network is learned based on self-supervision information across images instead of within one single image. To characterize the consistent pattern of human attention during image comparisons, we introduce the idea of transformed attention consistency. It assumes that visually similar images, even undergoing different image transforms, should share the same consistent visual attention map. This consistency leads to a pairwise self-supervision loss, allowing us to learn a Siamese deep neural network to encode and compare images against their transformed or matched pairs. To further enhance the inter-class discriminative power of the feature generated by this network, we adapt the concept of triplet loss from supervised metric learning to our unsupervised case and introduce the contrastive clustering loss. Our extensive experimental results on benchmark datasets demonstrate that our proposed method outperforms current state-of-the-art methods for unsupervised metric learning by a large margin."



Paperid:468
Authors:Ali Athar, Sabarinath Mahadevan, Aljosa Osep, Laura Leal-Taixé, Bastian Leibe
Title: STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
Abstract:
Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often non-end-to-end trainable and highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks."



Paperid:469
Authors:Jingwei Xu, Huazhe Xu, Bingbing Ni, Xiaokang Yang, Xiaolong Wang, Trevor Darrell
Title: Hierarchical Style-based Networks for Motion Synthesis
Abstract:
Generating diverse and natural behaviors is one of the long-standing goals for creating intelligent characters in the animated world. In this paper, we propose an unsupervised method for generating long-range, diverse and plausible behaviors to achieve a specific goal location. Our proposed method learns to model the motion of humans by decomposing a long-range generation task in a hierarchical manner. Given the starting and ending states, a memory bank is used to retrieve motion references as source material for short-range clip generation. We first propose to explicitly disentangle the provided motion material into style and content counterparts via bi-linear transformation modelling, where diverse synthesis is achieved by free-form combination of these two components. The short-range clips are then connected to form a long-range motion sequence. Without ground truth annotation, we propose a parameterized bi-directional interpolation scheme to guarantee the physical validity and visual naturalness of generated results. On a large-scale skeleton dataset, we show that the proposed method is able to synthesise long-range, diverse and plausible motion, which is also generalizable to unseen motion data during testing. Moreover, we demonstrate the generated sequences are useful as subgoals for actual physical execution in the animated world. Please refer to our project page (https://sites.google.com/view/hsnms/home) for more synthesised results."



Paperid:470
Authors:Benjamin Biggs, Oliver Boyne, James Charles, Andrew Fitzgibbon, Roberto Cipolla
Title: Who Left the Dogs Out? 3D Animal Reconstruction with Expectation Maximization in the Loop
Abstract:
We introduce an automatic, end-to-end method for recovering the 3D pose and shape of dogs from monocular internet images. The large variation in shape between dog breeds, and significant occlusion and low quality of internet images makes this a challenging problem. We learn a richer prior over shapes than previous work, which helps regularize parameter estimation. We demonstrate results on the Stanford Dog Dataset, an ``in-the-wild'' dataset of 20,580 dog images for which we have collected 2D joint and silhouette annotations to split for training and evaluation. In order to capture the large shape variety of dogs, we show that the natural variation in the 2D dataset is enough to learn a detailed 3D prior through expectation maximisation (EM). As a by-product of training, we generate a new parameterised model (including limb scaling) SMBLD which we release alongside the annotation dataset to the research community."



Paperid:471
Authors:Vishwanath A. Sindagi, Rajeev Yasarla, Deepak Sam Babu, R. Venkatesh Babu, Vishal M. Patel
Title: Learning to Count in the Crowd from Limited Labeled Data
Abstract:
Recent crowd counting approaches have achieved excellent performance. However, they are essentially based on the fully supervised paradigm and require a large number of annotated samples. Obtaining annotations is an expensive and labour-intensive process. In this work, we focus on reducing the annotation efforts by learning to count in the crowd from a limited number of labeled samples while leveraging a large pool of unlabeled data. Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data, which is then used as supervision for training the network. The proposed method is shown to be effective under the reduced data (semi-supervised) settings for several datasets like ShanghaiTech, UCF-QNRF, WorldExpo, UCSD, etc. Furthermore, we demonstrate that the proposed method can be leveraged to enable the network in learning to count from a synthetic dataset while being able to generalize better to real-world datasets (synthetic-to-real transfer).



Paperid:472
Authors:Hongyuan Du, Linjun Li, Bo Liu, Nuno Vasconcelos
Title: SPOT: Selective Point Cloud Voting for Better Proposal in Point Cloud Object Detection
Abstract:
The sparsity of point clouds limits deep learning models on capturing long-range dependencies, which makes features extracted by the models ambiguous. In point cloud object detection, ambiguous features make it hard for detectors to locate object centers and finally lead to bad detection results. In this work, we propose the Selective Point clOud voTing (SPOT) module, a simple yet effective component that can be easily trained end-to-end in point cloud object detectors to solve this problem. Inspired by probabilistic Hough voting, SPOT incorporates an attention mechanism that helps detectors focus on less ambiguous features and preserves their diversity of mapping to multiple object centers. For evaluating our module, we implement SPOT on advanced baseline detectors and test on two benchmark datasets of cluttered indoor scenes, ScanNet and SUN RGB-D. Baselines enhanced by our module consistently improve their results by a large margin and achieve new state-of-the-art detection performance, especially under the stricter evaluation metric that adopts a larger IoU threshold, implying that our module is key to high-quality object detection in point clouds."



Paperid:473
Authors:Jonathan R. Williford, Brandon B. May, Jeffrey Byrne
Title: Explainable Face Recognition
Abstract:
Explainable face recognition (XFR) is the problem of explaining the matches returned by a facial matcher, in order to provide insight into why a probe was matched with one identity over another. In this paper, we provide the first comprehensive benchmark and baseline evaluation for XFR. We define a new evaluation protocol called the ``inpainting game'', which is a curated set of 3648 triplets (probe, mate, nonmate) of 95 subjects, which differ by synthetically inpainting a chosen facial characteristic like the nose, eyebrows or mouth creating an inpainted nonmate. An XFR algorithm is tasked with generating a network attention map which best explains which regions in a probe image match with a mated image, and not with an inpainted nonmate for each triplet. This provides ground truth for quantifying what image regions contribute to face matching. Finally, we provide a comprehensive benchmark on this dataset comparing five state-of-the-art XFR algorithms on three facial matchers. This benchmark includes two new algorithms called subtree EBP and Density-based Input Sampling for Explanation (DISE) which outperform the state-of-the-art XFR by a wide margin."



Paperid:474
Authors:Hieu Le, Dimitris Samaras
Title: From Shadow Segmentation to Shadow Removal
Abstract:
The requirement for paired shadow and shadow-free images limits the size and diversity of shadow removal datasets and hinders the possibility of training large-scale, robust shadow removal algorithms. We propose a shadow removal method that can be trained using only shadow and non-shadow patches cropped from the shadow images themselves. Our method is trained via an adversarial framework, following a physical model of shadow formation. Our central contribution is a set of physics-based constraints that enables this adversarial training. Our method achieves competitive shadow removal results compared to state-of-the-art methods that are trained with fully paired shadow and shadow-free images. The advantages of our training regime are even more pronounced in shadow removal for videos. Our method can be fine-tuned on a testing video with only the shadow masks generated by a pre-trained shadow detector and outperforms state-of-the-art methods on this challenging test. We illustrate the advantages of our method on our proposed video shadow removal dataset. "



Paperid:475
Authors:Seong Hyeon Park, Gyubok Lee, Jimin Seo, Manoj Bhat, Minseok Kang, Jonathan Francis, Ashwin Jadhav, Paul Pu Liang, Louis-Philippe Morency
Title: Diverse and Admissible Trajectory Prediction through Multimodal Context Understanding
Abstract:
Multi-agent trajectory forecasting in autonomous driving requires an agent to accurately anticipate the behaviors of the surrounding vehicles and pedestrians, for safe and reliable decision-making. Due to partial observability in these dynamical scenes, directly obtaining the posterior distribution over future agent trajectories remains a challenging problem. In realistic embodied environments, each agent's future trajectories should be both diverse since multiple plausible sequences of actions can be used to reach its intended goals, and admissible since they must obey physical constraints and stay in drivable areas. In this paper, we propose a model that synthesizes multiple input signals from the multimodal world (the environment's scene context and interactions between multiple surrounding agents) to best model all diverse and admissible trajectories. We compare our model with strong baselines and ablations across two public datasets and show a significant performance improvement over previous state-of-the-art methods. Lastly, we offer new metrics incorporating admissibility criteria to further study and evaluate the diversity of predictions. Codes are at: https://github.com/kami93/CMU-DATF"



Paperid:476
Authors:Marek Kowalski, Stephan J. Garbin, Virginia Estellers, Tadas Baltrušaitis, Matthew Johnson, Jamie Shotton
Title: CONFIG: Controllable Neural Face Image Generation
Abstract:
Our ability to sample realistic natural images, particularly faces, has advanced by leaps and bounds in recent years, yet our ability to exert fine-tuned control over the generative process has lagged behind. If this new technology is to find practical uses, we need to achieve a level of control over generative networks which, without sacrificing realism, is on par with that seen in computer graphics and character animation. To this end we propose ConfigNet, a neural face model that allows for controlling individual aspects of output images in semantically meaningful ways and that is a significant step on the path towards finely-controllable neural rendering. ConfigNet is trained on real face images as well as synthetic face renders. Our novel method uses synthetic data to factorize the latent space into elements that correspond to the inputs of a traditional rendering pipeline, separating aspects such as head pose, facial expression, hair style, illumination, and many others which are very hard to annotate in real data. The real images, which are presented to the network without labels, extend the variety of the generated images and encourage realism. Finally, we propose an evaluation criterion using an attribute detection network combined with a user study and demonstrate state-of-the-art individual control over attributes in the output images."



Paperid:477
Authors:Rui Zhu, Xingyi Yang, Yannick Hold-Geoffroy, Federico Perazzi, Jonathan Eisenmann, Kalyan Sunkavalli, Manmohan Chandraker
Title: Single View Metrology in the Wild
Abstract:
Most 3D reconstruction methods may only recover scene properties up to a global scale ambiguity. We present a novel approach to single view metrology that can recover the absolute scale of a scene represented by 3D heights of objects or camera height above the ground as well as camera parameters of orientation and field of view, using just a monocular image acquired in unconstrained conditions. Our method relies on data-driven priors learned by a deep network specifically designed to imbibe weakly supervised constraints from the interplay of the unknown camera with 3D entities such as object heights, through estimation of bounding box projections. We leverage categorical priors for objects such as humans or cars that commonly occur in natural images, as references for scale estimation. We demonstrate state-of-the-art qualitative and quantitative results on several datasets as well as applications including virtual object insertion. Furthermore, the perceptual quality of our outputs is validated by a user study."



Paperid:478
Authors:Chien-Yi Chang, De-An Huang, Danfei Xu, Ehsan Adeli, Li Fei-Fei, Juan Carlos Niebles
Title: Procedure Planning in Instructional Videos
Abstract:
In this paper, we study the problem of procedure planning in instructional videos, which can be seen as the first step towards enabling autonomous agents to plan for complex tasks in everyday settings such as cooking. Given the current visual observation of the world and a visual goal, we ask the question ``What actions need to be taken in order to achieve the goal?''. The key technical challenge is how to learn structured and plannable state and action spaces directly from unstructured real videos. We address this challenge by proposing Dual Dynamics Networks (DDN), a framework that explicitly leverages the structured priors imposed by the conjugate relationships between states and actions in a learned plannable latent space. We evaluate our method on real-world instructional videos. Our experiments show that DDN learns plannable representations that lead to better planning performance compared to existing planning approaches and neural network policies."



Paperid:479
Authors:Ningning Ma, Xiangyu Zhang, Jian Sun
Title: Funnel Activation for Visual Recognition
Abstract:
We present a conceptually simple but effective funnel activation for image recognition tasks, called Funnel activation (FReLU), that extends ReLU and PReLU to a 2D activation by adding a spatial condition with negligible overhead. The forms of ReLU and PReLU are y = max(x, 0) and y = max(x, px), respectively, while FReLU is in the form of y = max(x,T(x)), where T(·) is the 2D spatial condition. Moreover, the spatial condition achieves a pixel-wise modeling capacity in a simple way, capturing complicated visual layouts with regular convolutions. We conduct experiments on ImageNet, COCO detection, and semantic segmentation tasks, showing great improvements and robustness of FReLU in visual recognition tasks."
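
Because the abstract states the activation explicitly as y = max(x, T(x)), it can be sketched in a few lines. The choice of a depthwise 3x3 convolution followed by batch normalization for T(x) is an assumption of this sketch rather than a detail quoted from the abstract.

import torch
import torch.nn as nn

class FReLU(nn.Module):
    # Funnel activation: y = max(x, T(x)), with T a per-channel 2D spatial condition.
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # T(x): assumed here to be a depthwise convolution followed by batch normalization
        self.spatial = nn.Conv2d(channels, channels, kernel_size,
                                 padding=kernel_size // 2, groups=channels, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        return torch.max(x, self.bn(self.spatial(x)))

# Example: FReLU(64)(torch.randn(2, 64, 32, 32)) returns a tensor of the same shape.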



Paperid:480
Authors:Shuyang Gu, Jianmin Bao, Dong Chen, Fang Wen
Title: GIQA: Generated Image Quality Assessment
Abstract:
Generative adversarial networks (GANs) have achieved impressive results today, but not all generated images are perfect. A number of quantitative criteria have recently emerged for generative models, but none of them are designed for a single generated image. In this paper, we propose a new research topic, Generated Image Quality Assessment (GIQA), which quantitatively evaluates the quality of each generated image. We introduce three GIQA algorithms from two perspectives: learning-based and data-based. We evaluate a number of images generated by various recent GAN models on different datasets and demonstrate that they are consistent with human assessments. Furthermore, GIQA is applicable to many tasks, such as separately evaluating the realism and diversity of generative models, and enabling online hard negative mining (OHEM) in the training of GANs to improve the results."



Paperid:481
Authors:Sayna Ebrahimi, Franziska Meier, Roberto Calandra, Trevor Darrell, Marcus Rohrbach
Title: Adversarial Continual Learning
Abstract:
Continual learning aims to learn new tasks without forgetting previously learned ones. We hypothesize that representations learned to solve each task in a sequence have a shared structure while containing some task-specific properties. We show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns a disjoint representation for task-invariant and task-specific features required to solve a sequence of tasks. Our model combines architecture growth to prevent forgetting of task-specific skills and an experience replay approach to preserve shared skills. We demonstrate our hybrid approach is effective in avoiding forgetting and show it is superior to both architecture-based and memory-based approaches on class incrementally learning of a single dataset as well as a sequence of multiple datasets in image classification. Our code is available at https://github.com/facebookresearch/Adversarial-Continual-Learning



Paperid:482
Authors:Peng Su, Kun Wang, Xingyu Zeng, Shixiang Tang, Dapeng Chen, Di Qiu , Xiaogang Wang
Title: Adapting Object Detectors with Conditional Domain Normalization
Abstract:
Real-world object detectors are often challenged by the domain gaps between different datasets. In this work, we present the Conditional Domain Normalization (CDN) to bridge the domain distribution gap. CDN is designed to encode different domain inputs into a shared latent space, where the features from different domains carry the same domain attribute. To achieve this, we first disentangle the domain-specific attribute out of the semantic features from source domain via a domain embedding module, which learns a domain-vector to characterize the domain attribute information. Then this domain-vector is used to encode the features from target domain through a conditional normalization, resulting in different domains' features carrying the same domain attribute. We incorporate CDN into various convolution stages of an object detector to adaptively address the domain shifts of representations at different levels. In contrast to existing adaptation works that conduct domain confusion learning on semantic features to remove domain-specific factors, CDN aligns different domain distributions by modulating the semantic features of target domains conditioned on the learned domain-vector of the source domain. Extensive experiments show that CDN outperforms existing methods remarkably on both real-to-real and synthetic-to-real adaptation benchmarks, including 2D image detection and 3D point cloud detection."
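
The abstract describes a conditional normalization whose affine parameters are generated from a learned domain-vector. A minimal sketch of such a layer is given below; the embedding size, the use of instance normalization, and the residual scale form are assumptions of this sketch, not the authors' exact design.

import torch
import torch.nn as nn

class ConditionalDomainNorm(nn.Module):
    # Normalize features, then modulate them with a scale and shift predicted from a domain vector.
    def __init__(self, channels, domain_dim=64):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_gamma = nn.Linear(domain_dim, channels)   # scale predicted from the domain embedding
        self.to_beta = nn.Linear(domain_dim, channels)    # shift predicted from the domain embedding

    def forward(self, x, domain_vec):
        # x: (B, C, H, W) features; domain_vec: (B, domain_dim) learned domain embedding
        gamma = self.to_gamma(domain_vec).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(domain_vec).unsqueeze(-1).unsqueeze(-1)
        return self.norm(x) * (1 + gamma) + beta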



Paperid:483
Authors:Tianjiao Li, Jun Liu, Wei Zhang, Lingyu Duan
Title: HARD-Net: Hardness-AwaRe Discrimination Network for 3D Early Activity Prediction
Abstract:
Predicting the class label from the partially observed activity sequence is a very hard task, as the observed early segments of different activities can be very similar. In this paper, we propose a novel Hardness-AwaRe Discrimination Network (HARD-Net) to specifically investigate the relationships between the similar activity pairs that are hard to be discriminated. Specifically, a Hard Instance-Interference Class (HI-IC) bank is designed, which dynamically records the hard similar pairs. Based on the HI-IC bank, a novel adversarial learning scheme is proposed to train our HARD-Net, which thus grants our network with the strong capability in mining subtle discrimination information for 3D early activity prediction. We evaluate our proposed HARD-Net on two public activity datasets and achieve state-of-the-art performance."



Paperid:484
Authors:Lokender Tiwari, Pan Ji, Quoc-Huy Tran, Bingbing Zhuang, Saket Anand , Manmohan Chandraker
Title: Pseudo RGB-D for Self-Improving Monocular SLAM and Depth Prediction
Abstract:
Classical monocular Simultaneous Localization And Mapping (SLAM) and the recently emerging convolutional neural networks (CNNs) for monocular depth prediction represent two largely disjoint approaches towards building a 3D map of the surrounding environment. In this paper, we demonstrate that the coupling of these two by leveraging the strengths of each mitigates the other's shortcomings. Specifically, we propose a joint narrow and wide baseline based self-improving framework, where on the one hand the CNN-predicted depth is leveraged to perform pseudo RGB-D feature-based SLAM, leading to better accuracy and robustness than the monocular RGB SLAM baseline. On the other hand, the bundle-adjusted 3D scene structures and camera poses from the more principled geometric SLAM are injected back into the depth network through novel wide baseline losses proposed for improving the depth prediction network, which then continues to contribute towards better pose and 3D structure estimation in the next iteration. We emphasize that our framework only requires unlabeled monocular videos in both training and inference stages, and yet is able to outperform state-of-the-art self-supervised monocular and stereo depth prediction networks (e.g., Monodepth2) and feature based monocular SLAM system (i.e., ORB-SLAM). Extensive experiments on KITTI and TUM RGB-D datasets verify the superiority of our self-improving geometry-CNN framework."



Paperid:485
Authors:Shengcai Liao, Ling Shao
Title: Interpretable and Generalizable Person Re-Identification with Query-Adaptive Convolution and Temporal Lifting
Abstract:
For person re-identification, existing deep networks often focus on representation learning. However, without transfer learning, the learned model is fixed as is, which is not adaptable for handling various unseen scenarios. In this paper, beyond representation learning, we consider how to formulate person image matching directly in deep feature maps. We treat image matching as finding local correspondences in feature maps, and construct query-adaptive convolution kernels on the fly to achieve local matching. In this way, the matching process and results are interpretable, and this explicit matching is more generalizable than representation features to unseen scenarios, such as unknown misalignments, pose or viewpoint changes. To facilitate end-to-end training of this architecture, we further build a class memory module to cache feature maps of the most recent samples of each class, so as to compute image matching losses for metric learning. Through direct cross-dataset evaluation, the proposed Query-Adaptive Convolution (QAConv) method gains large improvements over popular learning methods (about 10%+ mAP), and achieves comparable results to many transfer learning methods. Besides, a model-free temporal cooccurrence based score weighting method called TLift is proposed, which improves the performance to a further extent, achieving state-of-the-art results in cross-dataset person re-identification. Code is available at https://github.com/ShengcaiLiao/QAConv"
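
To make the query-adaptive matching idea above more concrete, the rough sketch below turns every spatial location of a query feature map into a 1x1 convolution kernel, correlates it with the gallery feature map, and keeps the best local response per kernel. The actual method learns the score aggregation and uses a class memory for training; the simple mean used here is a placeholder assumption.

import torch
import torch.nn.functional as F

def qaconv_similarity(query_feat, gallery_feat):
    # query_feat, gallery_feat: (C, H, W) L2-normalized feature maps of two images
    C, H, W = query_feat.shape
    kernels = query_feat.permute(1, 2, 0).reshape(H * W, C, 1, 1)    # one 1x1 kernel per query location
    response = F.conv2d(gallery_feat.unsqueeze(0), kernels)          # (1, H*W, H, W) local correlations
    best = response.flatten(2).max(dim=2).values                     # best gallery match per query location
    return best.mean()                                               # placeholder aggregation into one score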



Paperid:486
Authors:Tongyao Pang, Yuhui Quan, Hui Ji
Title: Self-supervised Bayesian Deep Learning for Image Recovery with Applications to Compressive Sensing
Abstract:
In recent years, deep learning has emerged as a promising technique for solving many ill-posed inverse problems in image recovery, and most deep-learning-based solutions are based on supervised learning. Motivated by the practical value of reducing the cost and complexity of constructing labeled training datasets, this paper proposes a self-supervised deep learning approach for image recovery, which is dataset-free. Built upon a Bayesian deep network, the proposed method trains a network with random weights that predicts the target image for recovery with uncertainty. Such uncertainty enables the prediction of the target image with small mean squared error by averaging multiple predictions. The proposed method is applied for image reconstruction in compressive sensing (CS), i.e., reconstructing an image from few measurements. Experiments show that the proposed dataset-free deep learning method not only significantly outperforms traditional non-learning methods, but also is very competitive to the state-of-the-art supervised deep learning methods, especially when the measurements are few and noisy."
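
The abstract's key inference step, averaging multiple uncertain predictions to reduce mean squared error, can be sketched generically as follows. The use of dropout-style stochastic layers as the source of randomness and the sample count are assumptions of this sketch, not the paper's exact Bayesian formulation.

import torch

def average_predictions(model, measurement, num_samples=50):
    # model: a network with stochastic components (e.g., dropout) kept active at inference
    # measurement: the observed compressive-sensing measurements for one image
    model.train()  # keep stochastic layers active so each forward pass differs
    with torch.no_grad():
        samples = torch.stack([model(measurement) for _ in range(num_samples)])
    # return the averaged reconstruction and a per-pixel uncertainty estimate
    return samples.mean(dim=0), samples.var(dim=0)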



Paperid:487
Authors:Jian Wang, Xiang Long, Yuan Gao, Errui Ding, Shilei Wen
Title: Graph-PCNN: Two Stage Human Pose Estimation with Graph Pose Refinement
Abstract:
Recently, most of the state-of-the-art human pose estimation methods are based on heatmap regression. The final coordinates of keypoints are obtained by decoding the heatmap directly. In this paper, we aim to find a better approach to get more accurate localization results. We mainly put forward two suggestions for improvement: 1) different features and methods should be applied for rough and accurate localization, 2) relationship between keypoints should be considered. Specifically, we propose a two-stage graph-based and model-agnostic framework, called Graph-PCNN, with a localization subnet and a graph pose refinement module added onto the original heatmap regression network. In the first stage, a heatmap regression network is applied to obtain a rough localization result, and a set of proposal keypoints, called guided points, are sampled. In the second stage, for each guided point, a different visual feature is extracted by the localization subnet. The relationship between guided points is explored by the graph pose refinement module to get more accurate localization results. Experiments show that Graph-PCNN can be used in various backbones to boost the performance by a large margin. Without bells and whistles, our best model can achieve a new state-of-the-art 76.8% AP on COCO test-dev split.



Paperid:488
Authors:Minchul Shin
Title: Semi-supervised Learning with a Teacher-student Network for Generalized Attribute Prediction
Abstract:
This paper presents a study on semi-supervised learning to solve the visual attribute prediction problem. In many applications of vision algorithms, the precise recognition of visual attributes of objects is important but still challenging. This is because defining a class hierarchy of attributes is ambiguous, so training data inevitably suffer from class imbalance and label sparsity, leading to a lack of effective annotations. An intuitive solution is to find a method to effectively learn image representations by utilizing unlabeled images. With that in mind, we propose a multi-teacher-single-student (MTSS) approach inspired by the multi-task learning and the distillation of semi-supervised learning. Our MTSS learns task-specific domain experts called teacher networks using the label embedding technique and learns a unified model called a student network by forcing a model to mimic the distributions learned by domain experts. Our experiments demonstrate that our method not only achieves competitive performance on various benchmarks for fashion attribute prediction, but also improves robustness and cross-domain adaptability for unseen domains."



Paperid:489
Authors:Fang Zhao, Shengcai Liao, Guo-Sen Xie, Jian Zhao, Kaihao Zhang, Ling Shao
Title: Unsupervised Domain Adaptation with Noise Resistible Mutual-Training for Person Re-identification
Abstract:
Unsupervised domain adaptation (UDA) in the task of person re-identification (re-ID) is highly challenging due to large domain divergence and no class overlap between domains. Pseudo-label based self-training is one of the representative techniques to address UDA. However, label noise caused by unsupervised clustering is always a problem for self-training methods. To suppress noise in pseudo-labels, this paper proposes a Noise Resistible Mutual-Training (NRMT) method, which maintains two networks during training to perform collaborative clustering and mutual instance selection. On one hand, collaborative clustering eases the fitting to noisy instances by allowing the two networks to use pseudo-labels provided by each other as an additional supervision. On the other hand, mutual instance selection further selects reliable and informative instances for training according to the peer-confidence and relationship disagreement of the networks. Extensive experiments demonstrate that the proposed method outperforms the state-of-the-art UDA methods for person re-ID."



Paperid:490
Authors:Dahlia Urbach, Yizhak Ben-Shabat, Michael Lindenbaum
Title: DPDist: Comparing Point Clouds Using Deep Point Cloud Distance
Abstract:
We introduce a new deep learning method for point cloud comparison. Our approach, named Deep Point Cloud Distance (DPDist), measures the distance between the points in one cloud and the estimated surface from which the other point cloud is sampled. The surface is estimated locally and efficiently using the 3D modified Fisher vector representation. The local representation reduces the complexity of the surface, enabling efficient and effective learning, which generalizes well between object categories. We test the proposed distance in challenging tasks, such as similar object comparison and registration, and show that it provides significant improvements over commonly used distances such as Chamfer distance, Earth mover's distance, and others. "



Paperid:491
Authors:Xiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, Gang Zeng
Title: Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation
Abstract:
Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images for providing a geometric counterpart to the RGB representation. Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion to obtain better feature representations to achieve more accurate segmentation. This, however, may not lead to satisfactory results as actual depth data are generally noisy, which might worsen the accuracy as the networks go deeper. In this paper, we propose a unified and efficient Cross-modality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternatively. The key of the proposed architecture is a novel Separation-and-Aggregation Gating operation that jointly filters and recalibrates both representations before cross-modality aggregation. Meanwhile, a Bi-direction Multi-step Propagation strategy is introduced, on the one hand, to help to propagate and fuse information between the two modalities, and on the other hand, to preserve their specificity along the long-term propagation process. Besides, our proposed encoder can be easily injected into the previous encoder-decoder structures to boost their performance on RGB-D semantic segmentation. Our model consistently outperforms state-of-the-art methods on both challenging indoor and outdoor datasets. Code of this work is available at https://charlescxk.github.io/."



Paperid:492
Authors:Zhijian Liu, Zhanghao Wu, Chuang Gan, Ligeng Zhu, Song Han
Title: DataMix: Efficient Privacy-Preserving Edge-Cloud Inference
Abstract:
Deep neural networks are widely deployed on edge devices (e.g., for computer vision and speech recognition). Users either perform the inference locally (i.e., edge-based) or send the data to the cloud and run inference remotely (i.e., cloud-based). However, both solutions have their limitations: edge devices are heavily constrained by insufficient hardware resources and cannot afford to run large models; cloud servers, if not trustworthy, will raise serious privacy issues. In this paper, we mediate between the resource-constrained edge devices and the privacy-invasive cloud servers by introducing a novel privacy-preserving edge-cloud inference framework, DataMix. We off-load the majority of the computations to the cloud and leverage a pair of mixing and de-mixing operations, inspired by mixup, to protect the privacy of the data transmitted to the cloud. Our framework has three advantages. First, it is privacy-preserving as the mixing cannot be inverted without the user's private mixing coefficients. Second, our framework is accuracy-preserving because it takes advantage of the space spanned by images, and we train the model in a mixing-aware manner to maintain accuracy. Third, our solution is efficient on the edge since the majority of the workload is delegated to the cloud, and our mixing and de-mixing processes introduce very few extra computations. Also, our framework introduces small communication overhead and maintains high hardware utilization on the cloud. Extensive experiments on multiple computer vision and speech recognition datasets demonstrate that our framework can greatly reduce the local computations on the edge (to fewer than 20% of FLOPs) with negligible loss of accuracy and no leakages of private information."
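
As a very rough, hedged illustration of the mixing/de-mixing idea, the sketch below mixes a batch of inputs with a private invertible coefficient matrix on the edge and undoes the mixing on the cloud's outputs. The matrix form of the coefficients is an assumption for illustration; de-mixing is exact only when the cloud computation is linear, which is why, per the abstract, the model is trained in a mixing-aware manner.

import torch

def mix(batch, coeffs):
    # batch: (N, ...) private inputs; coeffs: (N, N) invertible mixing matrix kept on the edge.
    # The cloud only ever sees the mixed combinations.
    flat = batch.reshape(batch.size(0), -1)
    return (coeffs @ flat).reshape_as(batch)

def demix(cloud_outputs, coeffs):
    # Undo the mixing on the cloud's outputs with the private coefficients.
    # Exact only for a linear cloud function; otherwise an approximation.
    flat = cloud_outputs.reshape(cloud_outputs.size(0), -1)
    return torch.linalg.solve(coeffs, flat).reshape_as(cloud_outputs)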



Paperid:493
Authors:Kripasindhu Sarkar, Dushyant Mehta, Weipeng Xu, Vladislav Golyanik, Christian Theobalt
Title: Neural Re-Rendering of Humans from a Single Image
Abstract:
Human re-rendering from a single image is a starkly under-constrained problem and state-of-the-art algorithms often exhibit undesired artefacts, such as oversmoothing, unrealistic distortions of the body parts and garments, or implausible changes of the texture. To address these challenges, we propose a new method for neural re-rendering of a human under a novel user-defined pose and viewpoint given one input image. Our algorithm represents body pose and shape as a parametric mesh which can be reconstructed from a single image and easily reposed. Instead of a color-based UV texture-map, our approach further employs a learned high-dimensional UV feature-map to encode appearance. This rich implicit representation captures detailed appearance variation across poses, viewpoints, person identities and clothing styles better than learned color texture maps. The body model with the rendered feature-maps is fed through a neural image-translation network that creates the final rendered color image. The above components are combined in an end-to-end-trained neural network architecture that takes as input a source person image, and images of the parametric body model in the source pose and desired target pose. Experimental evaluation demonstrates that our approach produces higher quality single-image re-rendering results than existing methods."



Paperid:494
Authors:Filippo Aleotti, Fabio Tosi, Li Zhang, Matteo Poggi, Stefano Mattoccia
Title: Reversing the cycle: self-supervised deep stereo through enhanced monocular distillation
Abstract:
In many fields, self-supervised learning solutions are rapidly evolving and closing the gap with supervised approaches. This is the case for depth estimation, based on either monocular or stereo images, with the latter often providing a valid source of self-supervision for the former. In contrast, to soften typical stereo artefacts, we propose a novel self-supervised paradigm reversing the link between the two. Purposely, in order to train deep stereo networks, we distill knowledge through a monocular completion network. This architecture exploits single-image clues and few sparse points, sourced by traditional stereo algorithms, to estimate dense yet accurate disparity maps by means of a consensus mechanism over multiple estimations. We thoroughly evaluate with popular stereo datasets the impact of different supervisory signals showing how stereo networks trained with our paradigm outperform existing self-supervised frameworks. Finally, our proposal achieves notable generalization capabilities dealing with domain shift issues. Code available at https://github.com/FilippoAleotti/Reversing."



Paperid:495
Authors:Jinjin Gu, Haoming Cai, Haoyu Chen, Xiaoxing Ye, Jimmy S. Ren, Chao Dong
Title: PIPAL: a Large-Scale Image Quality Assessment Dataset for Perceptual Image Restoration
Abstract:
Image quality assessment (IQA) is the key factor for the fast development of image restoration (IR) algorithms. The most recent IR methods based on Generative Adversarial Networks (GANs) have achieved significant improvement in visual performance, but also presented great challenges for quantitative evaluation. Notably, we observe an increasing inconsistency between perceptual quality and the evaluation results. Then we raise two questions: (1) Can existing IQA methods objectively evaluate recent IR algorithms? (2) When focusing on beating current benchmarks, are we getting better IR algorithms? To answer these questions and promote the development of IQA methods, we contribute a large-scale IQA dataset, called Perceptual Image Processing Algorithms (PIPAL) dataset. In particular, this dataset includes the results of GAN-based methods, which are missing in previous datasets. We collect more than 1.13 million human judgments to assign subjective scores for PIPAL images using the more reliable ``Elo system''. Based on PIPAL, we present new benchmarks for both IQA and super-resolution methods. Our results indicate that existing IQA methods cannot fairly evaluate GAN-based IR algorithms. While using appropriate evaluation methods is important, IQA methods should also be updated along with the development of IR algorithms. Finally, we improve the performance of IQA networks on GAN-based distortions by introducing anti-aliasing pooling. Experiments show the effectiveness of the proposed method."
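
The ``Elo system'' mentioned above for turning pairwise human judgments into subjective scores is, in its standard form, a simple rating update; the sketch below is only that standard update, with the usual chess-style K-factor and scale as placeholder values rather than numbers from the paper.

def elo_update(rating_a, rating_b, a_wins, k=32.0):
    # Update two images' quality ratings after one pairwise human judgment.
    # a_wins: 1.0 if image A was preferred, 0.0 if image B was, 0.5 for a tie.
    expected_a = 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
    rating_a += k * (a_wins - expected_a)
    rating_b += k * ((1.0 - a_wins) - (1.0 - expected_a))
    return rating_a, rating_b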



Paperid:496
Authors:Bryan A. Plummer, Mariya I. Vasileva, Vitali Petsiuk, Kate Saenko, David Forsyth
Title: Why do These Match? Explaining the Behavior of Image Similarity Models
Abstract:
Explaining a deep learning model can help users understand its behavior and allow researchers to discern its shortcomings. Recent work has primarily focused on explaining models for tasks like image classification or visual question answering. In this paper, we introduce Salient Attributes for Network Explanation (SANE) to explain image similarity models, where a model's output is a score measuring the similarity of two inputs rather than a classification score. In this task, an explanation depends on both of the input images, so standard methods do not apply. Our SANE explanations pair a saliency map identifying important image regions with an attribute that best explains the match. We find that our explanations provide additional information not typically captured by saliency maps alone, and can also improve performance on the classic task of attribute recognition. Our approach's ability to generalize is demonstrated on two datasets from diverse domains, Polyvore Outfits and Animals with Attributes 2."



Paperid:497
Authors:Xuanhong Chen, Bingbing Ni, Naiyuan Liu, Ziang Liu, Yiliu Jiang, Loc Truong, Qi Tian
Title: CooGAN: A Memory-Efficient Framework for High-Resolution Facial Attribute Editing
Abstract:
In contrast to the great success of memory-consuming face editing methods at low resolution, manipulating high-resolution (HR) facial images, i.e., typically larger than $768^2$ pixels, with very limited memory is still challenging. This is due to 1) the intractably huge demand for memory and 2) inefficient multi-scale feature fusion. To address these issues, we propose a novel pixel translation framework called Cooperative GAN (CooGAN) for HR facial image editing. This framework features a local path for fine-grained local facial patch generation (i.e., patch-level HR, low memory) and a global path for global low-resolution (LR) facial structure monitoring (i.e., image-level LR, low memory), which largely reduces memory requirements. Both paths work in a cooperative manner under a local-to-global consistency objective (i.e., for smooth stitching). In addition, we propose a lighter selective transfer unit for more efficient multi-scale feature fusion, yielding higher-fidelity facial attribute manipulation. Extensive experiments on CelebA-HQ well demonstrate the memory efficiency as well as the high image generation quality of the proposed framework. Keywords: Generative Adversarial Networks, Conditional GANs, Face Attributes Editing"



Paperid:498
Authors:Ben Saunders, Necati Cihan Camgoz, Richard Bowden
Title: Progressive Transformers for End-to-End Sign Language Production
Abstract:
The goal of automatic Sign Language Production (SLP) is to translate spoken language to a continuous stream of sign language video at a level comparable to a human translator. If this was achievable, then it would revolutionise Deaf hearing communications. Previous work on predominantly isolated SLP has shown the need for architectures that are better suited to the continuous domain of full sign sequences. In this paper, we propose Progressive Transformers, the first SLP model to translate from discrete spoken language sentences to continuous 3D sign pose sequences in an end-to-end manner. A novel counter decoding technique is introduced, that enables continuous sequence generation at training and inference. We present two model configurations, an end-to-end network that produces sign direct from text and a stacked network that utilises a gloss intermediary. We also provide several data augmentation processes to overcome the problem of drift and drastically improve the performance of SLP models. We propose a back translation evaluation mechanism for SLP, presenting benchmark quantitative results on the challenging RWTH-PHOENIX-Weather-2014T (PHOENIX14T) dataset and setting baselines for future research. Code available at https://github.com/BenSaunders27/ProgressiveTransformersSLP."



Paperid:499
Authors:Minghui Liao, Guan Pang, Jing Huang, Tal Hassner, Xiang Bai
Title: Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting
Abstract:
Recent end-to-end trainable methods for scene text spotting, integrating detection and recognition, showed much progress. However, most of the current arbitrary-shape scene text spotters use region proposal networks (RPN) to produce proposals. RPN relies heavily on manually designed anchors and its proposals are represented with axis-aligned rectangles. The former presents difficulties in handling text instances of extreme aspect ratios or irregular shapes, and the latter often includes multiple neighboring instances into a single proposal, in cases of densely oriented text. To tackle these problems, we propose Mask TextSpotter v3, an end-to-end trainable scene text spotter that adopts a Segmentation Proposal Network (SPN) instead of an RPN. Our SPN is anchor-free and gives accurate representations of arbitrary-shape proposals. It is therefore superior to RPN in detecting text instances of extreme aspect ratios or irregular shapes. Furthermore, the accurate proposals produced by SPN allow masked RoI features to be used for decoupling neighboring text instances. As a result, our Mask TextSpotter v3 can handle text instances of extreme aspect ratios or irregular shapes, and its recognition accuracy won't be affected by nearby text or background noise. Specifically, we outperform state-of-the-art methods by 21.9 percent on the Rotated ICDAR 2013 dataset (rotation robustness), 5.9 percent on the Total-Text dataset (shape robustness), and achieve state-of-the-art performance on the MSRA-TD500 dataset (aspect ratio robustness). Code is available at: https://github.com/MhLiao/MaskTextSpotterV3 "



Paperid:500
Authors:Daniel Barath, Michal Polic, Wolfgang Förstner, Torsten Sattler, Tomas Pajdla, Zuzana Kukelova
Title: Making Affine Correspondences Work in Camera Geometry Computation
Abstract:
Local features such as SIFT and its affine and learned variants provide region-to-region rather than point-to-point correspondences. It has recently been exploited to create new minimal solvers for classical problems such as homography, essential and fundamental matrix estimation. The main argument in favor of such solvers is that their minimal sample size is smaller, e.g., only two instead of four matches are required to estimate a homography. Works proposing such solvers often claim a significant improvement in run-time thanks to fewer RANSAC iterations. We show that this argument is not valid in practice if the solvers are used naively as noise in the local feature geometries often causes RANSAC to fail to find an accurate solution. To overcome this, we propose guidelines for effective use of region-to-region matches in the course of a full model estimation pipeline. We propose a method for refining the local feature geometries by symmetric intensity-based matching, combine uncertainty propagation inside RANSAC with preemptive model verification, show a general scheme for computing uncertainty of minimal solvers results, and adapt the sample cheirality check for homography estimation to region-to-region correspondences. Our experiments show that affine solvers can achieve accuracy comparable to point-based solvers at faster run-times when following our guidelines. "



Paperid:501
Authors:Jiankang Deng, Jia Guo, Tongliang Liu, Mingming Gong, Stefanos Zafeiriou
Title: Sub-center ArcFace: Boosting Face Recognition by Large-scale Noisy Web Faces
Abstract:
Margin-based deep face recognition methods (e.g. SphereFace, CosFace, and ArcFace) have achieved remarkable success in unconstrained face recognition. However, these methods are susceptible to the massive label noise in the training data and thus require laborious human effort to clean the datasets. In this paper, we relax the intra-class constraint of ArcFace to improve the robustness to label noise. More specifically, we design $K$ sub-centers for each class and the training sample only needs to be close to any of the $K$ positive sub-centers instead of the only one positive center. The proposed sub-center ArcFace encourages one dominant sub-class that contains the majority of clean faces and non-dominant sub-classes that include hard or noisy faces. Extensive experiments confirm the robustness of sub-center ArcFace under massive real-world noise. After the model achieves enough discriminative power, we directly drop non-dominant sub-centers and high-confident noisy samples, which helps recapture intra-compactness, decrease the influence from noise, and achieve comparable performance compared to ArcFace trained on the manually cleaned dataset. By taking advantage of the large-scale raw web faces (Celeb500K), sub-center Arcface achieves state-of-the-art performance on IJB-B, IJB-C, MegaFace, and FRVT."
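
The sub-center relaxation described above can be summarized in a short sketch: compute cosine similarities to K sub-centers per class, keep the closest sub-center, then apply the usual angular margin to the target class. The embedding dimension, margin, and scale values below are common ArcFace-style defaults used for illustration, not values quoted from the abstract.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SubCenterArcFace(nn.Module):
    # ArcFace-style logits with K sub-centers per class, max-pooled over sub-centers.
    def __init__(self, feat_dim, num_classes, k=3, margin=0.5, scale=64.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, k, feat_dim))
        self.margin, self.scale = margin, scale

    def forward(self, features, labels):
        # cosine similarity to every sub-center, then keep the best sub-center per class
        cos = torch.einsum('bd,ckd->bck',
                           F.normalize(features, dim=1),
                           F.normalize(self.weight, dim=2))
        cos = cos.max(dim=2).values                              # (B, num_classes)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cos)
        return self.scale * logits                               # feed into cross-entropy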



Paperid:502
Authors:Chuang Gan, Deng Huang, Peihao Chen, Joshua B. Tenenbaum, Antonio Torralba
Title: Foley Music: Learning to Generate Music from Videos
Abstract:
In this paper, we introduce Foley Music, a system that can synthesize plausible music for a silent video clip about people playing musical instruments. We first identify two key intermediate representations for a successful video to music generator: body keypoints from videos and MIDI events from audio recordings. We then formulate music generation from videos as a motion-to-midi translation problem. We present a Graph-Transformer framework that can accurately predict MIDI event sequences in accordance with the body movements; The MIDI event can then be converted to realistic music using an off-the-shelf music synthesizer tool. We demonstrate the effectiveness of our models on videos containing a variety of music performances. Experimental results show that our model outperforms several existing systems in generating music that is pleasant to listen to. More importantly, the MIDI representations are fully interpretable and transparent, thus enabling us to perform music editing flexibly. We encourage the readers to watch the supplementary video with audio turned on to experience the results. "



Paperid:503
Authors:Yonglong Tian, Dilip Krishnan, Phillip Isola
Title: Contrastive Multiview Coding
Abstract:
Humans view the world through many sensory channels, e.g., the long-wavelength light channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right ear. Each view is noisy and incomplete, but important factors, such as physics, geometry, and semantics, tend to be shared between all views (e.g., a ``dog'' can be seen, heard, and felt). We investigate the classic hypothesis that a powerful representation is one that models view-invariant factors. We study this hypothesis under the framework of multiview contrastive learning, where we learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact. Our approach scales to any number of views, and is view-agnostic. We analyze key properties of the approach that make it work, finding that the contrastive loss outperforms a popular alternative based on cross-view prediction, and that the more views we learn from, the better the resulting representation captures underlying scene semantics."
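
A minimal two-view contrastive objective of the kind described above can be sketched as follows; the temperature value and the use of in-batch negatives are assumptions of this simplified sketch, and the full approach extends to more than two views.

import torch
import torch.nn.functional as F

def two_view_contrastive_loss(z1, z2, temperature=0.07):
    # z1, z2: (B, D) embeddings of the same B scenes seen through two different views.
    # Matching view pairs attract; all other samples in the batch act as negatives.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarities between all cross-view pairs
    targets = torch.arange(z1.size(0), device=z1.device)
    # symmetric loss: view 1 retrieves its view-2 partner and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))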



Paperid:504
Authors:Yingwei Li, Song Bai, Cihang Xie, Zhenyu Liao, Xiaohui Shen, Alan Yuille
Title: Regional Homogeneity: Towards Learning Transferable Universal Adversarial Perturbations Against Defenses
Abstract:
This paper focuses on learning transferable adversarial examples specifically against defense models (models designed to defend against adversarial attacks). In particular, we show that a simple universal perturbation can fool a series of state-of-the-art defenses. Adversarial examples generated by existing attacks are generally hard to transfer to defense models. We observe the property of regional homogeneity in adversarial perturbations and suggest that the defenses are less robust to regionally homogeneous perturbations. Therefore, we propose an effective transforming paradigm and a customized gradient transformer module to transform existing perturbations into regionally homogeneous ones. Without explicitly forcing the perturbations to be universal, we observe that a well-trained gradient transformer module tends to output input-independent gradients (hence universal) benefiting from the under-fitting phenomenon. Thorough experiments demonstrate that our work significantly outperforms the prior art attacking algorithms (either image-dependent or universal ones) by an average improvement of 14.0% when attacking 9 defenses in the transfer-based attack setting. In addition to the cross-model transferability, we also verify that regionally homogeneous perturbations can well transfer across different vision tasks (attacking with the semantic segmentation task and testing on the object detection task). The code is available here: https://github.com/LiYingwei/Regional-Homogeneity."



Paperid:505
Authors:Shoukai Xu, Haokun Li, Bohan Zhuang, Jing Liu, Jiezhang Cao, Chuangrun Liang, Mingkui Tan
Title: Generative Low-bitwidth Data Free Quantization
Abstract:
Neural network quantization is an effective way to compress deep models and improve their execution latency and energy efficiency, so that they can be deployed on mobile or embedded devices. Existing quantization methods require original data for calibration or fine-tuning to get better performance. However, in many real-world scenarios, the data may not be available due to confidentiality or privacy issues, thereby making existing quantization methods not applicable. Moreover, due to the absence of original data, the recently developed generative adversarial networks (GANs) cannot be applied to generate data. Although the full-precision model may contain rich data information, such information alone is hard to exploit for recovering the original data or generating new meaningful data. In this paper, we investigate a simple-yet-effective method called Generative Low-bitwidth Data Free Quantization (GDFQ) to remove the data dependence burden. Specifically, we propose a knowledge matching generator to produce meaningful fake data by exploiting classification boundary knowledge and distribution information in the pre-trained model. With the help of generated data, we can quantize a model by learning knowledge from the pre-trained model. Extensive experiments on three data sets demonstrate the effectiveness of our method. More critically, our method achieves much higher accuracy on 4-bit quantization than the existing data free quantization method. Code is available at https://github.com/xushoukai/GDFQ."



Paperid:506
Authors:Xiaojie Li, Jianlong Wu, Hongyu Fang, Yue Liao, Fei Wang, Chen Qian
Title: Local Correlation Consistency for Knowledge Distillation
Abstract:
Sufficient knowledge extraction from the teacher network plays a critical role in the knowledge distillation task to improve the performance of the student network. Existing methods mainly focus on the consistency of instance-level features and their relationships, but neglect the local features and their correlation, which also contain many details and discriminative patterns. In this paper, we propose the local correlation exploration framework for knowledge distillation. It models three kinds of local knowledge, including intra-instance local relationship, inter-instance relationship on the same local position, and the inter-instance relationship across different local positions. Moreover, to make the student focus on those informative local regions of the teacher's feature maps, we propose a novel class-aware attention module to highlight the class-relevant regions and remove the confusing class-irrelevant regions, which makes the local correlation knowledge more accurate and valuable. We conduct extensive experiments and ablation studies on challenging datasets, including CIFAR100 and ImageNet, to show our superiority over the state-of-the-art methods."



Paperid:507
Authors:Jason Y. Zhang, Sam Pepose, Hanbyul Joo, Deva Ramanan, Jitendra Malik, Angjoo Kanazawa
Title: Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild
Abstract:
We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene, all from a single image in-the-wild captured in an uncontrolled environment. Notably, our method runs on datasets without any scene- or object-level 3D supervision. Our key insight is that considering humans and objects jointly gives rise to ``3D common sense'' constraints that can be used to resolve ambiguity. In particular, we introduce a scale loss that learns the distribution of object size from data; an occlusion-aware silhouette re-projection loss to optimize object pose; and a human-object interaction loss to capture the spatial layout of objects with which humans interact. We empirically validate that our constraints dramatically reduce the space of likely 3D spatial configurations. We demonstrate our approach on challenging, in-the-wild images of humans interacting with large objects (such as bicycles, motorcycles, and surfboards) and handheld objects (such as laptops, tennis rackets, and skateboards). We quantify the ability of our approach to recover human-object arrangements and outline remaining challenges in this relatively unexplored domain. The project webpage can be found at https://jasonyzhang.com/phosa."



Paperid:508
Authors:Hang Zhou, Xudong Xu, Dahua Lin, Xiaogang Wang, Ziwei Liu
Title: Sep-Stereo: Visually Guided Stereophonic Audio Generation by Associating Source Separation
Abstract:
Stereophonic audio is an indispensable ingredient to enhance human auditory experience. Recent research has explored the usage of visual information as guidance to generate binaural or ambisonic audio from mono ones with stereo supervision. However, this fully supervised paradigm suffers from an inherent drawback: the recording of stereophonic audio usually requires delicate devices that are expensive for wide accessibility. To overcome this challenge, we propose to leverage the vastly available mono data to facilitate the generation of stereophonic audio. Our key observation is that the task of visually indicated audio separation also maps independent audios to their corresponding visual positions, which shares a similar objective with stereophonic audio generation. We integrate both stereo generation and source separation into a unified framework, Sep-Stereo, by considering source separation as a particular type of audio spatialization. Specifically, a novel associative pyramid network architecture is carefully designed for audio-visual feature fusion. Extensive experiments demonstrate that our framework can improve the stereophonic audio generation results while performing accurate sound separation with a shared backbone."



Paperid:509
Authors:Yuanhan Zhang, ZhenFei Yin, Yidong Li, Guojun Yin, Junjie Yan, Jing Shao, Ziwei Liu
Title: CelebA-Spoof: Large-Scale Face Anti-Spoofing Dataset with Rich Annotations
Abstract:
As facial interaction systems are prevalently deployed, security and reliability of these systems become a critical issue, with substantial research efforts devoted. Among them, face anti-spoofing emerges as an important area, whose objective is to identify whether a presented face is live or spoof. Though promising progress has been achieved, existing works still have difficulty in handling complex spoof attacks and generalizing to real-world scenarios. The main reason is that current face anti-spoofing datasets are limited in both quantity and diversity. To overcome these obstacles, we contribute a large-scale face anti-spoofing dataset, \textbf{CelebA-Spoof}, with the following appealing properties: \textit{1) Quantity:} CelebA-Spoof comprises 625,537 pictures of 10,177 subjects, significantly larger than the existing datasets. \textit{2) Diversity:} The spoof images are captured from 8 scenes (2 environments and 4 illumination conditions) with more than 10 sensors. \textit{3) Annotation Richness:} CelebA-Spoof contains 10 spoof type annotations, as well as the 40 attribute annotations inherited from the original CelebA dataset. Equipped with CelebA-Spoof, we carefully benchmark existing methods in a unified multi-task framework, \textbf{Auxiliary Information Embedding Network (AENet)}, and reveal several valuable observations. Our key insight is that, compared with the commonly-used binary supervision or mid-level geometric representations, rich semantic annotations as auxiliary tasks can greatly boost the performance and generalizability of face anti-spoofing across a wide range of spoof attacks. Through comprehensive studies, we show that CelebA-Spoof serves as an effective training data source. Models trained on CelebA-Spoof (without fine-tuning) exhibit state-of-the-art performance on standard benchmarks such as CASIA-MFSD. The datasets are available at \href{https://github.com/Davidzhangyuanhan/CelebA-Spoof}{https://github.com/Davidzhangyuanhan/CelebA-Spoof}."



Paperid:510
Authors:Yuyang Qian, Guojun Yin, Lu Sheng, Zixuan Chen, Jing Shao
Title: Thinking in Frequency: Face Forgery Detection by Mining Frequency-aware Clues
Abstract:
As realistic facial manipulation technologies have achieved remarkable progress, social concerns about potential malicious abuse of these technologies bring out an emerging research topic of face forgery detection. However, it is extremely challenging since recent advances are able to forge faces beyond the perception ability of human eyes, especially in compressed images and videos. We find that mining forgery patterns with the awareness of frequency could be a cure, as frequency provides a complementary viewpoint where either subtle forgery artifacts or compression errors could be well described. To introduce frequency into face forgery detection, we propose a novel Frequency in Face Forgery Network (F$^3$-Net), taking advantage of two different but complementary frequency-aware clues, 1) frequency-aware decomposed image components, and 2) local frequency statistics, to deeply mine the forgery patterns via our two-stream collaborative learning framework. We apply DCT as the frequency-domain transformation. Through comprehensive studies, we show that the proposed F$^3$-Net significantly outperforms competing state-of-the-art methods on all compression qualities in the challenging FaceForensics++ dataset, and especially takes a clear lead on low-quality media."



Paperid:511
Authors:Kazuya Nishimura, Junya Hayashida, Chenyang Wang, Dai Fei Elmer Ker, Ryoma Bise
Title: Weakly-Supervised Cell Tracking via Backward-and-Forward Propagation
Abstract:
We propose a weakly-supervised cell tracking method that can train a convolutional neural network (CNN) by using only the annotation of "cell detection" (i.e., the coordinates of cell positions) without association information, in which cell positions can be easily obtained by nuclear staining. First, we train a co-detection CNN that detects cells in successive frames by using weak labels. Our key assumption is that the co-detection CNN implicitly learns association in addition to detection. To obtain the association information, we propose a backward-and-forward propagation method that analyzes the correspondence of cell positions in the detection maps output by the co-detection CNN. Experiments demonstrated that the proposed method can match positions by analyzing the co-detection CNN. Even though the method uses only weak supervision, its performance was almost the same as that of the state-of-the-art supervised method."



Paperid:512
Authors:John Yang, Hyung Jin Chang, Seungeui Lee, Nojun Kwak
Title: SeqHAND: RGB-Sequence-Based 3D Hand Pose and Shape Estimation
Abstract:
3D hand pose estimation based on RGB images has been studied for a long time. Most of the studies, however, have performed frame-by-frame estimation based on independent static images. In this paper, we attempt not only to consider the appearance of a hand but also to incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation performance, which leads to the necessity of a large-scale dataset with sequential RGB hand images. We propose a novel method that generates a synthetic dataset that mimics natural human hand movements by re-engineering annotations of an extant static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework, exploiting visuo-temporal features from sequential images of synthetic hands in motion and emphasizing temporal smoothness of estimations with a temporal consistency constraint. Our novel training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real allows preservation of the visuo-temporal features learned from sequential synthetic hand images. Hand poses that are sequentially estimated consequently produce natural and smooth hand movements, which lead to more robust estimations. We show that utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimations by outperforming state-of-the-art methods in experiments on hand pose estimation benchmarks."



Paperid:513
Authors:Zijie Zhuang, Longhui Wei, Lingxi Xie, Tianyu Zhang, Hengheng Zhang , Haozhe Wu, Haizhou Ai, Qi Tian
Title: Rethinking the Distribution Gap of Person Re-identification with Camera-based Batch Normalization
Abstract:
The fundamental difficulty in person re-identification (ReID) lies in learning the correspondence among individual cameras. It strongly demands costly inter-camera annotations, yet the trained models are not guaranteed to transfer well to previously unseen cameras. These problems significantly limit the application of ReID. This paper rethinks the working mechanism of conventional ReID approaches and puts forward a new solution. With an effective operator named Camera-based Batch Normalization (CBN), we force the image data of all cameras to fall onto the same subspace, so that the distribution gap between any camera pair is largely shrunk. This alignment brings two benefits. First, the trained model enjoys better abilities to generalize across scenarios with unseen cameras as well as transfer across multiple training sets. Second, we can rely on intra-camera annotations, which have been undervalued before due to the lack of cross-camera information, to achieve competitive ReID performance. Experiments on a wide range of ReID tasks demonstrate the effectiveness of our approach. The code is available at https://github.com/automan000/Camera-based-Person-ReID."
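
A rough sketch of the camera-wise normalization idea follows, assuming features arrive in a mixed-camera batch with an integer camera index per sample; the function below is an illustration only, not the authors' implementation.

import torch

def camera_batch_norm(feats, cam_ids, eps=1e-5):
    """Normalize each camera's features with that camera's own statistics.

    feats:   (N, C) feature vectors from a mixed-camera batch.
    cam_ids: (N,)   integer camera index per sample.
    """
    out = torch.empty_like(feats)
    for cam in cam_ids.unique():
        mask = cam_ids == cam
        x = feats[mask]
        mean = x.mean(dim=0, keepdim=True)
        var = x.var(dim=0, unbiased=False, keepdim=True)
        out[mask] = (x - mean) / torch.sqrt(var + eps)
    return out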



Paperid:514
Authors:Xiaobing Zhang, Shijian Lu, Haigang Gong, Zhipeng Luo, Ming Liu
Title: AMLN: Adversarial-based Mutual Learning Network for Online Knowledge Distillation
Abstract:
Online knowledge distillation has attracted increasing interest recently, which jointly learns teacher and student models or an ensemble of student models simultaneously and collaboratively. On the other hand, existing works focus more on outcome-driven learning according to knowledge like classification probabilities whereas the distilling processes which capture rich and useful intermediate features and information are largely neglected. In this work, we propose an innovative adversarial-based mutual learning network (AMLN) that introduces process-driven learning beyond outcome-driven learning for augmented online knowledge distillation. A block-wise training module is designed which guides the information flow and mutual learning among peer networks adversarially throughout different learning stages, and this spreads until the final network layer which captures more high-level information. AMLN has been evaluated under a variety of network architectures over three widely used benchmark datasets. Extensive experiments show that AMLN achieves superior performance consistently against state-of-the-art knowledge transfer methods."



Paperid:515
Authors:Jiangyue Xia, Anyi Rao, Qingqiu Huang, Linning Xu, Jiangtao Wen, Dahua Lin
Title: Online Multi-modal Person Search in Videos
Abstract:
The task of searching for certain people in videos has seen increasing potential in real-world applications, such as video organization and editing. Most existing approaches are devised to work in an offline manner, where identities can only be inferred after an entire video is examined. This working manner precludes such methods from being applied to online services or those applications that require real-time responses. In this paper, we propose an online person search framework, which can recognize people in a video on the fly. This framework maintains a multi-modal memory bank at its heart as the basis for person recognition, and updates it dynamically with a policy obtained by reinforcement learning. Our experiments on a large movie dataset show that the proposed method is effective, not only achieving remarkable improvements over strong online schemes but also outperforming offline methods."



Paperid:516
Authors:Ben Niu, Weilei Wen, Wenqi Ren, Xiangde Zhang, Lianping Yang, Shuzhen Wang, Kaihao Zhang, Xiaochun Cao, Haifeng Shen
Title: Single Image Super-Resolution via a Holistic Attention Network
Abstract:
Informative features play a crucial role in the single image super-resolution task. Channel attention has been demonstrated to be effective for preserving information-rich features in each layer. However, channel attention treats each convolution layer as a separate process, which overlooks the correlation among different layers. To address this problem, we propose a new holistic attention network (HAN), which consists of a layer attention module (LAM) and a channel-spatial attention module (CSAM), to model the holistic interdependencies among layers, channels, and positions. Specifically, the proposed LAM adaptively emphasizes hierarchical features by considering correlations among layers. Meanwhile, CSAM learns the confidence at all the positions of each channel to selectively capture more informative features. Extensive experiments demonstrate that the proposed HAN performs favorably against the state-of-the-art single image super-resolution approaches."
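
To make the layer attention idea concrete, here is a hedged sketch of one way such a module could weigh hierarchical features; the tensor layout and the scaled softmax over an L x L layer-correlation matrix are illustrative assumptions rather than the exact LAM design.

import torch
import torch.nn.functional as F

def layer_attention(feats):
    """feats: (N, L, C, H, W) -- feature maps collected from L layers.

    Computes an L x L correlation among layers, then reweights and
    fuses the layer-wise feature maps with a residual connection.
    """
    n, l, c, h, w = feats.shape
    flat = feats.reshape(n, l, -1)                    # (N, L, C*H*W)
    corr = torch.bmm(flat, flat.transpose(1, 2))      # (N, L, L)
    attn = F.softmax(corr / flat.size(-1) ** 0.5, dim=-1)
    fused = torch.bmm(attn, flat).reshape(n, l, c, h, w)
    return feats + fused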



Paperid:517
Authors:Amir Markovitz, Inbal Lavi, Or Perel, Shai Mazor, Roee Litman
Title: Can You Read Me Now? Content Aware Rectification using Angle Supervision
Abstract:
The ubiquity of smartphone cameras has led to more and more documents being captured by cameras rather than scanned. Unlike flatbed scanners, photographed documents are often folded and crumpled, resulting in large local variance in text structure. The problem of document rectification is fundamental to the Optical Character Recognition (OCR) process on documents, and its ability to overcome geometric distortions significantly affects recognition accuracy. Despite the great progress in recent OCR systems, most still rely on a pre-process that ensures the text lines are straight and axis aligned. Recent works have tackled the problem of rectifying document images taken in-the-wild using various supervision signals and alignment means. However, they focused on global features that can be extracted from the document's boundaries, ignoring various signals that could be obtained from the document's content. We present CREASE: Content Aware Rectification using Angle Supervision, the first learned method for document rectification that relies on the document's content, the location of the words and specifically their orientation, as hints to assist in the rectification process. We utilize a novel pixel-wise angle regression approach and a curvature estimation side-task for optimizing our rectification model. Our method surpasses previous approaches in terms of OCR accuracy, geometric error and visual similarity."



Paperid:518
Authors:Hongwei Yong, Jianqiang Huang, Deyu Meng, Xiansheng Hua, Lei Zhang
Title: Momentum Batch Normalization for Deep Learning with Small Batch Size
Abstract:
Normalization layers play an important role in deep network training. As one of the most popular normalization techniques, batch normalization (BN) has shown its effectiveness in accelerating the model training speed and improving model generalization capability. The success of BN has been explained from different views, such as reducing internal covariate shift, allowing the use of large learning rates, smoothing the optimization landscape, etc. To gain a deeper understanding of BN, in this work we prove that BN actually introduces a certain level of noise into the sample mean and variance during the training process, while the noise level depends only on the batch size. Such a noise generation mechanism of BN regularizes the training process, and we present an explicit regularizer formulation of BN. Since the regularization strength of BN is determined by the batch size, a small batch size may cause the under-fitting problem, resulting in a less effective model. To reduce the dependency of BN on batch size, we propose a momentum BN (MBN) scheme by averaging the mean and variance of the current mini-batch with the historical means and variances. With a dynamic momentum parameter, we can automatically control the noise level in the training process. As a result, MBN works very well even when the batch size is very small (e.g., 2), which is hard to achieve by traditional BN."
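
A minimal sketch of the averaging idea described above, assuming activations are (N, C) and that the momentum parameter is supplied externally (the paper makes it dynamic); this is an illustration, not the authors' code.

import torch

def momentum_bn(x, run_mean, run_var, momentum, gamma, beta, eps=1e-5):
    """One training-time normalization step in the spirit of momentum BN.

    x: (N, C) activations; run_mean / run_var: historical statistics (C,);
    momentum: scalar in (0, 1] controlling how much the current batch
    contributes, so small batches inject less statistical noise.
    """
    batch_mean = x.mean(dim=0)
    batch_var = x.var(dim=0, unbiased=False)
    # Blend current-batch statistics with the historical ones.
    mean = momentum * batch_mean + (1.0 - momentum) * run_mean
    var = momentum * batch_var + (1.0 - momentum) * run_var
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # Return the blended statistics so the caller can store them as the
    # new running values (a simplification of the actual update rule).
    return gamma * x_hat + beta, mean.detach(), var.detach()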



Paperid:519
Authors:Abdullah Hamdi, Sara Rojas, Ali Thabet, Bernard Ghanem
Title: AdvPC: Transferable Adversarial Perturbations on 3D Point Clouds
Abstract:
Deep neural networks are vulnerable to adversarial attacks, in which imperceptible perturbations to their input lead to erroneous network predictions. This phenomenon has been extensively studied in the image domain, and has only recently been extended to 3D point clouds. In this work, we present novel data-driven adversarial attacks against 3D point cloud networks. We aim to address the following problems in current 3D point cloud adversarial attacks: they do not transfer well between different networks, and they are easy to defend against via simple statistical methods. To this end, we develop a new point cloud attack (dubbed AdvPC) that exploits the input data distribution by adding an adversarial loss, after Auto-Encoder reconstruction, to the objective it optimizes. AdvPC leads to perturbations that are resilient against current defenses, while remaining highly transferable compared to state-of-the-art attacks. We test AdvPC using four popular point cloud networks: PointNet, PointNet++ (MSG and SSG), and DGCNN. Our proposed attack increases the attack success rate by up to 40% for attacks transferred to unseen networks (transferability), while maintaining a high success rate on the attacked network. AdvPC also increases the ability to break defenses by up to 38% as compared to other baselines on the ModelNet40 dataset. The code is available at https://github.com/ajhamdi/AdvPC ."



Paperid:520
Authors:Gusi Te, Yinglu Liu, Wei Hu, Hailin Shi, Tao Mei
Title: Edge-aware Graph Representation Learning and Reasoning for Face Parsing
Abstract:
Face parsing infers a pixel-wise label for each facial component, which has drawn much attention recently. Previous methods have shown their efficiency in face parsing, but they overlook the correlation among different face regions. The correlation is a critical clue about the facial appearance, pose, expression, etc., and should be taken into account for face parsing. To this end, we propose to model and reason about the region-wise relations by learning graph representations, and leverage the edge information between regions for optimized abstraction. Specifically, we encode a facial image onto a global graph representation where a collection of pixels ("regions") with similar features are projected to each vertex. Our model learns and reasons over relations between the regions by propagating information across vertices on the graph. Furthermore, we incorporate the edge information to aggregate the pixel-wise features onto vertices, which emphasizes the features around edges for fine segmentation along edges. The final graph representation is projected back to pixel grids for parsing. Experiments demonstrate that our model outperforms state-of-the-art methods on the widely used Helen dataset, and also exhibits superior performance on the large-scale CelebAMask-HQ dataset."



Paperid:521
Authors:Deng-Ping Fan, Yingjie Zhai, Ali Borji, Jufeng Yang, Ling Shao
Title: BBS-Net: RGB-D Salient Object Detection with a Bifurcated Backbone Strategy Network
Abstract:
Multi-level feature fusion is a fundamental topic in computer vision for detecting, segmenting, and classifying objects at various scales. When multi-level features meet multi-modal cues, the optimal fusion problem becomes a hot potato. In this paper, we make the first attempt to leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to develop a novel cascaded refinement network. In particular, we 1) propose a bifurcated backbone strategy (BBS) to split the multi-level features into teacher and student features, and 2) utilize a depth-enhanced module (DEM) to excavate informative parts of depth cues from the channel and spatial views. This fuses RGB and depth modalities in a complementary way. Our simple yet efficient architecture, dubbed Bifurcated Backbone Strategy Network (BBS-Net), is backbone independent, and outperforms 18 SOTAs on seven challenging datasets using four metrics."



Paperid:522
Authors:Behnaz Rezaei, Amirreza Farnoosh, Sarah Ostadabbas
Title: G-LBM:Generative Low-dimensional Background Model Estimation from Video Sequences
Abstract:
In this paper, we propose a computationally tractable and theoretically supported non-linear low-dimensional generative model to represent real-world data in the presence of noise and sparse outliers. The non-linear low-dimensional manifold discovery of data is done by describing a joint distribution over observations and their low-dimensional representations (i.e. manifold coordinates). Our model, called generative low-dimensional background model (G-LBM), admits variational operations on the distribution of the manifold coordinates and simultaneously generates a low-rank structure of the latent manifold given the data. Therefore, our probabilistic model contains the intuition of non-probabilistic low-dimensional manifold learning. G-LBM selects the intrinsic dimensionality of the underlying manifold of the observations, and its probabilistic nature models the noise in the observation data. G-LBM has direct application in background scene model estimation from video sequences, and we have evaluated its performance on the SBMnet-2016 and BMC2012 datasets, where it achieved performance higher than or comparable to other state-of-the-art methods while being agnostic to the background scenes in the videos. Besides, in challenges such as camera jitter and background motion, G-LBM is able to robustly estimate the background by effectively modeling the uncertainties in video observations in these scenarios."



Paperid:523
Authors:Zaiwei Zhang, Bo Sun, Haitao Yang, Qixing Huang
Title: H3DNet: 3D Object Detection Using Hybrid Geometric Primitives
Abstract:
We introduce H3DNet, which takes a colorless 3D point cloud as input and outputs a collection of oriented object bounding boxes (or BB) and their semantic labels. The critical idea of H3DNet is to predict a hybrid set of geometric primitives, i.e., BB centers, BB face centers, and BB edge centers. We show how to convert the predicted geometric primitives into object proposals by defining a distance function between an object and the geometric primitives. This distance function enables continuous optimization of object proposals, and its local minimums provide high-fidelity object proposals. H3DNet then utilizes a matching and refinement module to classify object proposals into detected objects and fine-tune the geometric parameters of the detected objects. The hybrid set of geometric primitives not only provides more accurate signals for object detection than using a single type of geometric primitives, but it also provides an overcomplete set of constraints on the resulting 3D layout. Therefore, H3DNet can tolerate outliers in predicted geometric primitives. Our model achieves state-of-the-art 3D detection results on two large datasets with real 3D scans, ScanNet and SUN RGB-D."



Paperid:524
Authors:Hang Chu, Shugao Ma, Fernando De la Torre, Sanja Fidler, Yaser Sheikh
Title: Expressive Telepresence via Modular Codec Avatars
Abstract:
VR telepresence consists of interacting with another human in a virtual space represented by an avatar. Today most avatars are cartoon-like, but soon the technology will allow video-realistic ones. This paper aims in this direction and presents Modular Codec Avatars (MCA), a method to generate hyper-realistic faces driven by the cameras in the VR headset. MCA extends traditional Codec Avatars (CA) by replacing the holistic models with a learned modular representation. It is important to note that traditional person-specific CAs are learned from few training samples, and typically lack robustness and have limited expressiveness when transferring facial expressions. MCAs solve these issues by learning a modulated adaptive blending of different facial components as well as an exemplar-based latent alignment. We demonstrate that MCA achieves improved expressiveness and robustness w.r.t. CA in a variety of real-world datasets and practical scenarios. Finally, we showcase new applications in VR telepresence enabled by the proposed model."



Paperid:525
Authors:Ao Luo, Xin Li, Fan Yang, Zhicheng Jiao, Hong Cheng, Siwei Lyu
Title: Cascade Graph Neural Networks for RGB-D Salient Object Detection
Abstract:
In this paper, we study the problem of salient object detection for RGB-D images by using both color and depth information. A major technical challenge for detecting salient objects in RGB-D images is to fully leverage the two complementary data sources. The existing works either simply distill prior knowledge from the corresponding depth map to handle the RGB image or blindly fuse color and geometric information to generate the depth-aware representations, hindering the performance of RGB-D saliency detectors. In this work, we introduce Cascade Graph Neural Networks (Cas-Gnn), a unified framework which is capable of comprehensively distilling and reasoning the mutual benefit between these two data sources through a set of cascade graphs, to learn powerful representations for RGB-D salient object detection. Cas-Gnn processes the two data sources separately and employs a novel Cascade Graph Reasoning (CGR) module to learn the powerful dense feature embeddings so that the saliency map can be easily inferred. Different from previous approaches, Cas-Gnn, by explicitly modeling and reasoning high-level relations between complementary data sources, can overcome many challenges like occlusions and ambiguities. Extensive experiments on several widely-used benchmarks demonstrate that Cas-Gnn achieves significantly better performance than all existing RGB-D SOD approaches."



Paperid:526
Authors:Vishnu Suresh Lokhande, Aditya Kumar Akash, Sathya N. Ravi, Vikas Singh
Title: FairALM: Augmented Lagrangian Method for Training Fair Models with Little Regret
Abstract:
Algorithmic decision making based on computer vision and machine learning technologies continues to permeate our lives. But issues related to the biases of these models, and the extent to which they treat certain segments of the population unfairly, have led to concern in the general public. It is now accepted that, because of biases in the datasets we present to the models, fairness-oblivious training will lead to unfair models. An interesting topic is the study of mechanisms via which the de novo design or training of the model can be informed by fairness measures. Here, we study mechanisms that impose fairness concurrently while training the model. While existing fairness-based approaches in vision have largely relied on training adversarial modules together with the primary classification/regression task, in an effort to remove the influence of the protected attribute or variable, we show how ideas based on well-known optimization concepts can provide a simpler alternative. In our proposed scheme, imposing fairness just requires specifying the protected attribute and utilizing our optimization routine. We provide a detailed technical analysis and present experiments demonstrating that various fairness measures from the literature can be reliably imposed on a number of training tasks in vision in a manner that is interpretable."



Paperid:527
Authors:Megha Nawhal, Mengyao Zhai, Andreas Lehrmann, Leonid Sigal, Greg Mori
Title: Generating Videos of Zero-Shot Compositions of Actions and Objects
Abstract:
Human activity videos involve rich, varied interactions between people and objects. In this paper we develop methods for generating such videos -- making progress toward addressing the important, open problem of video generation in complex scenes. In particular, we introduce the task of generating human-object interaction videos in a zero-shot compositional setting, i.e., generating videos for action-object compositions that are unseen during training, having seen the target action and target object separately. This setting is particularly important for generalization in human activity video generation, obviating the need to observe every possible action-object combination in training and thus avoiding the combinatorial explosion involved in modeling complex scenes. To generate human-object interaction videos, we propose a novel adversarial framework HOI-GAN which includes multiple discriminators focusing on different aspects of a video. To demonstrate the effectiveness of our proposed framework, we perform extensive quantitative and qualitative evaluation on two challenging datasets: EPIC-Kitchens and 20BN-Something-Something v2."



Paperid:528
Authors:Zhe Wang, Zhiyuan Fang, Jun Wang, Yezhou Yang
Title: ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language
Abstract:
Person search by natural language aims at retrieving a specific person from a large-scale image pool that matches given textual descriptions. While most current methods treat the task as holistic visual and textual feature matching, we approach it from an attribute-aligning perspective that allows grounding specific attribute phrases to the corresponding visual regions. We achieve success as well as a performance boost through robust feature learning in which the referred identity can be accurately bundled by multiple attribute cues. To be concrete, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle the feature space of a person into sub-spaces corresponding to attributes using a light auxiliary attribute segmentation layer. It then aligns these visual features with the textual attributes parsed from the sentences via a novel contrastive learning loss. We validate our ViTAA framework through extensive experiments on tasks of person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performance."



Paperid:529
Authors:Lu Yang, Qing Song, Zhihui Wang, Mengjie Hu, Chun Liu, Xueshi Xin, Wenhe Jia, Songcen Xu
Title: Renovating Parsing R-CNN for Accurate Multiple Human Parsing
Abstract:
Multiple human parsing aims to segment various human parts and associate each part with the corresponding instance simultaneously. This is a very challenging task due to the diverse human appearance, semantic ambiguity of different body parts and clothing, and complex background. Through analysis of the human parsing task, we observe that human-centric context perception and accurate instance-level parsing scoring are particularly important for obtaining high-quality results. However, most state-of-the-art methods have not paid enough attention to these problems. To address them, we present Renovating Parsing R-CNN (RP R-CNN), which introduces a global semantic enhanced feature pyramid network and a parsing re-scoring network into the existing high-performance pipeline. The proposed RP R-CNN adopts global semantic features to enhance multi-scale features for generating human parsing results, and regresses a confidence score to represent their quality. Extensive experiments show that RP R-CNN performs favorably against state-of-the-art methods on the CIHP and MHP-v2 datasets. Code and models will be publicly available."



Paperid:530
Authors:Qing Yu, Daiki Ikami, Go Irie, Kiyoharu Aizawa
Title: Multi-Task Curriculum Framework for Open-Set Semi-Supervised Learning
Abstract:
Semi-supervised learning (SSL) has been proposed to leverage unlabeled data for training powerful models when only limited labeled data is available. While existing SSL methods assume that the labeled and unlabeled data share the same set of classes, we address a more complex novel scenario named open-set SSL, where out-of-distribution (OOD) samples are contained in the unlabeled data. Instead of training an OOD detector and SSL separately, we propose a multi-task curriculum learning framework. First, to detect the OOD samples in the unlabeled data, we estimate the probability of each sample belonging to OOD. We use a joint optimization framework, which updates the network parameters and the OOD scores alternately. Simultaneously, to achieve high performance on the classification of in-distribution (ID) data, we select ID samples in the unlabeled data having small OOD scores, and use these data together with the labeled data for training the deep neural networks to classify ID samples in a semi-supervised manner. We conduct several experiments, and our method achieves state-of-the-art results by successfully eliminating the effect of OOD samples."



Paperid:531
Authors:Zhao Zhang, Wenda Jin, Jun Xu, Ming-Ming Cheng
Title: Gradient-Induced Co-Saliency Detection
Abstract:
Co-saliency detection (Co-SOD) aims to segment the common salient foreground in a group of relevant images. In this paper, inspired by human behavior, we propose a gradient-induced co-saliency detection (GICD) method. We first abstract a consensus representation for the grouped images in the embedding space; then, by comparing the single image with consensus representation, we utilize the feedback gradient information to induce more attention to the discriminative co-salient features. In addition, due to the lack of Co-SOD training data, we design a jigsaw training strategy, with which Co-SOD networks can be trained on general saliency datasets without extra pixel-level annotations. To evaluate the performance of Co-SOD methods on discovering the co-salient object among multiple foregrounds, we construct a challenging CoCA dataset, where each image contains at least one extraneous foreground along with the co-salient object. Experiments demonstrate that our GICD achieves state-of-the-art performance. Our codes and dataset are available at https://mmcheng.net/gicd/."



Paperid:532
Authors:Wending Yan, Robby T. Tan, Dengxin Dai
Title: Nighttime Defogging Using High-Low Frequency Decomposition and Grayscale-Color Networks
Abstract:
In this paper, we address the problem of nighttime defogging from a single image. We propose a framework consisting of two main modules: grayscale and color modules. Given an RGB foggy nighttime image, our grayscale module takes the grayscale version of the image as input, and decomposes it into high and low frequency layers. The high frequency layer contains the scene texture information, which is less affected by fog; and the low frequency layer contains the scene layout/structure information, including fog and glow. Our grayscale module then enhances the visibility of the textures in the high frequency layers, and removes the presence of glow and fog in the low frequency layers. Having processed the high/low frequency information, it fuses the layers to obtain a grayscale defogged image. Our second module, the color module, takes the original RGB image as input. The module has similar operations to those of the grayscale module. However, to obtain fog-free high and low frequency information, the module is guided by the grayscale module. The reason for this is that grayscale images are less affected by the multiple colors of atmospheric light that are commonly present in nighttime scenes. Moreover, having the grayscale module allows us to have consistency losses between the outputs of the two modules, which is critical to our framework, since we do not have paired ground truths for our real data. Our extensive experiments on real foggy nighttime images show the effectiveness of our method."



Paperid:533
Authors:Yuhui Yuan, Jingyi Xie, Xilin Chen, Jingdong Wang
Title: SegFix: Model-Agnostic Boundary Refinement for Segmentation
Abstract:
We present a model-agnostic post-processing scheme to improve the boundary quality for the segmentation result that is generated by any existing segmentation model. Motivated by the empirical observation that the label predictions of interior pixels are more reliable, we propose to replace the originally unreliable predictions of boundary pixels by the predictions of interior pixels. Our approach processes only the input image through two steps: (i) localize the boundary pixels and (ii) identify the corresponding interior pixel for each boundary pixel. We build the correspondence by learning a direction away from the boundary pixel to an interior pixel. Our method requires no prior information of the segmentation models and achieves nearly real-time speed. We empirically verify that our SegFix consistently reduces the boundary errors for segmentation results generated from various state-of-the-art models on Cityscapes, ADE20K and GTA5. "
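
A small post-processing sketch of the boundary-to-interior replacement step described above, assuming a predicted boundary mask and a per-pixel integer offset map pointing toward an interior pixel; the names and shapes are illustrative, not the released SegFix interface.

import numpy as np

def refine_labels(labels, boundary_mask, offsets):
    """Replace boundary-pixel labels by the labels of interior pixels.

    labels:        (H, W) integer segmentation from any model.
    boundary_mask: (H, W) bool, True where a pixel is near a boundary.
    offsets:       (H, W, 2) integer (dy, dx) pointing to an interior pixel.
    """
    h, w = labels.shape
    refined = labels.copy()
    ys, xs = np.nonzero(boundary_mask)
    ty = np.clip(ys + offsets[ys, xs, 0], 0, h - 1)
    tx = np.clip(xs + offsets[ys, xs, 1], 0, w - 1)
    refined[ys, xs] = labels[ty, tx]
    return refined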



Paperid:534
Authors:Cunjun Yu, Xiao Ma, Jiawei Ren, Haiyu Zhao, Shuai Yi
Title: Spatio-Temporal Graph Transformer Networks for Pedestrian Trajectory Prediction
Abstract:
Understanding crowd motion dynamics is critical to real-world applications, e.g., surveillance systems and autonomous driving. This is challenging because it requires effectively modeling the socially aware crowd spatial interaction and complex temporal dependencies. We believe attention is the most important factor for trajectory prediction. In this paper, we present STAR, a Spatio-Temporal grAph tRansformer framework, which tackles trajectory prediction by only attention mechanisms. STAR models intra-graph crowd interaction by TGConv, a novel Transformer-based graph convolution mechanism. The inter-graph temporal dependencies are modeled by separate temporal Transformers. STAR captures complex spatio-temporal interactions by interleaving between spatial and temporal Transformers. To calibrate the temporal prediction for the long-lasting effect of disappeared pedestrians, we introduce a read-writable external memory module, consistently being updated by the temporal Transformer. We show STAR outperforms the state-of-the-art models on 4 out of 5 real-world pedestrian trajectory prediction datasets and achieves comparable performance on the rest."



Paperid:535
Authors:Egor Zakharov, Aleksei Ivakhnenko, Aliaksandra Shysheya, Victor Lempitsky
Title: Fast Bi-layer Neural Synthesis of One-Shot Realistic Head Avatars
Abstract:
We propose a neural rendering-based system that creates head avatars from a single photograph. Our approach models a person's appearance by decomposing it into two layers. The first layer is a pose-dependent coarse image that is synthesized by a small neural network. The second layer is defined by a pose-independent texture image that contains high-frequency details. The texture image is generated offline, warped and added to the coarse image to ensure a high effective resolution of synthesized head views. We compare our system to analogous state-of-the-art systems in terms of visual quality and speed. The experiments show significant inference speedup over previous neural head avatar models for a given visual quality. We also report on a real-time smartphone-based implementation of our system."



Paperid:536
Authors:Jinwoo Lee, Minhyuk Sung, Hyunjoon Lee, Junho Kim
Title: Neural Geometric Parser for Single Image Camera Calibration
Abstract:
We propose a neural geometric parser learning single image camera calibration for man-made scenes. Unlike previous neural approaches that rely only on semantic cues obtained from neural networks, our approach considers both semantic and geometric cues, resulting in significant accuracy improvement. The proposed framework consists of two networks. Using line segments of an image as geometric cues, the first network estimates the zenith vanishing point and generates several candidates consisting of the camera rotation and focal length. The second network evaluates each candidate based on the given image and the geometric cues, where prior knowledge of man-made scenes is used for the evaluation. With the supervision of datasets consisting of the horizontal line and focal length of the images, our networks can be trained to estimate the same camera parameters. Based on the Manhattan world assumption, we can further estimate the camera rotation and focal length in a weakly supervised manner. The experimental results reveal that the performance of our neural approach is significantly higher than that of existing state-of-the-art camera calibration techniques for single images of indoor and outdoor scenes."



Paperid:537
Authors:Yuxiang Wei, Ming Liu, Haolin Wang, Ruifeng Zhu, Guosheng Hu, Wangmeng Zuo
Title: Learning Flow-based Feature Warping for Face Frontalization with Illumination Inconsistent Supervision
Abstract:
Despite recent advances in deep learning-based face frontalization methods, photo-realistic and illumination preserving frontal face synthesis is still challenging due to large pose and illumination discrepancy during training. We propose a novel Flow-based Feature Warping Model (FFWM) which can learn to synthesize photo-realistic and illumination preserving frontal images with illumination inconsistent supervision. Specifically, an Illumination Preserving Module (IPM) is proposed to learn illumination preserving image synthesis from illumination inconsistent image pairs. IPM includes two pathways which collaborate to ensure the synthesized frontal images are illumination preserving and with fine details. Moreover, a Warp Attention Module (WAM) is introduced to reduce the pose discrepancy in the feature level, and hence to synthesize frontal images more effectively and preserve more details of profile images. The attention mechanism in WAM helps reduce the artifacts caused by the displacements between the profile and the frontal images. Quantitative and qualitative experimental results show that our FFWM can synthesize photo-realistic and illumination preserving frontal images and performs favorably against state-of-the-art results."



Paperid:538
Authors:Dahyun Kim, Kunal Pratap Singh, Jonghyun Choi
Title: Learning Architectures for Binary Networks
Abstract:
Backbone architectures of most binary networks are well-known floating point architectures such as the ResNet family. Questioning that the architectures designed for floating point networks would not be the best for binary networks, we propose to search architectures for binary networks (BNAS) by defining a new search space for binary architectures and a novel search objective. Specifically, based on the cell based search method, we define the new search space of binary layer types, design a new cell template, and rediscover the utility of and propose to use the Zeroise layer instead of using it as a placeholder. The novel search objective diversifies early search to learn better performing binary architectures. We show that our proposed method searches architectures with stable training curves despite the quantization error inherent in binary networks. Quantitative analyses demonstrate that our searched architectures outperform the architectures used in state-of-the-art binary networks and outperform or perform on par with state-of-the-art binary networks that employ various techniques other than architectural changes."



Paperid:539
Authors:Hsin-Ping Huang, Hung-Yu Tseng, Hsin-Ying Lee, Jia-Bin Huang
Title: Semantic View Synthesis
Abstract:
We tackle a new problem of semantic view synthesis --- generating free-viewpoint rendering of a synthesized scene using a semantic label map as input. We build upon recent advances in semantic image synthesis and view synthesis for handling photographic image content generation and view extrapolation. Direct application of existing image/view synthesis methods, however, results in severe ghosting/blurry artifacts. To address the drawbacks, we propose a two-step approach. First, we focus on synthesizing the color and depth of the visible surface of the 3D scene. We then use the synthesized color and depth to impose explicit constraints on the multiple-plane image (MPI) representation prediction process. Our method produces sharp contents at the original view and geometrically consistent renderings across novel viewpoints. The experiments on numerous indoor and outdoor images show favorable results against several strong baselines and validate the effectiveness of our approach."



Paperid:540
Authors:Daichi Iwata, Michael Waechter, Wen-Yan Lin, Yasuyuki Matsushita
Title: An Analysis of Sketched IRLS for Accelerated Sparse Residual Regression
Abstract:
This paper studies the problem of sparse residual regression, i.e., learning a linear model using a norm that favors solutions in which the residuals are sparsely distributed. This is a common problem in a wide range of computer vision applications where a linear system has a lot more equations than unknowns and we wish to find the maximum feasible set of equations by discarding unreliable ones. We show that one of the most popular solution methods, iteratively reweighted least squares (IRLS), can be significantly accelerated by the use of matrix sketching. We analyze the convergence behavior of the proposed method and show its efficiency on a range of computer vision applications."
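
As a toy illustration of the idea, the sketch below runs IRLS for an L1-type residual norm and compresses each reweighted least-squares solve with a Gaussian sketching matrix; the sketch size, weight update, and fixed iteration count are illustrative choices rather than the paper's exact scheme.

import numpy as np

def sketched_irls(A, b, sketch_rows=200, iters=20, eps=1e-6):
    """Approximately solve min_x ||Ax - b||_1 via sketched IRLS.

    A: (m, n) with m >> n;  b: (m,).  Each iteration reweights the
    residuals and solves a randomly sketched least-squares problem.
    """
    m, n = A.shape
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    for _ in range(iters):
        r = A @ x - b
        w = 1.0 / np.sqrt(np.abs(r) + eps)        # IRLS weights for the L1 norm
        Aw, bw = A * w[:, None], b * w
        S = np.random.randn(sketch_rows, m) / np.sqrt(sketch_rows)
        x = np.linalg.lstsq(S @ Aw, S @ bw, rcond=None)[0]
    return x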



Paperid:541
Authors:Ivan Eichhardt, Daniel Barath
Title: Relative Pose from Deep Learned Depth and a Single Affine Correspondence
Abstract:
We propose a new approach for combining deep-learned nonmetric monocular depth with affine correspondences (ACs) to estimate the relative pose of two calibrated cameras from a single correspondence. Considering the depth information and affine features, two new constraints on the camera pose are derived. The proposed solver is usable within 1-point RANSAC approaches. Thus, the processing time of the robust estimation is linear in the number of correspondences and, therefore, orders of magnitude faster than by using traditional approaches. The proposed 1AC+D solver is tested both on synthetic data and on 110 395 publicly available real image pairs where we used an off-the-shelf monocular depth network to provide up-to-scale depth per pixel. The proposed 1AC+D leads to similar accuracy as traditional approaches while being significantly faster. When solving large-scale problems, e.g. pose-graph initialization for Structure-from-Motion (SfM) pipelines, the overhead of obtaining ACs and monocular depth is negligible compared to the speed-up gained in the pairwise geometric verification, i.e., relative pose estimation. This is demonstrated on scenes from the 1DSfM dataset using a state-of-the-art global SfM algorithm."



Paperid:542
Authors:Takashi Isobe, Xu Jia, Shuhang Gu, Songjiang Li, Shengjin Wang, Qi Tian
Title: Video Super-Resolution with Recurrent Structure-Detail Network
Abstract:
Most video super-resolution methods super-resolve a single reference frame with the help of neighboring frames in a temporal sliding window. They are less efficient compared to recurrent-based methods. In this work, we propose a novel recurrent video super-resolution method which is both effective and efficient in exploiting previous frames to super-resolve the current frame. It divides the input into structure and detail components, which are fed to a recurrent unit composed of several proposed two-stream structure-detail blocks. In addition, a hidden state adaptation module that allows the current frame to selectively use information from the hidden state is introduced to enhance its robustness to appearance change and error accumulation. Extensive ablation studies validate the effectiveness of the proposed modules. Experiments on several benchmark datasets demonstrate the superior performance of the proposed method compared to state-of-the-art methods on video super-resolution."
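
One simple way to realize such a structure-detail split is a low-pass/residual decomposition, sketched below; the averaging-filter choice and kernel size are assumptions for illustration, not necessarily the decomposition used in the paper.

import torch.nn.functional as F

def split_structure_detail(frames, kernel_size=5):
    """frames: (N, C, H, W). Returns (structure, detail), where
    structure is the low-frequency content from average pooling and
    detail is the residual high-frequency content."""
    pad = kernel_size // 2
    structure = F.avg_pool2d(frames, kernel_size, stride=1, padding=pad)
    detail = frames - structure
    return structure, detail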



Paperid:543
Authors:Shikun Liu, Zhe Lin, Yilin Wang, Jianming Zhang, Federico Perazzi, Edward Johns
Title: Shape Adaptor: A Learnable Resizing Module
Abstract:
We present a novel resizing module for neural networks: shape adaptor, a drop-in enhancement built on top of traditional resizing layers, such as pooling, bilinear sampling, and strided convolution. Whilst traditional resizing layers have fixed and deterministic reshaping factors, our module allows for a learnable reshaping factor. Our implementation enables shape adaptors to be trained end-to-end without any additional supervision, through which network architectures can be optimised for each individual task, in a fully automated way. We performed experiments across seven image classification datasets, and results show that by simply using a set of our shape adaptors instead of the original resizing layers, performance increases consistently over human-designed networks, across all datasets. Additionally, we show the effectiveness of shape adaptors on two other applications: network compression and transfer learning."
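
A minimal sketch of a learnable resizing layer in this spirit: two resizing branches with different scale factors are blended by a learnable weight, so the effective reshaping factor is trained end-to-end; the two-branch form, the 0.5-1.0 scale range, and the sigmoid parameterization are illustrative assumptions rather than the released module.

import torch
import torch.nn.functional as F

class ShapeAdaptorSketch(torch.nn.Module):
    """Blend a downsampling branch and an identity-scale branch with a
    learnable mixing weight, giving an effective reshaping factor
    between 0.5 and 1.0."""

    def __init__(self):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.zeros(1))  # trained end-to-end

    def forward(self, x):
        a = torch.sigmoid(self.alpha)
        scale = 0.5 * a + 1.0 * (1.0 - a)                # learned output scale
        size = [max(1, int(round(s * scale.item()))) for s in x.shape[-2:]]
        branch_a = F.interpolate(F.avg_pool2d(x, 2), size=size,
                                 mode='bilinear', align_corners=False)
        branch_b = F.interpolate(x, size=size,
                                 mode='bilinear', align_corners=False)
        return a * branch_a + (1.0 - a) * branch_b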



Paperid:544
Authors:Jinwoo Choi, Gaurav Sharma, Samuel Schulter, Jia-Bin Huang
Title: Shuffle and Attend: Video Domain Adaptation
Abstract:
We address the problem of domain adaptation in videos for the task of human action recognition. Inspired by image-based domain adaptation, we can perform video adaptation by aligning the features of frames or clips of source and target videos. However, equally aligning all clips is sub-optimal as not all clips are informative for the task. As the first novelty, we propose an attention mechanism which focuses on more discriminative clips and directly optimizes for video-level (cf. clip-level) alignment. As the backgrounds are often very different between source and target, the source background-corrupted model adapts poorly to target domain videos. To alleviate this, as a second novelty, we propose to use the clip order prediction as an auxiliary task. The clip order prediction loss, when combined with domain adversarial loss, encourages learning of representations which focus on the humans and objects involved in the actions, rather than the uninformative and widely differing (between source and target) backgrounds. We empirically show that both components contribute positively towards adaptation performance. We report state-of-the-art performances on two out of three challenging public benchmarks, two based on the UCF and HMDB datasets, and one on Kinetics to NEC-Drone datasets. We also support the intuitions and the results with qualitative results."



Paperid:545
Authors:Chen Gao, Jiarui Xu, Yuliang Zou, Jia-Bin Huang
Title: DRG: Dual Relation Graph for Human-Object Interaction Detection
Abstract:
We tackle the challenging problem of human-object interaction (HOI) detection. Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features. In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph (one human-centric and one object-centric). Our proposed dual relation graph effectively captures discriminative cues from the scene to resolve ambiguity from local predictions. Our model is conceptually simple and leads to favorable results when compared to the state-of-the-art HOI detection algorithms on two large-scale benchmark datasets."



Paperid:546
Authors:Chen Gao, Ayush Saraf, Jia-Bin Huang, Johannes Kopf
Title: Flow-edge Guided Video Completion
Abstract:
We present a new flow-based video completion algorithm. Previous flow completion methods are often unable to retain the sharpness of motion boundaries. Our method first extracts and completes motion edges, and then uses them to guide piecewise-smooth flow completion with sharp edges. Existing methods propagate colors among local flow connections between adjacent frames. However, not all missing regions in a video can be reached in this way because the motion boundaries form impenetrable barriers. Our method alleviates this problem by introducing non-local flow connections to temporally distant frames, enabling propagating video content over motion boundaries. We validate our approach on the DAVIS dataset. Both visual and quantitative results show that our method compares favorably against the state-of-the-art algorithms."



Paperid:547
Authors:Ali Hatamizadeh, Debleena Sengupta, Demetri Terzopoulos
Title: End-to-End Trainable Deep Active Contour Models for Automated Image Segmentation: Delineating Buildings in Aerial Imagery
Abstract:
The automated segmentation of buildings in remote sensing imagery is a challenging task that requires the accurate delineation of multiple building instances over typically large image areas. Manual methods are often laborious and current deep-learning-based approaches fail to delineate all building instances and do so with adequate accuracy. As a solution, we present Trainable Deep Active Contours (TDACs), an automatic image segmentation framework that intimately unites Convolutional Neural Networks (CNNs) and Active Contour Models (ACMs). The Eulerian energy functional of the ACM component includes per-pixel parameter maps that are predicted by the backbone CNN, which also initializes the ACM. Importantly, both the ACM and CNN components are fully implemented in TensorFlow and the entire TDAC architecture is end-to-end automatically differentiable and backpropagation trainable without user intervention. TDAC yields fast, accurate, and fully automatic simultaneous delineation of arbitrarily many buildings in the image. We validate the model on two publicly available aerial image datasets for building segmentation, and our results demonstrate that TDAC establishes a new state-of-the-art performance."



Paperid:548
Authors:Seonwook Park, Emre Aksan, Xucong Zhang, Otmar Hilliges
Title: Towards End-to-end Video-based Eye-Tracking
Abstract:
Estimating eye-gaze from images alone is a challenging task, in large parts due to un-observable person-specific factors. Achieving high accuracy typically requires labeled data from test users which may not be attainable in real applications. We observe that there exists a strong relationship between what users are looking at and the appearance of the user’s eyes. In response to this understanding, we propose a novel dataset and accompanying method which aims to explicitly learn these semantic and temporal relationships. Our video dataset consists of time-synchronized screen recordings, user-facing camera views, and eye gaze data, which allows for new benchmarks in temporal gaze tracking as well as label-free refinement of gaze. Importantly, we demonstrate that the fusion of information from visual stimuli as well as eye images can lead towards achieving performance similar to literature-reported figures acquired through supervised personalization. Our final method yields significant performance improvements on our proposed EVE dataset, with up to 28% improvement in Point-of-Gaze estimates (resulting in 2.49-deg in angular error), paving the path towards high-accuracy screen-based eye tracking purely from webcam sensors. The dataset and reference source code are available at https://ait.ethz.ch/projects/2020/EVE"



Paperid:549
Authors:Atsunobu Kotani, Stefanie Tellex, James Tompkin
Title: Generating Handwriting via Decoupled Style Descriptors
Abstract:
Representing a space of handwriting stroke styles includes the challenge of representing both the style of each character and the overall style of the human writer. Existing VRNN approaches to representing handwriting often do not distinguish between these different style components, which can reduce model capability. Instead, we introduce the Decoupled Style Descriptor (DSD) model for handwriting, which factors both character- and writer-level styles and allows our model to represent an overall greater space of styles. This approach also increases flexibility: given a few examples, we can generate handwriting in new writer styles, and can now also generate handwriting of new characters across writer styles. In experiments, our generated results were preferred over a state-of-the-art baseline method 88% of the time, and in a writer identification task on 20 held-out writers, our DSDs achieved 89.38% accuracy from a single sample word. Overall, DSDs allow us to improve both quality and flexibility over existing handwriting stroke generation approaches."



Paperid:550
Authors:Rongliang Wu, Shijian Lu
Title: LEED: Label-Free Expression Editing via Disentanglement
Abstract:
Recent studies on facial expression editing have made very promising progress. However, existing methods face the constraint of requiring a large amount of expression labels, which are often expensive and time-consuming to collect. This paper presents an innovative label-free expression editing via disentanglement (LEED) framework that is capable of editing the expression of both frontal and profile facial images without requiring any expression labels. The idea is to disentangle the identity and expression of a facial image in the expression manifold, where the neutral face captures the identity attribute and the displacement between the neutral image and the expressive image captures the expression attribute. Two novel losses are designed for optimal expression disentanglement and consistent synthesis, including a mutual expression information loss that aims to extract pure expression-related features and a siamese loss that aims to enhance the expression similarity between the synthesized image and the reference image. Extensive experiments over two public facial expression datasets show that LEED achieves superior facial expression editing qualitatively and quantitatively."



Paperid:551
Authors:Xuewen Yang, Heming Zhang, Di Jin, Yingru Liu, Chi-Hao Wu, Jianchao Tan, Dongliang Xie, Jue Wang, Xin Wang
Title: Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards
Abstract:
Generating accurate descriptions for online fashion items is important not only for enhancing customers' shopping experiences, but also for increasing online sales. Besides the need to correctly present the attributes of items, expressions in an enchanting style can better attract customer interest. The goal of this work is to develop a novel learning framework for accurate and expressive fashion captioning. Different from popular work on image captioning, it is hard to identify and describe the rich attributes of fashion items. We seed the description of an item by first identifying its attributes, and introduce attribute-level semantic (ALS) reward and sentence-level semantic (SLS) reward as metrics to improve the quality of text descriptions. We further integrate the training of our model with maximum likelihood estimation (MLE), attribute embedding, and Reinforcement Learning (RL). To facilitate the learning, we build a new FAshion CAptioning Dataset (FACAD), which contains 800K images and 120K corresponding enchanting and diverse descriptions. Experiments on FACAD demonstrate the effectiveness of our model."



Paperid:552
Authors:Gouthaman KV, Anurag Mittal
Title: Reducing Language Biases in Visual Question Answering with Visually-Grounded Question Encoder
Abstract:
Recent studies have shown that current VQA models are heavily biased on the language priors in the train set to answer the question, irrespective of the image. For example, they overwhelmingly answer "what sport is" with "tennis" or "what color banana" with "yellow." This behavior restricts them from real-world application scenarios. In this work, we propose a novel model-agnostic question encoder, Visually-Grounded Question Encoder (VGQE), for VQA that reduces this effect. VGQE utilizes both visual and language modalities equally while encoding the question. Hence the question representation itself gets sufficient visual grounding, and thus reduces the dependency of the model on the language priors. We demonstrate the effect of VGQE on three recent VQA models and achieve state-of-the-art results on the bias-sensitive split of the VQAv2 dataset, VQA-CPv2. Further, unlike the existing bias-reduction techniques, on the standard VQAv2 benchmark, our approach does not drop the accuracy; instead, it improves the performance."



Paperid:553
Authors:Jogendra Nath Kundu, Ambareesh Revanur, Govind Vitthal Waghmare, Rahul Mysore Venkatesh, R. Venkatesh Babu
Title: Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation
Abstract:
We present a deployment friendly, fast bottom-up framework for multi-person 3D human pose estimation. We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation. This is realized by learning a generative pose embedding which not only ensures plausible 3D pose predictions, but also eliminates the usual keypoint grouping operation as employed in prior bottom-up approaches. Further, we propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable. In the absence of any paired supervision, we leverage a frozen network, as a teacher model, which is trained on an auxiliary task of multi-person 2D pose estimation. We cast the learning as a cross-modal alignment problem and propose training objectives to realize a shared latent space between two diverse modalities. We aim to enhance the model's ability to perform beyond the limiting teacher network by enriching the latent-to-3D pose mapping using artificially synthesized multi-person 3D scene samples. Our approach not only generalizes to in-the-wild images, but also yields a superior trade-off between speed and performance, compared to prior top-down approaches. Our approach also yields state-of-the-art multi-person 3D pose estimation performance among the bottom-up approaches under consistent supervision levels."



Paperid:554
Authors:Jogendra Nath Kundu, Rahul Mysore Venkatesh, Naveen Venkat, Ambareesh Revanur, R. Venkatesh Babu
Title: Class-Incremental Domain Adaptation
Abstract:
We introduce a practical Domain Adaptation (DA) paradigm called Class-Incremental Domain Adaptation (CIDA). Existing DA methods tackle domain-shift but are unsuitable for learning novel target-domain classes. Meanwhile, class-incremental (CI) methods enable learning of new classes in the absence of source training data, but fail under a domain-shift without labeled supervision. In this work, we effectively identify the limitations of these approaches in the CIDA paradigm. Motivated by theoretical and empirical observations, we propose an effective method, inspired by prototypical networks, that enables classification of target samples into both shared and novel (one-shot) target classes, even under a domain-shift. Our approach yields superior performance as compared to both DA and CI methods in the CIDA paradigm."



Paperid:555
Authors:Hanlin Chen, Baochang Zhang, Song Xue, Xuan Gong, Hong Liu, Rongrong Ji, David Doermann
Title: Anti-Bandit Neural Architecture Search for Model Defense
Abstract:
Deep convolutional neural networks (DCNNs) have dominated as the best performers in machine learning, but can be challenged by adversarial attacks. In this paper, we defend against adversarial attacks using neural architecture search (NAS), based on a comprehensive search over denoising blocks, weight-free operations, Gabor filters and convolutions. The resulting anti-bandit NAS (ABanditNAS) incorporates a new operation evaluation measure and search process based on the lower and upper confidence bounds (LCB and UCB). Unlike the conventional bandit algorithm using UCB for evaluation only, we use UCB to abandon arms for search efficiency and LCB for a fair competition between arms. Extensive experiments demonstrate that ABanditNAS is about twice as fast as the state-of-the-art NAS method, while achieving an 8.73% improvement over prior art on CIFAR-10 under PGD-7."
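As an illustration of the LCB/UCB bookkeeping the abstract describes, here is a minimal sketch in Python. It is not the authors' implementation: the class name AntiBanditArms, the reward values, the exploration coefficient, and the abandonment schedule are all illustrative stand-ins; only the general rule (abandon arms by UCB, compare the survivors by LCB) follows the abstract.

```python
import math
import random

class AntiBanditArms:
    """Toy LCB/UCB bookkeeping for candidate operations ('arms')."""

    def __init__(self, num_arms, explore_coeff=2.0):
        self.counts = [0] * num_arms          # times each arm was evaluated
        self.values = [0.0] * num_arms        # running mean reward per arm
        self.c = explore_coeff

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n  # incremental mean

    def _bonus(self, arm, total):
        if self.counts[arm] == 0:
            return float("inf")
        return self.c * math.sqrt(math.log(total) / self.counts[arm])

    def ucb(self, arm):
        total = max(1, sum(self.counts))
        return self.values[arm] + self._bonus(arm, total)

    def lcb(self, arm):
        total = max(1, sum(self.counts))
        return self.values[arm] - self._bonus(arm, total)

    def abandon_worst(self, active_arms):
        """Drop the active arm whose optimistic (UCB) value is lowest."""
        worst = min(active_arms, key=self.ucb)
        return [a for a in active_arms if a != worst]


# toy usage: 6 candidate operations, rewards are random stand-ins
arms = AntiBanditArms(num_arms=6)
active = list(range(6))
for step in range(60):
    arm = random.choice(active)
    arms.update(arm, reward=random.random())
    if step % 20 == 19 and len(active) > 2:
        active = arms.abandon_worst(active)     # prune by UCB
best = max(active, key=arms.lcb)                # final comparison by LCB
print("surviving arms:", active, "picked:", best)
```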



Paperid:556
Authors:Lin Liu, Jianzhuang Liu, Shanxin Yuan, Gregory Slabaugh, Aleš Leonardis, Wengang Zhou, Qi Tian
Title: Wavelet-Based Dual-Branch Network for Image Demoiréing
Abstract:
When smartphone cameras are used to take photos of digital screens, moire patterns usually result, severely degrading photo quality. In this paper, we design a wavelet-based dual-branch network (WDNet) with a spatial attention mechanism for image demoireing. Existing image restoration methods working in the RGB domain have difficulty in distinguishing moire patterns from true scene texture. Unlike these methods, our network removes moire patterns in the wavelet domain to separate the frequencies of moire patterns from the image content. The network combines dense convolution modules and dilated convolution modules supporting large receptive fields. Extensive experiments demonstrate the effectiveness of our method, and we further show that WDNet generalizes to removing moire artifacts on non-screen images. Although designed for image demoireing, WDNet has been applied to two other low-level vision tasks, outperforming state-of-the-art image deraining and derain-drop methods on the Rain100h and Raindrop800 datasets, respectively."



Paperid:557
Authors:Danai Triantafyllidou, Sean Moran, Steven McDonagh, Sarah Parisot, Gregory Slabaugh
Title: Low Light Video Enhancement using Synthetic Data Produced with an Intermediate Domain Mapping
Abstract:
Advances in low-light video RAW-to-RGB translation are opening up the possibility of fast low-light imaging on commodity devices (e.g. smartphone cameras) without the need for a tripod. However, it is challenging to collect the required paired short-long exposure frames to learn a supervised mapping. Current approaches require a specialised rig or the use of static videos with no subject or object motion [5], resulting in datasets that are limited in size, diversity, and motion. We address the data collection bottleneck for low-light video RAW-to-RGB by proposing a data synthesis mechanism, dubbed SIDGAN, that can generate abundant dynamic video training pairs. SIDGAN maps videos found ‘in the wild’ (e.g. internet videos) into a low-light (short, long exposure) domain. By generating dynamic video data synthetically, we enable a recently proposed state-of-the-art RAW-to-RGB model to attain higher image quality (improved colour, reduced artifacts) and improved temporal consistency, compared to the same model trained with only static real video data."



Paperid:558
Authors:Jinsun Park, Kyungdon Joo, Zhe Hu, Chi-Kuei Liu, In So Kweon
Title: Non-Local Spatial Propagation Network for Depth Completion
Abstract:
In this paper, we propose a robust and efficient end-to-end non-local spatial propagation network for depth completion. The proposed network takes RGB and sparse depth images as inputs and estimates non-local neighbors and their affinities of each pixel, as well as an initial depth map with pixel-wise confidences. The initial depth prediction is then iteratively refined by its confidence and non-local spatial propagation procedure based on the predicted non-local neighbors and corresponding affinities. Unlike previous algorithms that utilize fixed-local neighbors, the proposed algorithm effectively avoids irrelevant local neighbors and concentrates on relevant non-local neighbors during propagation. In addition, we introduce a learnable affinity normalization to better learn the affinity combinations compared to conventional methods. The proposed algorithm is inherently robust to the mixed-depth problem on depth boundaries, which is one of the major issues for existing depth estimation/completion algorithms. Experimental results on indoor and outdoor datasets demonstrate that the proposed algorithm is superior to conventional algorithms in terms of depth completion accuracy and robustness to the mixed-depth problem."
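A minimal sketch of one confidence-gated, non-local propagation pass may help make the idea concrete. This is not the paper's implementation: the integer-offset neighbor lookup, the softmax affinity normalization, the fixed iteration count, and all tensor names are simplifying assumptions.

```python
import torch

def propagate_depth(depth, offsets, affinity, confidence, num_iters=6):
    """depth:      (B, 1, H, W) current depth estimate
    offsets:    (B, K, 2, H, W) (dy, dx) offsets of K non-local neighbors per pixel
    affinity:   (B, K, H, W) raw affinity logits for those neighbors
    confidence: (B, 1, H, W) confidence of the initial prediction in [0, 1]"""
    B, _, H, W = depth.shape
    K = offsets.shape[1]
    ys = torch.arange(H, device=depth.device).view(1, 1, H, 1)
    xs = torch.arange(W, device=depth.device).view(1, 1, 1, W)
    w = torch.softmax(affinity, dim=1)                    # convex neighbor weights
    init = depth.clone()
    for _ in range(num_iters):
        gathered = []
        for k in range(K):
            ny = (ys + offsets[:, k, 0:1].round().long()).clamp(0, H - 1)
            nx = (xs + offsets[:, k, 1:2].round().long()).clamp(0, W - 1)
            idx = (ny * W + nx).reshape(B, 1, -1)         # flat indices of neighbor k
            gathered.append(
                torch.gather(depth.reshape(B, 1, -1), 2, idx).reshape(B, 1, H, W))
        neighbors = torch.cat(gathered, dim=1)            # (B, K, H, W)
        depth = (w * neighbors).sum(dim=1, keepdim=True)  # affinity-weighted average
        depth = confidence * init + (1.0 - confidence) * depth  # keep confident pixels
    return depth
```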



Paperid:559
Authors:Lvmin Zhang, Yi JI, Chunping Liu
Title: DanbooRegion: An Illustration Region Dataset
Abstract:
Region is a fundamental element of various cartoon animation techniques and artistic painting applications. Achieving satisfactory regions is essential to the success of these techniques. Motivated to assist diversiform region-based cartoon applications, we invite artists to annotate regions for in-the-wild cartoon images with several application-oriented goals: (1) To assist image-based cartoon rendering, relighting, and cartoon intrinsic decomposition literature, artists identify object outlines and eliminate lighting-and-shadow boundaries. (2) To assist cartoon inking tools, cartoon structure extraction applications, and cartoon texture processing techniques, artists clean up texture or deformation patterns and emphasize cartoon structural boundary lines. (3) To assist region-based cartoon digitalization, clip-art vectorization, and animation tracking applications, artists inpaint and reconstruct broken or blurred regions in cartoon images. Given the typicality of these involved applications, this dataset is also likely to be used in other cartoon techniques. We detail the challenges in achieving this dataset and present a human-in-the-loop workflow, namely Feasibility-based Assignment Recommendation (FAR), to enable large-scale annotation. FAR tends to reduce artists' trial-and-error and encourage their enthusiasm during annotation. Finally, we present a dataset that contains a large number of artistic region compositions paired with corresponding cartoon illustrations. We also invite multiple professional artists to assure the quality of each annotation."



Paperid:560
Authors:Bishan Wang, Jingwei He, Lei Yu, Gui-Song Xia, Wen Yang
Title: Event Enhanced High-Quality Image Recovery
Abstract:
With extremely high temporal resolution, event cameras have a large potential for robotics and computer vision. However, their asynchronous imaging mechanism often aggravates the measurement sensitivity to noise and imposes a physical burden on increasing the image spatial resolution. To recover high-quality intensity images, one should address both denoising and super-resolution problems for event cameras. Since events depict brightness changes, clear and sharp high-resolution latent images can be recovered from the noisy, blurry and low-resolution intensity observations with a degeneration model enhanced by the events. Exploiting the framework of sparse learning, the events and the low-resolution intensity observations can be jointly considered. Based on this, we propose an explainable network, an event-enhanced sparse learning network (eSL-Net), to recover high-quality images from event cameras. After training with a synthetic dataset, the proposed eSL-Net can improve the performance of the state-of-the-art by 7-12 dB. Furthermore, without an additional training process, the proposed eSL-Net can be easily extended to generate continuous frames with a frame rate as high as that of the events."



Paperid:561
Authors:Kun Ding, Guojin He, Huxiang Gu, Zisha Zhong, Shiming Xiang, Chunhong Pan
Title: PackDet: Packed Long-Head Object Detector
Abstract:
State-of-the-art object detectors exploit a multi-branch structure and predict objects at several different scales; although this substantially boosts accuracy, low efficiency is inevitable because the fragmented structure is hardware-unfriendly. To solve this issue, we propose a packing operator (PackOp) to combine all head branches together spatially. Packed features are computationally more efficient and readily allow the use of cross-head group normalization (GN), leading to a notable accuracy improvement over the common head-separate GN. All of this comes only at the cost of a less than 5.7% relative increase in runtime memory and the introduction of a few noisy training samples, whose side effects can be diminished by good packing-pattern design. With PackOp, we propose a new anchor-free one-stage detector, PackDet, which features a single deeper/longer but narrower head, in contrast to the multiple shallow but wide heads of existing methods. Our best models on COCO test-dev achieve a better speed-accuracy balance: 35.1%, 42.3%, 44.0%, 47.4% AP with 22.6, 16.9, 12.4, 4.7 FPS using MobileNet-v2, ResNet-50, ResNet-101, and ResNeXt-101-DCN backbones, respectively. Codes will be released."
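The packing idea can be sketched as follows: per-scale head features are placed onto one canvas so a single shared head with cross-head GroupNorm runs once over all branches. This is only a naive left-to-right packing for illustration; the paper's PackOp relies on carefully designed packing patterns, which this sketch does not reproduce.

```python
import torch
import torch.nn as nn

def pack_heads(feature_maps):
    """Pack per-scale head features (same channel count C, different H, W)
    side by side into one tensor so one shared head and one cross-head
    GroupNorm can run over all of them at once."""
    B, C = feature_maps[0].shape[:2]
    H = max(f.shape[2] for f in feature_maps)
    W = sum(f.shape[3] for f in feature_maps)
    canvas = feature_maps[0].new_zeros(B, C, H, W)
    x0, slots = 0, []
    for f in feature_maps:
        h, w = f.shape[2], f.shape[3]
        canvas[:, :, :h, x0:x0 + w] = f
        slots.append((0, h, x0, x0 + w))    # remember where to unpack from
        x0 += w
    return canvas, slots

def unpack_heads(canvas, slots):
    return [canvas[:, :, y0:y1, x0:x1] for (y0, y1, x0, x1) in slots]

# usage: one conv + GN over the packed canvas instead of one per branch
feats = [torch.rand(2, 64, s, s) for s in (40, 20, 10)]
packed, slots = pack_heads(feats)
head = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.GroupNorm(32, 64))
outs = unpack_heads(head(packed), slots)
```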



Paperid:562
Authors:Xuefei Ning, Yin Zheng, Tianchen Zhao, Yu Wang, Huazhong Yang
Title: A Generic Graph-based Neural Architecture Encoding Scheme for Predictor-based NAS
Abstract:
This work proposes a novel Graph-based neural ArchiTecture Encoding Scheme, a.k.a. GATES, to improve the predictor-based neural architecture search. Specifically, different from existing graph-based schemes, GATES models the operations as the transformation of the propagating information, which mimics the actual data processing of neural architecture. GATES is a more reasonable modeling of the neural architectures, and can encode architectures from both the "operation on node" and "operation on edge" cell search spaces consistently. Experimental results on various search spaces confirm GATES's effectiveness in improving the performance predictor. Furthermore, equipped with the improved performance predictor, the sample efficiency of the predictor-based neural architecture search (NAS) flow is boosted."



Paperid:563
Authors:Ruyi Ji, Dawei Du, Libo Zhang, Longyin Wen, Yanjun Wu, Chen Zhao, Feiyue Huang, Siwei Lyu
Title: Learning Semantic Neural Tree for Human Parsing
Abstract:
In this paper, we design a novel semantic neural tree for human parsing, which uses a tree architecture to encode the physiological structure of the human body, and design a coarse-to-fine process in a cascade manner to generate accurate results. Specifically, the semantic neural tree is designed to segment human regions into multiple semantic sub-regions (e.g., face, arms, and legs) in a hierarchical way using a newly designed attention routing module. Meanwhile, we introduce the semantic aggregation module to combine multiple hierarchical features to exploit more context information for better performance. Our semantic neural tree can be trained in an end-to-end fashion by standard stochastic gradient descent (SGD) with back-propagation. Several experiments conducted on four challenging datasets for both single and multiple human parsing, i.e., LIP, PASCAL-Person-Part, CIHP and MHP-v2, demonstrate the effectiveness of the proposed method. Code can be found at https://isrc.iscas.ac.cn/gitlab/research/sematree."



Paperid:564
Authors:Wenbin Wang, Ruiping Wang, Shiguang Shan, Xilin Chen
Title: Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation
Abstract:
Scene graphs aim to faithfully reveal humans' perception of image content. When humans analyze a scene, they usually prefer to describe the image gist first, namely the major objects and key relations in a scene graph. This inherent perceptive habit implies that there exists a hierarchical structure in humans' preference during the scene parsing procedure. Therefore, we argue that a desirable scene graph should also be hierarchically constructed, and introduce a new scheme for modeling scene graphs. Concretely, a scene is represented by a human-mimetic Hierarchical Entity Tree (HET) consisting of a series of image regions. To generate a scene graph based on HET, we parse HET with a Hybrid Long Short-Term Memory (Hybrid-LSTM) which specifically encodes hierarchy and sibling context to capture the structured information embedded in HET. To further prioritize key relations in the scene graph, we devise a Relation Ranking Module (RRM) to dynamically adjust their rankings by learning to capture humans' subjective perceptive habits from objective entity saliency and size. Experiments indicate that our method not only achieves state-of-the-art performance for scene graph generation, but also excels at mining image-specific relations, which play a great role in serving downstream tasks."



Paperid:565
Authors:Xuejian Rong, Denis Demandolx, Kevin Matzen, Priyam Chatterjee, Yingli Tian
Title: Burst Denoising via Temporally Shifted Wavelet Transforms
Abstract:
Mobile photography has made great strides in recent years. However, low light imaging still remains a challenge. Long exposures can improve signal-to-noise ratio (SNR) but undesirable motion blur can occur when capturing dynamic scenes. As a result, imaging pipelines often rely on computational photography to improve SNR by fusing multiple short exposures. Recent deep neural network-based methods have been shown to generate visually pleasing results by fusing these exposures in a sophisticated manner, but often at a higher computational cost. We propose an end-to-end trainable burst denoising pipeline which jointly captures high-resolution and high-frequency deep features derived from wavelet transforms. In our model, precious local details are preserved in high-frequency sub-band features to enhance the final perceptual quality, while the low-frequency sub-band features carry structural information for faithful reconstruction and final objective quality. The model is designed to accommodate variable-length burst captures via temporal feature shifting while incurring only marginal computational overhead. Lastly, we train our model with a realistic noise model for the generalization to real environments. Using these techniques, our method attains state-of-the-art performance on perceptual quality, while being an order of magnitude faster."



Paperid:566
Authors:Fengze Liu, Jinzheng Cai, Yuankai Huo, Chi-Tung Cheng, Ashwin Raju, Dakai Jin, Jing Xiao, Alan Yuille, Le Lu, ChienHung Liao, Adam P. Harrison
Title: JSSR: A Joint Synthesis, Segmentation, and Registration System for 3D Multi-Modal Image Alignment of Large-scale Pathological CT Scans
Abstract:
Multi-modal image registration is a challenging problem that is also an important clinical task for many real applications and scenarios. As a first step in analysis, deformable registration among different image modalities is often required in order to provide complementary visual information. During registration, semantic information is key to match homologous points and pixels. Nevertheless, many conventional registration methods are incapable of capturing high-level semantic anatomical dense correspondences. In this work, we propose a novel multi-task learning system, JSSR, based on an end-to-end 3D convolutional neural network that is composed of a generator, a registration and a segmentation component. The system is optimized to satisfy the implicit constraints between different tasks in an unsupervised manner. It first synthesizes the source domain images into the target domain, then an intra-modal registration is applied on the synthesized images and target images. The segmentation module is then applied to the synthesized and target images, providing additional cues based on semantic correspondences. The supervision from another fully-annotated dataset is used to regularize the segmentation. We extensively evaluate JSSR on a large-scale medical image dataset containing 1,485 patient CT imaging studies of four different contrast phases (i.e., 5,940 3D CT scans with pathological livers) on the registration, segmentation and synthesis tasks. The performance is improved after joint training on the registration and segmentation tasks by 0.9% and 1.9% respectively compared to a highly competitive and accurate deep learning baseline. The registration also consistently outperforms conventional state-of-the-art multi-modal registration methods."



Paperid:567
Authors:Junwei Liang, Lu Jiang, Alexander Hauptmann
Title: SimAug: Learning Robust Representations from Simulation for Trajectory Prediction
Abstract:
This paper studies the problem of predicting future trajectories of people in unseen cameras of novel scenarios and views. We approach this problem through the real-data-free setting in which the model is trained only on 3D simulation data and applied out-of-the-box to a wide variety of real cameras. We propose a novel approach to learn robust representations by augmenting the simulation training data such that the representations can better generalize to unseen real-world test data. The key idea is to mix the feature of the hardest camera view with the adversarial feature of the original view. We refer to our method as SimAug. We show that SimAug achieves promising results on three real-world benchmarks using zero real training data, and state-of-the-art performance on the Stanford Drone and VIRAT/ActEV datasets when using in-domain training data. Code and models are released at https://next.cs.cmu.edu/simaug"



Paperid:568
Authors:Bowen Chen, Huan Ling, Xiaohui Zeng, Jun Gao, Ziyue Xu, Sanja Fidler
Title: ScribbleBox: Interactive Annotation Framework for Video Object Segmentation
Abstract:
Manually labeling video datasets for segmentation tasks is extremely time consuming. We introduce ScribbleBox, an interactive framework for annotating object instances with masks in videos with a significant boost in efficiency. In particular, we split annotation into two steps: annotating objects with tracked boxes, and labeling masks inside these tracks. We introduce automation and interaction in both steps. Box tracks are annotated efficiently by approximating the trajectory using a parametric curve with a small number of control points which the annotator can interactively correct. Our approach tolerates a modest amount of noise in box placements, thus typically requiring only a few clicks to annotate a track to a sufficient accuracy. Segmentation masks are corrected via scribbles which are propagated through time. We show significant performance gains in annotation efficiency over past work. We show that our ScribbleBox approach reaches 88.92% J&F on DAVIS2017 with an average of 9.14 clicks per box track, and only 4 frames requiring scribble annotation in a video of 65.3 frames on average."



Paperid:569
Authors:Xinzhu Ma, Shinan Liu, Zhiyi Xia, Hongwen Zhang, Xingyu Zeng, Wanli Ouyang
Title: Rethinking Pseudo-LiDAR Representation
Abstract:
The recently proposed pseudo-LiDAR based 3D detectors greatly improve the benchmark of monocular/stereo 3D detection tasks. However, the underlying mechanism remains obscure to the research community. In this paper, we perform an in-depth investigation and observe that the pseudo-LiDAR representation is effective because of the coordinate transformation, rather than the data representation itself. Based on this observation, we design an image-based CNN detector named PatchNet, which is a generalized version that can represent pseudo-LiDAR based 3D detectors. In PatchNet, we organize pseudo-LiDAR data as the image representation, which means existing 2D CNN designs can be easily utilized for extracting deep features from input data and boosting 3D detection performance. We conduct extensive experiments on the challenging KITTI dataset, where the proposed PatchNet outperforms all existing pseudo-LiDAR based counterparts. Code has been made available at: https://github.com/xinzhuma/patchnet"
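Since the abstract attributes the gains to the coordinate transformation, a short sketch of that transformation may be useful: back-projecting a depth map into an (H, W, 3) map of 3D points that stays organized on the image grid, so that ordinary 2D CNNs can consume it. The intrinsics and array names below are made up for illustration; this is the standard pinhole back-projection, not PatchNet's code.

```python
import numpy as np

def depth_to_pseudo_lidar_image(depth, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into an (H, W, 3) map of 3D points.

    Keeping the points organized on the image grid (rather than as an
    unordered point cloud) is what lets 2D CNNs process them."""
    H, W = depth.shape
    us, vs = np.meshgrid(np.arange(W), np.arange(H))   # pixel coordinates
    z = depth
    x = (us - cx) * z / fx                             # pinhole back-projection
    y = (vs - cy) * z / fy
    return np.stack([x, y, z], axis=-1).astype(np.float32)

# toy usage with made-up intrinsics
depth = np.random.uniform(1.0, 80.0, size=(96, 320)).astype(np.float32)
xyz_map = depth_to_pseudo_lidar_image(depth, fx=720.0, fy=720.0, cx=160.0, cy=48.0)
print(xyz_map.shape)   # (96, 320, 3)
```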



Paperid:570
Authors:Kai-En Lin, Zexiang Xu, Ben Mildenhall, Pratul P. Srinivasan, Yannick Hold-Geoffroy, Stephen DiVerdi, Qi Sun, Kalyan Sunkavalli, Ravi Ramamoorthi
Title: Deep Multi Depth Panoramas for View Synthesis
Abstract:
We propose a learning-based approach for novel view synthesis for multi-camera 360$^



Paperid:571
Authors:Fa-Ting Hong, Xuanteng Huang, Wei-Hong Li, Wei-Shi Zheng
Title: MINI-Net: Multiple Instance Ranking Network for Video Highlight Detection
Abstract:
We address the weakly supervised video highlight detection problem, learning to detect segments that are more attractive in training videos given their video event label but without expensive supervision of manually annotated highlight segments. While it avoids manually localizing highlight segments, weakly supervised modeling is challenging, as a video in our daily life could contain highlight segments with multiple event types, e.g., skiing and surfing. In this work, we propose to cast weakly supervised video highlight detection for a given specific event as learning a multiple instance ranking network (MINI-Net). We consider each video as a bag of segments, and therefore, the proposed MINI-Net learns to enforce a higher highlight score for a positive bag that contains highlight segments of a specific event than for negative bags that are irrelevant. In particular, we form a max-max ranking loss to acquire a reliable relative comparison between the most likely positive segment instance and the hardest negative segment instance. With this max-max ranking loss, our MINI-Net effectively leverages all segment information to acquire a more distinct video feature representation for localizing the highlight segments of a specific event in a video. The extensive experimental results on three challenging public benchmarks clearly validate the efficacy of our multiple instance ranking approach for solving the problem."
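The max-max ranking loss described above can be sketched in a few lines. This is only an illustration of the idea (a hinge between the best-scoring segment of a positive bag and the hardest segment over the negative bags); the margin value and function names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def max_max_ranking_loss(pos_bag_scores, neg_bag_scores, margin=1.0):
    """pos_bag_scores: (Np,) highlight scores of segments in one positive bag.
    neg_bag_scores: list of (Ni,) tensors, one per negative bag.
    Hinge between the most likely positive segment and the hardest negative
    segment over all negative bags."""
    pos_best = pos_bag_scores.max()
    hardest_neg = torch.stack([s.max() for s in neg_bag_scores]).max()
    return F.relu(margin - pos_best + hardest_neg)

# toy usage
pos = torch.randn(8)
negs = [torch.randn(8), torch.randn(8), torch.randn(8)]
print(max_max_ranking_loss(pos, negs))
```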



Paperid:572
Authors:Samarth Brahmbhatt, Chengcheng Tang, Christopher D. Twigg, Charles C. Kemp, James Hays
Title: ContactPose: A Dataset of Grasps with Object Contact and Hand Pose
Abstract:
Grasping is natural for humans. However, it involves complex hand configurations and soft tissue deformation that can result in complicated regions of contact between the hand and the object. Understanding and modeling this contact can potentially improve hand models, AR/VR experiences, and robotic grasping. Yet, we currently lack datasets of hand-object contact paired with other data modalities, which is crucial for developing and evaluating contact modeling techniques. We introduce ContactPose, the first dataset of hand-object contact paired with hand pose, object pose, and RGB-D images. ContactPose has 2306 unique grasps of 25 household objects grasped with 2 functional intents by 50 participants, and more than 2.9 M RGB-D grasp images. Analysis of ContactPose data reveals interesting relationships between hand pose and contact. We use this data to rigorously evaluate various data representations, heuristics from the literature, and learning methods for contact modeling. Data, code, and trained models are available at https://contactpose.cc.gatech.edu."



Paperid:573
Authors:Xinshuai Dong, Hong Liu, Rongrong Ji, Liujuan Cao, Qixiang Ye, Jianzhuang Liu, Qi Tian
Title: API-Net: Robust Generative Classifier via a Single Discriminator
Abstract:
Robustness of deep neural network classifiers has been attracting increased attention. As for the robust classification problem, a generative classifier typically models the distribution of inputs and labels, and thus can better handle off-manifold examples at the cost of a concise structure. On the contrary, a discriminative classifier only models the conditional distribution of labels given inputs, but benefits from effective optimization owing to its succinct structure. This work aims for a solution of generative classifiers that can profit from the merits of both. To this end, we propose an Anti-Perturbation Inference (API) method, which searches for anti-perturbations to maximize the lower bound of the joint log-likelihood of inputs and classes. By leveraging the lower bound to approximate Bayes' rule, we construct a generative classifier Anti-Perturbation Inference Net (API-Net) upon a single discriminator. It takes advantage of the generative properties to tackle off-manifold examples while maintaining a succinct structure for effective optimization. Experiments show that API successfully neutralizes adversarial perturbations, and API-Net consistently outperforms state-of-the-art defenses on prevailing benchmarks, including CIFAR-10, MNIST, and SVHN. "
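To make the idea of anti-perturbation inference concrete, the sketch below searches, for each class, a small input perturbation that maximizes that class's score and then predicts with the maximized scores. This captures only the general flavor: API-Net maximizes a specific variational lower bound on the joint log-likelihood, not the raw logit used here, and the step size, iteration count, and epsilon are illustrative.

```python
import torch

def anti_perturbation_scores(model, x, num_classes, steps=5, step_size=0.01, eps=0.03):
    """For every candidate class, search a small perturbation of the input that
    maximizes that class's score, then predict with the maximized scores.
    NOTE: only the general flavor of anti-perturbation inference; the actual
    objective in API-Net is a lower bound on log p(x, y), not a raw logit."""
    scores = []
    for y in range(num_classes):
        delta = torch.zeros_like(x, requires_grad=True)
        for _ in range(steps):
            logit_y = model(x + delta)[:, y].sum()
            grad, = torch.autograd.grad(logit_y, delta)
            # projected gradient ascent on the class score
            delta = (delta + step_size * grad.sign()).clamp(-eps, eps)
            delta = delta.detach().requires_grad_(True)
        with torch.no_grad():
            scores.append(model(x + delta)[:, y])
    return torch.stack(scores, dim=1)  # (B, num_classes); argmax gives the prediction
```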



Paperid:574
Authors:Aishan Liu, Jiakai Wang, Xianglong Liu, Bowen Cao, Chongzhi Zhang, Hang Yu
Title: Bias-based Universal Adversarial Patch Attack for Automatic Check-out
Abstract:
Adversarial examples are inputs with imperceptible perturbations that easily mislead deep neural networks (DNNs). Recently, the adversarial patch, with noise confined to a small and localized region, has emerged for its easy feasibility in real-world scenarios. However, existing strategies fail to generate adversarial patches with strong generalization ability. In other words, the adversarial patches are input-specific and fail to attack images from all classes, especially ones unseen during training. To address the problem, this paper proposes a bias-based framework to generate class-agnostic universal adversarial patches with strong generalization ability, which exploits both the perceptual and semantic bias of models. Regarding the perceptual bias, since DNNs are strongly biased towards textures, we exploit hard examples which convey strong model uncertainties and extract a textural patch prior from them by adopting style similarities. The patch prior is closer to decision boundaries and thus promotes attacks. To further alleviate the heavy dependency on large amounts of data in training universal attacks, we also exploit the semantic bias. As class-wise preferences, prototypes are introduced and pursued by maximizing the multi-class margin to aid universal training. Taking Automatic Check-out (ACO) as the typical scenario, extensive experiments including white-box and black-box settings in both digital-world (RPC, the largest ACO-related dataset) and physical-world scenarios (Taobao and JD, the world’s largest online shopping platforms) are conducted. Experimental results demonstrate that our proposed framework outperforms state-of-the-art adversarial patch attack methods."



Paperid:575
Authors:Chris Dongjoo Kim, Jinseo Jeong, Gunhee Kim
Title: Imbalanced Continual Learning with Partitioning Reservoir Sampling
Abstract:
Continual learning from a sequential stream of data is a crucial challenge for machine learning research. Most studies have been conducted on this topic under the single-label classification setting along with an assumption of balanced label distribution. This work expands this research horizon towards multi-label classification. In doing so, we identify an unanticipated adversity innately present in many multi-label datasets: the long-tailed distribution. We jointly address the two independently studied problems, Catastrophic Forgetting and the long-tailed label distribution, by first empirically showing a new challenge of destructive forgetting of the minority concepts on the tail. Then, we curate two benchmark datasets, COCOseq and NUS-WIDEseq, that allow the study of both intra- and inter-task imbalances. Lastly, we propose a new sampling strategy for the replay-based approach named Partitioning Reservoir Sampling (PRS), which allows the model to maintain a balanced knowledge of both head and tail classes. We publicly release the dataset and the code on our project page."
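A simplified sketch of the sampling idea: the replay memory is split into per-label partitions and classic reservoir sampling runs inside each partition, so tail labels keep a guaranteed share of slots. The actual PRS allocation rule (how partition sizes are computed and how multi-label samples are assigned) is more involved than this; all names below are illustrative.

```python
import random
from collections import defaultdict

class PartitionedReservoir:
    """Replay memory split into per-label partitions, each maintained with
    standard reservoir sampling (a simplified stand-in for PRS)."""

    def __init__(self, total_size, labels):
        per = max(1, total_size // len(labels))      # naive equal allocation
        self.capacity = {l: per for l in labels}
        self.buffers = defaultdict(list)
        self.seen = defaultdict(int)

    def add(self, sample, label):
        self.seen[label] += 1
        buf, cap = self.buffers[label], self.capacity[label]
        if len(buf) < cap:
            buf.append(sample)
        else:
            # classic reservoir rule: keep with probability cap / seen
            j = random.randrange(self.seen[label])
            if j < cap:
                buf[j] = sample

    def replay_batch(self, k):
        pool = [s for buf in self.buffers.values() for s in buf]
        return random.sample(pool, min(k, len(pool)))

# toy usage: tail label 'b' keeps its slots even though 'a' dominates the stream
mem = PartitionedReservoir(total_size=20, labels=["a", "b"])
for i in range(1000):
    mem.add(i, "a" if i % 10 else "b")
print(len(mem.buffers["a"]), len(mem.buffers["b"]))
```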



Paperid:576
Authors:Zhanghan Ke, Di Qiu, Kaican Li, Qiong Yan, Rynson W.H. Lau
Title: Guided Collaborative Training for Pixel-wise Semi-Supervised Learning
Abstract:
We investigate the generalization of semi-supervised learning (SSL) to diverse pixel-wise tasks. Although SSL methods have achieved impressive results in image classification, their performance on pixel-wise tasks is unsatisfactory due to the need for dense outputs. In addition, existing pixel-wise SSL approaches are only suitable for certain tasks as they usually rely on task-specific properties. In this paper, we present a new SSL framework, named Guided Collaborative Training (GCT), for pixel-wise tasks, with two main technical contributions. First, GCT addresses the issues caused by the dense outputs through a novel flaw detector. Second, the modules in GCT learn from unlabeled data collaboratively through two newly proposed constraints that are independent of task-specific properties. As a result, GCT can be applied to a wide range of pixel-wise tasks without structural adaptation. Our extensive experiments on four challenging vision tasks, including semantic segmentation, real image denoising, portrait image matting, and night image enhancement, show that GCT outperforms state-of-the-art SSL methods by a large margin."



Paperid:577
Authors:Haixin Wang, Tianhao Zhang, Muzhi Yu, Jinan Sun, Wei Ye, Chen Wang , Shikun Zhang
Title: Stacking Networks Dynamically for Image Restoration Based on the Plug-and-Play Framework
Abstract:
Recently, stacked networks have shown powerful performance in image restoration, including challenging motion deblurring problems. However, the number of stacking levels is a hyper-parameter tuned manually, so the stacking level is static during training and lacks a theoretical explanation for its optimal setting. To address this challenge, we leverage the iterative process of the traditional plug-and-play method to provide a dynamically stacked network for image restoration. Specifically, a new degradation model with a novel update scheme is designed to integrate the deep neural network as the prior within the plug-and-play model. Compared with static stacked networks, our models are stacked dynamically during training via iterations, guided by a solid mathematical explanation. A theoretical proof of the convergence of the dynamic stacking process is provided. Experiments on the denoising datasets BSD68 and Set12 and the motion blur dataset GoPro demonstrate that our framework outperforms the state-of-the-art in terms of PSNR and SSIM without an extra training process."
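For readers unfamiliar with the plug-and-play template the abstract builds on, here is a generic sketch: a data-fidelity step for an observation y ≈ A(x) alternates with a learned denoiser acting as the prior, and every extra iteration effectively adds one more "stacking level". The paper's specific degradation model, update scheme, and stopping rule are not reproduced here; all arguments below are placeholders.

```python
import torch

def plug_and_play_restore(y, forward_op, adjoint_op, denoiser,
                          num_iters=8, step=1.0, tol=1e-4):
    """Generic plug-and-play loop: alternate a gradient step on the data term
    for y ≈ A(x) with a denoiser used as the prior. Each iteration passes the
    estimate through the network again, so the effective depth grows with the
    number of iterations."""
    x = adjoint_op(y)                        # simple initialization
    for _ in range(num_iters):
        residual = forward_op(x) - y         # data-fidelity residual
        x_data = x - step * adjoint_op(residual)
        x_new = denoiser(x_data)             # network as plug-in prior
        if (x_new - x).abs().mean() < tol:   # crude convergence check
            return x_new
        x = x_new
    return x

# toy usage: identity degradation with a box-blur stand-in for the denoiser
y = torch.rand(1, 1, 32, 32)
denoise = lambda z: torch.nn.functional.avg_pool2d(z, 3, stride=1, padding=1)
x_hat = plug_and_play_restore(y, lambda v: v, lambda v: v, denoise)
```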



Paperid:578
Authors:Ming Sun, Haoxuan Dou, Junjie Yan
Title: Efficient Transfer Learning via Joint Adaptation of Network Architecture and Weight
Abstract:
Transfer learning can boost the performance on the target task by leveraging the knowledge of the source domain. Recent works in neural architecture search (NAS), especially one-shot NAS, can aid transfer learning by establishing a sufficient network search space. However, existing NAS methods tend to approximate huge search spaces by explicitly building giant super-networks with multiple sub-paths, and discard super-network weights after a child structure is found. Both characteristics of existing approaches cause repetitive network training on source tasks in transfer learning. To remedy the above issues, we reduce the super-network size by randomly dropping connections between network blocks while embedding a larger search space. Moreover, we reuse super-network weights to avoid redundant training by proposing a novel framework consisting of two modules, the neural architecture search module for architecture transfer and the neural weight search module for weight transfer. These two modules conduct search on the target task based on a reduced super-network, so we only need to train once on the source task. We evaluate our framework on both MS-COCO and CUB-200 for the object detection and fine-grained image classification tasks, and show promising improvements with only $O(C^{N})$ super-network complexity."



Paperid:579
Authors:Congcong Li, Dawei Du, Libo Zhang, Longyin Wen, Tiejian Luo, Yanjun Wu, Pengfei Zhu
Title: Spatial Attention Pyramid Network for Unsupervised Domain Adaptation
Abstract:
Unsupervised domain adaptation is critical in various computer vision tasks, such as object detection, instance segmentation, and semantic segmentation, as it aims to alleviate performance degradation caused by domain shift. Most previous methods rely on a single-mode distribution of source and target domains to align them with adversarial learning, leading to inferior results in various scenarios. To that end, in this paper, we design a new spatial attention pyramid network for unsupervised domain adaptation. Specifically, we first build the spatial pyramid representation to capture context information of objects at different scales. Guided by the task-specific information, we combine the dense global structure representation and local texture patterns at each spatial location effectively using the spatial attention mechanism. In this way, the network is enforced to focus on the discriminative regions with context information for domain adaptation. We conduct extensive experiments on various challenging datasets for unsupervised domain adaptation on object detection, instance segmentation, and semantic segmentation, which demonstrate that our method performs favorably against state-of-the-art methods by a large margin. Our source code is available at https://isrc.iscas.ac.cn/gitlab/research/domain-adaption.



Paperid:580
Authors:Jianren Wang, Zhaoyuan Fang
Title: GSIR: Generalizable 3D Shape Interpretation and Reconstruction
Abstract:
Single image 3D shape interpretation and reconstruction are closely related to each other but have long been studied separately, often ending up with priors that are highly biased toward the training classes. Here we present an algorithm, GSIR, designed to jointly learn these two tasks in order to capture generic, class-agnostic shape priors for a better understanding of 3D geometry. We propose to recover 3D shape structures as cuboids from partially reconstructed objects and use the predicted structures to further guide 3D reconstruction. The unified framework is trained simultaneously offline to learn a generic notion and can be fine-tuned online for specific objects without any annotations. Extensive experiments on both synthetic and real data demonstrate that introducing 3D shape interpretation improves the performance of 3D reconstruction and vice versa, against the state-of-the-art algorithms on both seen and unseen categories."



Paperid:581
Authors:Qinghao Meng, Wenguan Wang, Tianfei Zhou, Jianbing Shen, Luc Van Gool , Dengxin Dai
Title: Weakly Supervised 3D Object Detection from Lidar Point Cloud
Abstract:
It is laborious to manually label point cloud data for training high-quality 3D object detectors. This work proposes a weakly supervised approach for 3D object detection, only requiring a small set of weakly annotated scenes, associated with a few precisely labeled object instances. This is achieved by a two-stage architecture design. Stage-1 learns to generate cylindrical object proposals under weak supervision, i.e., only the horizontal centers of objects are click-annotated on bird’s view scenes. Stage-2 learns to refine the cylindrical proposals to get cuboids and confidence scores, using a few well-labeled object instances. Using only 500 weakly annotated scenes and 534 precisely labeled vehicle instances, our method achieves 85∼95% of performance of current top-leading, fully supervised detectors (requiring 3,712 exhaustively and precisely annotated scenes with 15,654 instances). Moreover, with our elaborately designed network architecture, our trained model can be applied as a 3D object annotator, supporting both automatic and active (human-in-the-loop) working modes. The annotations generated by our model can be used to train 3D object detectors achieving 94% of their original performance (with manually labeled data). Our experiments also show our model's potential in boosting performance when given more training data. Above designs make our approach highly practical and introduce new opportunities for learning 3D object detection at reduced annotation cost."



Paperid:582
Authors:Inkyu Shin, Sanghyun Woo, Fei Pan, In So Kweon
Title: Two-phase Pseudo Label Densification for Self-training based Domain Adaptation
Abstract:
Recently, deep self-training approaches emerged as a powerful solution to unsupervised domain adaptation. The self-training scheme involves iterative processing of target data; it generates target pseudo labels and retrains the network. However, since only the confident predictions are taken as pseudo labels, existing self-training approaches inevitably produce sparse pseudo labels in practice. We see this as critical because the resulting insufficient training signals lead to a sub-optimal, error-prone model. In order to tackle this problem, we propose a novel Two-phase Pseudo Label Densification framework, referred to as TPLD. In the first phase, we use sliding window voting to propagate the confident predictions, utilizing inherent spatial correlations in the images. In the second phase, we perform a confidence-based easy-hard classification. For the easy samples, we now employ their full pseudo-labels. For the hard ones, we instead adopt adversarial learning to enforce hard-to-easy feature alignment. To ease the training process and avoid noisy predictions, we introduce the bootstrapping mechanism to the original self-training loss. We show the proposed TPLD can be integrated into any type of existing self-training based approaches and improve the performance significantly. Combined with the recently proposed CRST self-training framework, we achieve new state-of-the-art results on two standard UDA benchmarks."



Paperid:583
Authors:Tianlang Chen, Jiajun Deng, Jiebo Luo
Title: Adaptive Offline Quintuplet Loss for Image-Text Matching
Abstract:
Existing image-text matching approaches typically leverage a triplet loss with online hard negatives to train the model. For each image or text anchor in a training mini-batch, the model is trained to distinguish between a positive and the most confusing negative of the anchor mined from the mini-batch (i.e. the online hard negative). This strategy improves the model's capacity to discover fine-grained correspondences and non-correspondences between image and text inputs. However, the above approach has the following drawbacks: (1) the negative selection strategy still provides limited chances for the model to learn from very hard-to-distinguish cases. (2) The trained model has weak generalization capability from the training set to the testing set. (3) The penalty lacks hierarchy and adaptiveness for hard negatives with different "hardness" degrees. In this paper, we propose solutions by sampling negatives offline from the whole training set. It provides "harder" offline negatives than online hard negatives for the model to distinguish. Based on the offline hard negatives, a quintuplet loss is proposed to improve the model's generalization capability to distinguish positives and negatives. In addition, a novel loss function that combines the knowledge of positives, offline hard negatives and online hard negatives is created. It leverages offline hard negatives as the intermediary to adaptively penalize them based on their distance relations to the anchor. We evaluate the proposed training approach on three state-of-the-art image-text models on the MS-COCO and Flickr30K datasets. Significant performance improvements are observed for all the models, proving the effectiveness and generality of our approach. Code is available at https://github.com/sunnychencool/AOQ."
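The combination of online and offline hard negatives can be sketched with a pair of hinge terms, as below. This is a simplified stand-in for the adaptive quintuplet formulation (no adaptive weighting, one matching direction only); the margins, index conventions, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def combined_hinge_loss(sim, pos_idx, offline_neg_idx, margin_on=0.2, margin_off=0.2):
    """sim: (B, N) similarity of each anchor in the batch to N candidates.
    pos_idx / offline_neg_idx: (B,) indices of the ground-truth match and of a
    pre-mined offline hard negative for each anchor. The online hard negative
    is the best-scoring non-positive candidate for that anchor."""
    B = sim.size(0)
    rows = torch.arange(B, device=sim.device)
    pos = sim[rows, pos_idx]
    masked = sim.clone()
    masked[rows, pos_idx] = float("-inf")          # exclude the positive
    online_neg = masked.max(dim=1).values          # hardest in-batch negative
    offline_neg = sim[rows, offline_neg_idx]       # harder, mined offline
    loss_online = F.relu(margin_on - pos + online_neg)
    loss_offline = F.relu(margin_off - pos + offline_neg)
    return (loss_online + loss_offline).mean()

# toy usage
sim = torch.randn(4, 100)
loss = combined_hinge_loss(sim, torch.arange(4), torch.tensor([50, 51, 52, 53]))
```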



Paperid:584
Authors:Lingzhi Zhang, Tarmily Wen, Jie Min, Jiancong Wang, David Han, Jianbo Shi
Title: Learning Object Placement by Inpainting for Compositional Data Augmentation
Abstract:
We study the problem of common sense placement of visual objects in an image. This involves multiple aspects of visual recognition: the instance segmentation of the scene, 3D layout, and common knowledge of how objects are placed and where objects are moving in the 3D scene. This seemingly simple task is difficult for current learning-based approaches because of the lack of labeled training data of foreground objects paired with clean background scenes. We propose a self-learning framework that automatically generates the necessary training data without any manual labeling by detecting, cutting, and inpainting objects from an image. We propose a PlaceNet that predicts a diverse distribution of common sense locations when given a foreground object and a background scene. We show one practical use of our object placement network: augmenting training datasets by object-scene recomposition while preserving contextual relationships. We demonstrate improvement of object detection and instance segmentation performance on both Cityscape and datasets. We also show that the learned representation of our PlaceNet displays strong discriminative power in image retrieval and classification."



Paperid:585
Authors:Vage Egiazarian, Oleg Voynov, Alexey Artemov, Denis Volkhonskiy, Aleksandr Safin, Maria Taktasheva, Denis Zorin, Evgeny Burnaev
Title: Deep Vectorization of Technical Drawings
Abstract:
We present a new method for vectorization of technical line drawings, such as floor plans, architectural drawings, and 2D CAD images. Our method includes (1) a deep learning-based cleaning stage to eliminate the background and imperfections in the image and fill in missing parts, (2) a transformer-based network to estimate vector primitives, and (3) an optimization procedure to obtain the final primitive configurations. We train the networks on synthetic data, renderings of vector line drawings, and manually vectorized scans of line drawings. Our method quantitatively and qualitatively outperforms a number of existing techniques on a collection of representative technical drawings."



Paperid:586
Authors:Vladislav Ishimtsev, Alexey Bokhovkin, Alexey Artemov, Savva Ignatyev , Matthias Niessner, Denis Zorin, Evgeny Burnaev
Title: CAD-Deform: Deformable Fitting of CAD Models to 3D Scans
Abstract:
Shape retrieval and alignment are a promising avenue towards turning 3D scans into lightweight CAD representations that can be used for content creation such as mobile or AR/VR gaming scenarios. Unfortunately, CAD model retrieval is limited by the availability of models in common shape corpora (e.g., ShapeNet). In this work, we address this shortcoming by introducing CAD-Deform, a method which obtains more accurate CAD-to-scan fits by non-rigidly deforming retrieved CAD models. Our key contribution is a new non-rigid deformation model incorporating smooth transformations and preservation of sharp features, which simultaneously achieves very tight fits from CAD models to the 3D scan and maintains the clean, high-quality surface properties of hand-modeled CAD objects. A series of thorough experiments demonstrates that our method achieves significantly tighter scan-to-CAD fits, allowing a more accurate digital replica of the scanned real-world environment, while preserving important geometric features present in synthetic CAD environments."



Paperid:587
Authors:Xiaolong Ma, Wei Niu, Tianyun Zhang, Sijia Liu, Sheng Lin, Hongjia Li, Wujie Wen, Xiang Chen, Jian Tang, Kaisheng Ma, Bin Ren, Yanzhi Wang
Title: An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices
Abstract:
Weight pruning has been widely acknowledged as a straightforward and effective method to eliminate redundancy in Deep Neural Networks (DNNs), thereby achieving acceleration on various platforms. However, most pruning techniques are essentially trade-offs between model accuracy and regularity, which lead to impaired inference accuracy and limited on-device acceleration performance. To solve this problem, we introduce a new sparsity dimension, namely pattern-based sparsity, which comprises pattern and connectivity sparsity and is both highly accurate and hardware-friendly. With carefully designed patterns, the proposed pruning consistently achieves accuracy enhancement and better feature extraction ability on different DNN structures and datasets. Our pattern-aware pruning framework also performs pattern library extraction, pattern selection, pattern and connectivity pruning, and weight training simultaneously. Our approach to the new pattern-based sparsity naturally fits into compiler optimization for highly efficient DNN execution on mobile platforms. To the best of our knowledge, this is the first time that mobile devices achieve real-time inference for large-scale DNN models, thanks to the unique spatial property of pattern-based sparsity and the code generation capability of compilers."
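A rough sketch of what pattern plus connectivity sparsity means for a 3x3 convolution layer: each kernel keeps only the entries of one pattern from a small library, and the lowest-magnitude kernels are removed entirely. The pattern library here is assumed to be given and hand-picked, whereas the paper extracts and selects patterns during training; function and variable names are illustrative.

```python
import torch

def apply_pattern_and_connectivity(weight, patterns, keep_ratio=0.75):
    """weight: (Cout, Cin, 3, 3) conv weights.
    patterns: (P, 3, 3) binary masks, each keeping a fixed set of entries.
    Each kernel keeps the pattern preserving the most magnitude (pattern
    sparsity); the lowest-norm kernels are removed entirely (connectivity
    sparsity)."""
    w = weight.detach().abs()
    # score every (kernel, pattern) pair by preserved magnitude
    scores = torch.einsum('oikl,pkl->oip', w, patterns.float())
    best = scores.argmax(dim=-1)                         # (Cout, Cin)
    mask = patterns.float()[best]                        # (Cout, Cin, 3, 3)
    # connectivity pruning: drop whole kernels with the smallest norms
    norms = w.sum(dim=(2, 3)).flatten()
    k = int(keep_ratio * norms.numel())
    thresh = norms.kthvalue(norms.numel() - k + 1).values if k > 0 else norms.max() + 1
    keep = (w.sum(dim=(2, 3)) >= thresh).float().unsqueeze(-1).unsqueeze(-1)
    return weight * mask * keep

# toy usage: 4 hand-picked patterns that each keep 4 of the 9 kernel entries
pats = torch.zeros(4, 3, 3)
pats[0, :2, :2] = 1; pats[1, :2, 1:] = 1; pats[2, 1:, :2] = 1; pats[3, 1:, 1:] = 1
pruned = apply_pattern_and_connectivity(torch.randn(8, 4, 3, 3), pats)
```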



Paperid:588
Authors:Yuexin Ma, Xinge Zhu, Xinjing Cheng, Ruigang Yang, Jiming Liu, Dinesh Manocha
Title: AutoTrajectory: Label-free Trajectory Extraction and Prediction from Videos using Dynamic Points
Abstract:
Current methods for trajectory prediction operate in a supervised manner and therefore require vast quantities of corresponding ground truth data for training. In this paper, we present a novel, label-free algorithm, AutoTrajectory, for trajectory extraction and prediction that uses raw videos directly. To better capture the moving objects in videos, we introduce dynamic points. We use them to model dynamic motions by using a forward-backward extractor to keep temporal consistency and using image reconstruction to keep spatial consistency in an unsupervised manner. We then aggregate dynamic points into instance points, which represent moving objects such as pedestrians in videos. Finally, we extract trajectories by matching instance points for prediction training. To the best of our knowledge, our method is the first to achieve unsupervised learning of trajectory extraction and prediction. We evaluate the performance on well-known trajectory datasets and show that our method is effective for real-world videos and can use raw videos to further improve the performance of existing models."



Paperid:589
Authors:Sinan Tan, Weilai Xiang, Huaping Liu, Di Guo, Fuchun Sun
Title: Multi-Agent Embodied Question Answering in Interactive Environments
Abstract:
We investigate a new AI task --- Multi-Agent Interactive Question Answering --- where several agents explore the scene jointly in interactive environments to answer a question. To cooperate efficiently and answer accurately, agents must be well-organized to have a balanced work division and share knowledge about the objects involved. We address this new problem in two stages: Multi-Agent 3D Reconstruction in Interactive Environments and Question Answering. Our proposed framework features multi-layer structural and semantic memories shared by all agents, as well as a question answering model built upon a 3D-CNN network to encode the scene memories. During the reconstruction, agents simultaneously explore and scan the scene with a clear division of work, organized by next-viewpoint planning. We evaluate our framework on the IQuADv1 dataset and outperform the IQA baseline in a single-agent scenario. In multi-agent scenarios, our framework shows favorable speedups while maintaining high accuracy."



Paperid:590
Authors:Jingwen He, Yihao Liu, Yu Qiao, Chao Dong
Title: Conditional Sequential Modulation for Efficient Global Image Retouching
Abstract:
Photo retouching aims at enhancing the aesthetic visual quality of images that suffer from photographic defects such as over/under-exposure, poor contrast, and inharmonious saturation. Practically, photo retouching can be accomplished by a series of image processing operations. In this paper, we investigate some commonly-used retouching operations and mathematically find that these pixel-independent operations can be approximated or formulated by multi-layer perceptrons (MLPs). Based on this analysis, we propose an extremely light-weight framework - Conditional Sequential Retouching Network (CSRNet) - for efficient global image retouching. CSRNet consists of a base network and a condition network. The base network acts like an MLP that processes each pixel independently and the condition network extracts the global features of the input image to generate a condition vector. To realize retouching operations, we modulate the intermediate features using Global Feature Modulation (GFM), of which the parameters are transformed by the condition vector. Benefiting from the utilization of 1x1 convolutions, CSRNet contains fewer than 37k trainable parameters, which is orders of magnitude smaller than existing learning-based methods. Extensive experiments show that our method achieves state-of-the-art performance on the benchmark MIT-Adobe FiveK dataset quantitatively and qualitatively. Code is available at https://github.com/hejingwenhejingwen/CSRNet."
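A minimal sketch of the modulation scheme described above: a condition branch pools the image into a global vector, and Global Feature Modulation applies a per-channel scale and shift to the features of a pixel-independent base network built from 1x1 convolutions. Module names, widths, and the tiny condition branch below are illustrative assumptions, not the released CSRNet code.

```python
import torch
import torch.nn as nn

class GlobalFeatureModulation(nn.Module):
    """Scale-and-shift modulation of per-pixel features by a global vector."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, channels)
        self.to_beta = nn.Linear(cond_dim, channels)

    def forward(self, feat, cond):
        gamma = self.to_gamma(cond).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(cond).unsqueeze(-1).unsqueeze(-1)
        return feat * gamma + beta


class TinyRetoucher(nn.Module):
    """Toy base network of 1x1 convolutions (pixel-independent, MLP-like),
    modulated by a global condition vector from a small condition branch."""
    def __init__(self, cond_dim=32, width=64):
        super().__init__()
        self.condition = nn.Sequential(
            nn.Conv2d(3, cond_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(cond_dim, cond_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.conv1 = nn.Conv2d(3, width, 1)
        self.mod1 = GlobalFeatureModulation(width, cond_dim)
        self.conv2 = nn.Conv2d(width, width, 1)
        self.mod2 = GlobalFeatureModulation(width, cond_dim)
        self.out = nn.Conv2d(width, 3, 1)

    def forward(self, x):
        cond = self.condition(x)                       # global condition vector
        h = torch.relu(self.mod1(self.conv1(x), cond))
        h = torch.relu(self.mod2(self.conv2(h), cond))
        return self.out(h)

# usage: out = TinyRetoucher()(torch.rand(1, 3, 128, 128))
```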



Paperid:591
Authors:Enze Xie, Wenjia Wang, Wenhai Wang, Mingyu Ding, Chunhua Shen, Ping Luo
Title: Segmenting Transparent Objects in the Wild
Abstract:
Transparent objects such as windows and bottles made of glass widely exist in the real world. Segmenting transparent objects is challenging because these objects inherit diverse appearances from the image background, making them look similar to their surroundings. Besides the technical difficulty of this task, only a few previous datasets were specially designed and collected to explore this task, and most of the existing datasets have major drawbacks. They either possess a limited sample size, such as merely a thousand images without manual annotations, or they generate all images by using computer graphics methods (i.e. not real images). To address this important problem, this work proposes a large-scale dataset for transparent object segmentation, named Trans10K, consisting of 10,428 images of real scenarios with careful manual annotations, which is 10 times larger than the existing datasets. The transparent objects in Trans10K are extremely challenging due to high diversity in scale, viewpoint and occlusion as shown in Fig. 1. To evaluate the effectiveness of Trans10K, we propose a novel boundary-aware segmentation method, termed TransLab, which exploits the boundary as the clue to improve segmentation of transparent objects. Extensive experiments and ablation studies demonstrate the effectiveness of Trans10K and validate the practicality of learning object boundaries in TransLab. For example, TransLab significantly outperforms 20 recent object segmentation methods based on deep learning, showing that this task is largely unsolved. We believe that both Trans10K and TransLab make important contributions to both academia and industry, facilitating future research and applications."



Paperid:592
Authors:Chaorui Deng, Ning Ding, Mingkui Tan, Qi Wu
Title: Length-Controllable Image Captioning
Abstract:
The last decade has witnessed remarkable progress in the image captioning task; however, most existing methods cannot control their captions, e.g., choosing to describe the image either roughly or in detail. In this paper, we propose to use a simple length level embedding to endow them with this ability. Moreover, due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows. Thus, we further devise a non-autoregressive image captioning approach that can generate captions with a complexity that is independent of the caption length. We verify the merit of the proposed length level embedding on three models: two state-of-the-art (SOTA) autoregressive models with different types of decoders, as well as our proposed non-autoregressive model, to show its generalization ability. In the experiments, our length-controllable image captioning models not only achieve SOTA performance on the challenging MS COCO dataset but also generate length-controllable and diverse image captions. Specifically, our non-autoregressive model outperforms the autoregressive baselines in terms of controllability and diversity, and also significantly improves the decoding efficiency for long captions. Our code and models are released at https://github.com/bearcatt/LaBERT."
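As a rough illustration of the length level embedding idea, the sketch below adds an embedding of the desired length bucket to the token embeddings fed to a caption decoder. The vocabulary size, embedding dimension, and number of length levels are assumptions for illustration and are not taken from the paper.

```python
# Sketch of a length-level embedding: the target caption-length bucket is
# embedded and added to every token embedding. Sizes here are illustrative.
import torch
import torch.nn as nn

class LengthConditionedEmbedding(nn.Module):
    def __init__(self, vocab_size=10000, dim=512, num_levels=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, dim)
        self.len_level = nn.Embedding(num_levels, dim)  # one row per length range

    def forward(self, tokens, level):          # tokens: (N, L), level: (N,)
        return self.tok(tokens) + self.len_level(level)[:, None, :]

emb = LengthConditionedEmbedding()(
    torch.randint(0, 10000, (2, 16)), torch.tensor([0, 3]))   # -> (2, 16, 512)
```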



Paperid:593
Authors:Haochen Wang, Xudong Zhang, Yutao Hu, Yandan Yang, Xianbin Cao, Xiantong Zhen
Title: Few-Shot Semantic Segmentation with Democratic Attention Networks
Abstract:
Few-shot segmentation has recently generated great popularity, addressing a challenging yet important problem of segmenting objects from unseen categories with scarce annotated support images. The crux of few-shot segmentation is to extract object information from the support image and then propagate it to guide the segmentation of query images. In this paper, we propose the Democratic Attention Network (DAN) for few-shot semantic segmentation. We introduce the democratized graph attention mechanism, which can activate more pixels on the object to establish a robust correspondence between support and query images. Thus, the network is able to propagate more guiding information of foreground objects from support to query images, enhancing its robustness and generalizability to new objects. Furthermore, we propose multi-scale guidance by designing a refinement fusion unit to fuse features from intermediate layers for the segmentation of the query image. This offers an efficient way of leveraging multi-level semantic information to achieve more accurate segmentation. Extensive experiments on three benchmarks demonstrate that the proposed DAN achieves the new state-of-the-art performance, surpassing the previous methods by large margins. The thorough ablation studies further reveal its great effectiveness for few-shot semantic segmentation."



Paperid:594
Authors:Xiaodong Cun, Chi-Man Pun
Title: Defocus Blur Detection via Depth Distillation
Abstract:
Defocus Blur Detection (DBD) aims to separate in-focus and out-of-focus regions from a single image pixel-wise. This task has received much attention since bokeh effects are widely used in digital cameras and smartphone photography. However, identifying obscure homogeneous regions and borderline transitions in partially defocused images is still challenging. To solve these problems, we introduce depth information into DBD for the first time. When the camera parameters are fixed, we argue that the accuracy of DBD is highly related to scene depth. Hence, we consider the depth information as an approximate soft label for DBD and propose a joint learning framework inspired by knowledge distillation. In detail, we learn the defocus blur from ground truth and the depth distilled from a well-trained depth estimation network at the same time. Thus, the sharp region will provide a strong prior for depth estimation while the blur detection also gains benefits from the distilled depth. Besides, we propose a novel decoder in the fully convolutional network (FCN) as our network structure. In each level of the decoder, we design the Selective Reception Field Block (SRFB) for merging multi-scale features efficiently and reuse the side outputs as a Supervision-guided Attention Block (SAB). Unlike previous methods, the proposed decoder builds reception field pyramids and emphasizes salient regions simply and efficiently. Experiments show that our approach outperforms 11 other state-of-the-art methods on two popular datasets. Our method also runs at over 30 fps on a single GPU, which is 2x faster than previous works."



Paperid:595
Authors:Jingbo Wang, Sijie Yan, Yuanjun Xiong, Dahua Lin
Title: Motion Guided 3D Pose Estimation from Videos
Abstract:
We propose a new loss function, called motion loss, for the problem of monocular 3D human pose estimation from 2D pose. It introduces the task of reconstructing keypoint motion into the supervision. In computing the motion loss, a simple yet effective representation for keypoint motion, called multiscale motion encoding, is introduced. We design a new graph convolutional network architecture, U-shaped GCN (UGCN). It captures both short-term and long-term motion information to fully leverage the additional supervision from the motion loss. We train UGCN with the motion loss on two large-scale benchmarks: Human3.6M and MPI-INF-3DHP. Our model surpasses other state-of-the-art models by a large margin. It also demonstrates a strong capacity for producing smooth 3D sequences and recovering keypoint motion."
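A simple way to picture a motion-style loss is to penalize errors in keypoint displacements over several temporal strides in addition to the usual per-frame pose loss; the sketch below does exactly that. The strides, weighting, and L1 formulation are assumptions, and this is not the paper's multiscale motion encoding.

```python
# Sketch of an auxiliary motion loss: compare predicted and ground-truth
# keypoint displacements over several temporal strides. Strides are assumptions.
import torch

def motion_loss(pred, gt, strides=(1, 2, 4)):
    """pred, gt: (T, J, 3) sequences of 3D keypoints."""
    loss = 0.0
    for s in strides:
        pred_motion = pred[s:] - pred[:-s]     # displacement over stride s
        gt_motion = gt[s:] - gt[:-s]
        loss = loss + (pred_motion - gt_motion).abs().mean()
    return loss / len(strides)

pred = torch.randn(50, 17, 3, requires_grad=True)
gt = torch.randn(50, 17, 3)
total = (pred - gt).abs().mean() + 0.5 * motion_loss(pred, gt)  # pose + motion terms
total.backward()
```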



Paperid:596
Authors:Rui Li, Simeng Qiu, Guangming Zang, Wolfgang Heidrich
Title: Reflection Separation via Multi-bounce Polarization State Tracing
Abstract:
Reflection removal from photographs is an important task in computational photography, but also for computer vision tasks that involve imaging through windows and similar settings. Traditionally, the problem is approached as a single-reflection removal problem under very controlled scenarios. In this paper we aim to generalize reflection removal to real-world scenarios with more complicated light interactions. To this end, we propose a simple yet efficient learning framework for supervised image reflection separation with a polarization simulation engine for synthetic polarized data generation and loss function design. Instead of a conventional image sensor, we use a polarization sensor that instantaneously captures four linearly polarized photos of the scene in the same image. Through a combination of a new polarization-guided image formation model and a novel supervised learning framework for the interpretation of a ray-tracing polarized image formation model, a general method is obtained to tackle general image reflection removal problems. We demonstrate our method with extensive experiments on both real and synthetic data and show image reconstructions of unprecedented quality."



Paperid:597
Authors:Jiale Cao, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao
Title: SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation
Abstract:
Single-stage instance segmentation approaches have recently gained popularity due to their speed and simplicity, but are still lagging behind in accuracy, compared to two-stage methods. We propose a fast single-stage instance segmentation method, called SipMask, that preserves instance-specific spatial information by separating the mask prediction of an instance into different sub-regions of a detected bounding-box. Our main contribution is a novel light-weight spatial preservation (SP) module that generates a separate set of spatial coefficients for each sub-region within a bounding-box, leading to improved mask predictions. It also enables accurate delineation of spatially adjacent instances. Further, we introduce a mask alignment weighting loss and a feature alignment scheme to better correlate mask prediction with object detection. On COCO test-dev, our SipMask outperforms the existing single-stage methods. Compared to the state-of-the-art single-stage TensorMask, SipMask obtains an absolute gain of 1.0% (mask AP), while providing a four-fold speedup. In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings, while operating at comparable speed on a Titan Xp. We also evaluate our SipMask for real-time video instance segmentation, achieving promising results on the YouTube-VIS dataset. The source code is available at https://github.com/JialeCao001/SipMask."



Paperid:598
Authors:Haonan Qiu, Chaowei Xiao, Lei Yang, Xinchen Yan, Honglak Lee, Bo Li
Title: SemanticAdv: Generating Adversarial Examples via Attribute-conditioned Image Editing
Abstract:
Deep neural networks (DNNs) have achieved great successes in various vision applications due to their strong expressive power. However, recent studies have shown that DNNs are vulnerable to adversarial examples, which are manipulated instances that aim to mislead DNNs into making incorrect predictions. Currently, most such adversarial examples try to guarantee ``subtle perturbation'' by limiting the $L_p$ norm of the perturbation. In this paper, we propose SemanticAdv to generate a new type of semantically realistic adversarial examples via attribute-conditioned image editing. Compared to existing methods, our SemanticAdv enables fine-grained analysis and evaluation of DNNs with input variations in the attribute space. We conduct comprehensive experiments to show that our adversarial examples not only exhibit semantically meaningful appearances but also achieve high targeted attack success rates under both whitebox and blackbox settings. Moreover, we show that the existing pixel-based and attribute-based defense methods fail to defend against our attribute-conditioned adversarial examples. We demonstrate the applicability of SemanticAdv on both face recognition and general street-view images to show its generalization. Such non-$L_p$ bounded adversarial examples with controlled attribute manipulation can shed light on further understanding of the vulnerabilities of DNNs as well as novel defense approaches."



Paperid:599
Authors:Longrong Yang, Fanman Meng, Hongliang Li, Qingbo Wu, Qishang Cheng
Title: Learning with Noisy Class Labels for Instance Segmentation
Abstract:
Instance segmentation has achieved significant progress in the presence of correctly annotated datasets. Yet, object classes in large-scale datasets are sometimes ambiguous, which easily causes confusion. In addition, limited experience and knowledge of annotators can also lead to mislabeled object classes. To solve this issue, a novel method is proposed in this paper, which uses different losses describing different roles of noisy class labels to enhance the learning. Specifically, in instance segmentation, noisy class labels play different roles in the foreground-background sub-task and the foreground-instance sub-task. Hence, on the one hand, a noise-robust loss (e.g., symmetric loss) is used to prevent incorrect gradient guidance for the foreground-instance sub-task. On the other hand, standard cross entropy loss is used to fully exploit correct gradient guidance for the foreground-background sub-task. Extensive experiments conducted with three popular datasets (i.e., Pascal VOC, Cityscapes and COCO) have demonstrated the effectiveness of our method in a wide range of noisy class label scenarios."
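For context on the noise-robust loss mentioned above, one widely used symmetric formulation combines standard cross entropy with a reverse cross entropy term. The sketch below follows that generic recipe; the hyperparameters and the log-clamp value are assumptions, and this is not necessarily the exact loss used in the paper.

```python
# Sketch of a symmetric (noise-robust) classification loss: cross entropy plus
# a reverse cross entropy term with the log of the one-hot label clamped to A.
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=0.1, beta=1.0, A=-4.0):
    ce = F.cross_entropy(logits, targets)                  # standard term
    pred = F.softmax(logits, dim=1)
    onehot = F.one_hot(targets, logits.size(1)).float()
    # Reverse CE: roles of prediction and label are swapped; log(0) is set to A.
    log_label = torch.where(onehot > 0, torch.zeros_like(onehot),
                            torch.full_like(onehot, A))
    rce = (-pred * log_label).sum(dim=1).mean()
    return alpha * ce + beta * rce

loss = symmetric_cross_entropy(torch.randn(8, 5), torch.randint(0, 5, (8,)))
```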



Paperid:600
Authors:Junjie Zhao, Donghuan Lu, Kai Ma, Yu Zhang, Yefeng Zheng
Title: Deep Image Clustering with Category-Style Representation
Abstract:
Deep clustering which adopts deep neural networks to obtain optimal representations for clustering has been widely studied recently. In this paper, we propose a novel deep image clustering framework to learn a category-style latent representation in which the category information is disentangled from image style and can be directly used as the cluster assignment. To achieve this goal, mutual information maximization is applied to embed relevant information in the latent representation. Moreover, augmentation-invariant loss is employed to disentangle the representation into category part and style part. Last but not least, a prior distribution is imposed on the latent representation to ensure the elements of the category vector can be used as the probabilities over clusters. Comprehensive experiments demonstrate that the proposed approach outperforms state-of-the-art methods significantly on five public datasets."



Paperid:601
Authors:Yuan Tian, Zhaohui Che, Wenbo Bao, Guangtao Zhai, Zhiyong Gao
Title: Self-supervised Motion Representation via Scattering Local Motion Cues
Abstract:
Motion representation is key to many computer vision problems but has never been well studied in the literature. Existing works usually rely on the optical flow estimation to assist other tasks such as action recognition, frame prediction, video segmentation, etc. In this paper, we leverage the massive unlabeled video data to learn an accurate explicit motion representation that aligns well with the semantic distribution of the moving objects. Our method subsumes a coarse-to-fine paradigm, which first decodes the low-resolution motion maps from the rich spatial-temporal features of the video, then adaptively upsamples the low-resolution maps to the full-resolution by considering the semantic cues. To achieve this, we propose a novel context guided motion upsampling layer that leverages the spatial context of video objects to learn the upsampling parameters in an efficient way. We prove the effectiveness of our proposed motion representation method on downstream video understanding tasks, e.g., action recognition task. Experimental results show that our method performs favorably against state-of-the-art methods."



Paperid:602
Authors:Tian Chen, Shijie An, Yuan Zhang, Chongyang Ma , Huayan Wang, Xiaoyan Guo, Wen Zheng
Title: Improving Monocular Depth Estimation by Leveraging Structural Awareness and Complementary Datasets
Abstract:
Monocular depth estimation plays a crucial role in 3D recognition and understanding. One key limitation of existing approaches lies in their lack of structural information exploitation, which leads to inaccurate spatial layout, discontinuous surface, and ambiguous boundaries. In this paper, we tackle this problem in three aspects. First, to exploit the spatial relationship of visual features, we propose a structure-aware neural network with spatial attention blocks. These blocks guide the network attention to global structures or local details across different feature layers. Second, we introduce a global focal relative loss for uniform point pairs to enhance spatial constraint in the prediction, and explicitly increase the penalty on errors in depth-wise discontinuous regions, which helps preserve the sharpness of estimation results. Finally, based on analysis of failure cases for prior methods, we collect a new Hard Case (HC) Depth dataset of challenging scenes, such as special lighting conditions, dynamic objects, and tilted camera angles. The new dataset is leveraged by an informed learning curriculum that mixes training examples incrementally to handle diverse data distributions. Experimental results show that our method outperforms state-of-the-art approaches by a large margin in terms of both prediction accuracy on NYUDv2 dataset and generalization performance on unseen datasets."



Paperid:603
Authors:Junheum Park, Keunsoo Ko, Chul Lee, Chang-Su Kim
Title: BMBC: Bilateral Motion Estimation with Bilateral Cost Volume for Video Interpolation
Abstract:
Video interpolation increases the temporal resolution of a video sequence by synthesizing intermediate frames between two consecutive frames. We propose a novel deep-learning-based video interpolation algorithm based on bilateral motion estimation. First, we develop the bilateral motion network with the bilateral cost volume to estimate bilateral motions accurately. Then, we approximate bi-directional motions to predict a different kind of bilateral motion. We then warp the two input frames using the estimated bilateral motions. Next, we develop the dynamic filter generation network to yield dynamic blending filters. Finally, we combine the warped frames using the dynamic blending filters to generate intermediate frames. Experimental results show that the proposed algorithm outperforms the state-of-the-art video interpolation algorithms on several benchmark datasets."



Paperid:604
Authors:Hong Xuan, Abby Stylianou, Xiaotong Liu, Robert Pless
Title: Hard negative examples are hard, but useful
Abstract:
but useful","Triplet loss is an extremely common approach to distance metric learning. Representations of images from the same class are optimized to be mapped closer together in an embedding space than representations of images from different classes. Much work on triplet losses focuses on selecting the most useful triplets of images to consider, with strategies that select dissimilar examples from the same class or similar examples from different classes. The consensus of previous research is that optimizing with the extit{hardest} negative examples leads to bad training behavior. That's a problem -- these hardest negatives are literally the cases where the distance metric fails to capture semantic similarity. In this paper, we characterize the space of triplets and derive why hard negatives make triplet loss training fail. We offer a simple fix to the loss function and show that, with this fix, optimizing with hard negative examples becomes feasible. This leads to more generalizable features, and image retrieval results that outperform state of the art for datasets with high intra-class variance."



Paperid:605
Authors:Zechun Liu, Zhiqiang Shen, Marios Savvides, Kwang-Ting Cheng
Title: ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions
Abstract:
In this paper, we propose several ideas for enhancing a binary network to close its accuracy gap from real-valued networks without incurring any additional computational cost. We first construct a baseline network by modifying and binarizing a compact real-valued network with parameter-free shortcuts, bypassing all the intermediate convolutional layers including the downsampling layers. This baseline network strikes a good trade-off between accuracy and efficiency, achieving superior performance to most existing binary networks at approximately half of the computational cost. Through extensive experiments and analysis, we observed that the performance of binary networks is sensitive to activation distribution variations. Based on this important observation, we propose to generalize the traditional Sign and PReLU functions, denoted as RSign and RPReLU for the respective generalized functions, to enable explicit learning of the distribution reshape and shift at near-zero extra cost. Lastly, we adopt a distributional loss to further enforce the binary network to learn output distributions similar to those of a real-valued network. We show that after incorporating all these ideas, the proposed ReActNet outperforms all the state-of-the-arts by a large margin. Specifically, it outperforms Real-to-Binary Net and MeliusNet29 by 4.0% and 3.6% respectively for the top-1 accuracy and also reduces the gap to its real-valued counterpart to within 3.0% top-1 accuracy on the ImageNet dataset. Code and models are available at: https://github.com/liuzechun/ReActNet."
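The generalized activations can be pictured as channel-wise learnable shifts around Sign and PReLU; the sketch below shows one such parameterization. It is an illustration in the spirit of RSign/RPReLU, with details (initialization, the exact placement of shifts, and the straight-through gradient needed to train through binarization) treated as assumptions or omitted.

```python
# Sketch of channel-wise shifted activations in the spirit of RSign / RPReLU.
# Parameter placement and initialization are illustrative; a straight-through
# estimator (omitted) would be needed to train through the sign function.
import torch
import torch.nn as nn

class RSign(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))  # learnable threshold

    def forward(self, x):
        return torch.sign(x - self.alpha)       # binarize around a learned shift

class RPReLU(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))  # input shift
        self.zeta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # output shift
        self.prelu = nn.PReLU(channels)

    def forward(self, x):
        return self.prelu(x - self.gamma) + self.zeta

y = RPReLU(8)(RSign(8)(torch.randn(2, 8, 4, 4)))   # -> (2, 8, 4, 4)
```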



Paperid:606
Authors:Chun-Han Yao, Chen Fang, Xiaohui Shen, Yangyue Wan, Ming-Hsuan Yang
Title: Video Object Detection via Object-level Temporal Aggregation
Abstract:
While single-image object detectors can be naively applied to videos in a frame-by-frame fashion, the prediction is often temporally inconsistent. Moreover, the computation can be redundant since neighboring frames are inherently similar to each other. In this work we propose to improve video object detection via temporal aggregation. Specifically, a detection model is applied on sparse keyframes to handle new objects, occlusions, and rapid motions. We then use real-time trackers to exploit temporal cues and track the detected objects in the remaining frames, which enhances efficiency and temporal coherence. Object status at the bounding-box level is propagated across frames and updated by our aggregation modules. For keyframe scheduling, we propose adaptive policies using reinforcement learning and simple heuristics. The proposed framework achieves the state-of-the-art performance on the Imagenet VID 2015 dataset while running real-time on CPU. Extensive experiments are done to show the effectiveness of our training strategies and justify the model designs."



Paperid:607
Authors:Xiangyun Zhao, Samuel Schulter, Gaurav Sharma, Yi-Hsuan Tsai, Manmohan Chandraker, Ying Wu
Title: Object Detection with a Unified Label Space from Multiple Datasets
Abstract:
Given multiple datasets with different label spaces, the goal of this work is to train a single object detector predicting over the union of all the label spaces. The practical benefits of such an object detector are obvious and significant---application-relevant categories can be picked and merged from arbitrary existing datasets. However, naive merging of datasets is not possible in this case, due to inconsistent object annotations. Consider an object category like faces that is annotated in one dataset, but is not annotated in another dataset, although the object itself appears in the latter's images. Some categories, like face here, would thus be considered foreground in one dataset, but background in another. To address this challenge, we design a framework which works with such partial annotations, and we exploit a pseudo labeling approach that we adapt for our specific case. We propose loss functions that carefully integrate partial but correct annotations with complementary but noisy pseudo labels. Evaluation in the proposed novel setting requires full annotation on the test set. We collect the required annotations and define a new challenging experimental setup for this task based on existing public datasets. We show improved performances compared to competitive baselines and appropriate adaptations of existing work."



Paperid:608
Authors:Jonah Philion, Sanja Fidler
Title: Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D
Abstract:
The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single ``bird's-eye-view'' coordinate frame for consumption by motion planning. We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. The core idea behind our approach is to ``lift'' each image individually into a frustum of features for each camera, then ``splat'' all frustums into a rasterized bird's-eye-view grid. By training on the entire camera rig, we provide evidence that our model is able to learn not only how to represent images but how to fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error. On standard bird's-eye-view tasks such as object segmentation and map segmentation, our model outperforms all baselines and prior work. In pursuit of the goal of learning dense representations for motion planning, we show that the representations inferred by our model enable interpretable end-to-end motion planning by ``shooting'' template trajectories into a bird's-eye-view cost map output by our network. We benchmark our approach against models that use oracle depth from lidar. Project page with code: https://nv-tlabs.github.io/lift-splat-shoot."



Paperid:609
Authors:Yiwu Zhong, Liwei Wang, Jianshu Chen, Dong Yu, Yin Li
Title: Comprehensive Image Captioning via Scene Graph Decomposition
Abstract:
We address the challenging problem of image captioning by revisiting the representation of image scene graph. At the core of our method lies the decomposition of a scene graph into a set of sub-graphs, with each sub-graph capturing a semantic component of the input image. We design a deep model to select important sub-graphs, and to decode each selected sub-graph into a single target sentence. By using sub-graphs, our model is able to attend to different components of the image. Our method thus accounts for accurate, diverse, grounded and controllable captioning at the same time. We present extensive experiments to demonstrate the benefits of our comprehensive captioning model. Our method establishes new state-of-the-art results in caption diversity, grounding, and controllability, and compares favourably to latest methods in caption quality. Our project website can be found at pages.cs.wisc.edu/~yiwuzhong/Sub-GC.html."



Paperid:610
Authors:Yu-Tong Cao, Jingya Wang, Dacheng Tao
Title: Symbiotic Adversarial Learning for Attribute-based Person Search
Abstract:
Attribute-based person search is in significant demand for applications where no detected query images are available, such as identifying a criminal from witness descriptions. However, the task itself is quite challenging because there is a huge modality gap between images and physical descriptions of attributes. Often, there may also be a large number of unseen categories (attribute combinations). The current state-of-the-art methods either focus on learning better cross-modal embeddings by mining only seen data, or they explicitly use generative adversarial networks (GANs) to synthesize unseen features. The former tends to produce poor embeddings due to insufficient data, while the latter does not preserve intra-class compactness during generation. In this paper, we present a symbiotic adversarial learning framework, called SAL. Two GANs sit at the base of the framework in a symbiotic learning scheme: one synthesizes features of unseen classes/categories, while the other optimizes the embedding and performs the cross-modal alignment on the common embedding space. Specifically, the two types of generative adversarial networks learn collaboratively throughout the training process, and their interactions benefit each other. Extensive evaluations show SAL’s superiority over nine state-of-the-art methods on two challenging pedestrian benchmarks, PETA and Market-1501. The code is publicly available at https://github.com/ycao5602/SAL."



Paperid:611
Authors:Yang Liu, Qingchao Chen, Andrew Zisserman
Title: Amplifying Key Cues for Human-Object-Interaction Detection
Abstract:
Human-object interaction (HOI) detection aims to detect and recognise how people interact with the objects that surround them. This is challenging as different interaction categories are often distinguished only by very subtle visual differences in the scene. In this paper we introduce two methods to amplify key cues in the image, and also a method to combine these and other cues when considering the interaction between a human and an object. First, we introduce an encoding mechanism for representing the fine-grained spatial layout of the human and object (a subtle cue) and also semantic context (a cue, represented by text embeddings of surrounding objects). Second, we use plausible future movements of humans and objects as a cue to constrain the space of possible interactions. Third, we use a gate and memory architecture as a fusion module to combine the cues. We demonstrate that these three improvements lead to a performance which exceeds prior HOI methods across standard benchmarks by a considerable margin."



Paperid:612
Authors:Yonglong Tian, Yue Wang, Dilip Krishnan, Joshua B. Tenenbaum, Phillip Isola
Title: Rethinking Few-shot Image Classification: A Good Embedding is All You Need?
Abstract:
The focus of recent meta-learning research has been on the development of learning algorithms that can quickly adapt to test time tasks with limited data and low computational cost. Few-shot learning is widely used as one of the standard benchmarks in meta-learning. In this work, we show that a simple baseline: learning a supervised or self-supervised representation on the meta-training set, followed by training a linear classifier on top of this representation, outperforms state-of-the-art few-shot learning methods. An additional boost can be achieved through the use of self-distillation. This demonstrates that using a good learned embedding model can be more effective than sophisticated meta-learning algorithms. We believe that our findings motivate a rethinking of few-shot image classification benchmarks and the associated role of meta-learning algorithms. "
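The baseline described above is straightforward to sketch: freeze a pretrained encoder, embed the support and query images, and fit a linear classifier on the support features. The encoder choice and the random tensors standing in for an episode below are placeholders, not the paper's exact setup.

```python
# Sketch of the "embedding + linear classifier" few-shot baseline: frozen
# encoder features, logistic regression fit on the support set. Placeholders
# stand in for the meta-trained backbone and the episode data.
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

encoder = models.resnet18(weights=None)        # stand-in for a (meta-)trained backbone
encoder.fc = torch.nn.Identity()
encoder.eval()

def embed(images):                             # images: (N, 3, H, W)
    with torch.no_grad():
        return encoder(images).numpy()

# Hypothetical 5-way 5-shot episode with random tensors as placeholder data.
support_x = torch.randn(25, 3, 84, 84)
support_y = torch.arange(5).repeat_interleave(5)
query_x = torch.randn(15, 3, 84, 84)

clf = LogisticRegression(max_iter=1000).fit(embed(support_x), support_y.numpy())
pred = clf.predict(embed(query_x))             # predicted classes for the query set
```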



Paperid:613
Authors:Kyle Min, Jason J. Corso
Title: Adversarial Background-Aware Loss for Weakly-supervised Temporal Activity Localization
Abstract:
Temporally localizing activities within untrimmed videos has been extensively studied in recent years. Despite recent advances, existing methods for weakly-supervised temporal activity localization struggle to recognize when an activity is not occurring. To address this issue, we propose a novel method named A2CL-PT. Two triplets of the feature space are considered in our approach: one triplet is used to learn discriminative features for each activity class, and the other one is used to distinguish the features where no activity occurs (i.e. background features) from activity-related features for each video. To further improve the performance, we build our network using two parallel branches which operate in an adversarial way: the first branch localizes the most salient activities of a video and the second one finds other supplementary activities from non-localized parts of the video. Extensive experiments performed on THUMOS14 and ActivityNet datasets demonstrate that our proposed method is effective. Specifically, the average mAP of IoU thresholds from 0.1 to 0.9 on the THUMOS14 dataset is significantly improved from 27.9% to 30.0%."



Paperid:614
Authors:Sathyanarayanan Aakur, Sudeep Sarkar
Title: Action Localization through Continual Predictive Learning
Abstract:
The problem of action localization involves locating the action in the video, both over time and spatially in the image. The current dominant approaches use supervised learning to solve this problem. They require large amounts of annotated training data, in the form of frame-level bounding box annotations around the region of interest. In this paper, we present a new approach based on continual learning that uses feature-level predictions for self-supervision. It does not require any training annotations in terms of frame-level bounding boxes. The approach is inspired by cognitive models of visual event perception that propose a prediction-based approach to event understanding. We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for the future frames. The prediction errors are used to learn the parameters of the models continuously. This self-supervised framework is not as complicated as other approaches but is very effective in learning robust visual representations for both labeling and localization. It should be noted that the approach operates in a streaming fashion, requiring only a single pass through the video, making it amenable for real-time processing. We demonstrate this on three datasets - UCF Sports, JHMDB, and THUMOS'13 - and show that the proposed approach outperforms weakly-supervised and unsupervised baselines and obtains competitive performance compared to fully supervised baselines. Finally, we show that the proposed framework can generalize to egocentric videos and achieve state-of-the-art results on the unsupervised gaze prediction task. Code is available on the project page."



Paperid:615
Authors:Yunyu Liu, Lichen Wang, Yue Bai, Can Qin, Zhengming Ding, Yun Fu
Title: Generative View-Correlation Adaptation for Semi-Supervised Multi-View Learning
Abstract:
Multi-view learning (MVL) explores the data extracted from multiple resources. It assumes that the complementary information between different views can be revealed to further improve the learning performance. There are two challenges. First, it is difficult to effectively combine the data from different views while still fully preserving the view-specific information. Second, multi-view datasets are usually small, which easily causes overfitting for general models. To address these challenges, we propose a novel View-Correlation Adaptation (VCA) framework in a semi-supervised fashion. A semi-supervised data augmentation method is designed to generate extra features and labels based on labeled and even unlabeled samples. In addition, a cross-view adversarial training strategy is proposed to explore the structural information from one view and help the representation learning of the other view. Moreover, a simple yet effective fusion network is proposed for the late fusion stage. In our model, all networks are jointly trained in an end-to-end fashion. Extensive experiments demonstrate that our approach is effective and stable compared with other state-of-the-art methods."



Paperid:616
Authors:Minho Shim, Hsuan-I Ho, Jinhyung Kim, Dongyoon Wee
Title: READ: Reciprocal Attention Discriminator for Image-to-Video Re-Identification
Abstract:
Person re-identification (re-ID) is the problem of visually identifying a person given a database of identities. In this work, we focus on image-to-video re-ID, which compares a single query image to videos in the gallery. The main challenge is the asymmetric association between an image and a video, and overcoming the difference caused by the additional temporal dimension. To this end, we propose an attention-aware discriminator architecture. The attention occurs across different modalities, and even different identities, to aggregate useful spatio-temporal information for comparison. The information is effectively fused into a united feature, followed by the final prediction of a similarity score. The performance of the method is shown on image-to-video person re-identification benchmarks (DukeMTMC-VideoReID and MARS)."



Paperid:617
Authors:Shihao Zou, Xinxin Zuo, Yiming Qian, Sen Wang, Chi Xu, Minglun Gong , Li Cheng
Title: 3D Human Shape Reconstruction from a Polarization Image
Abstract:
This paper tackles the problem of estimating the 3D body shape of clothed humans from single polarized 2D images, i.e. polarization images. Polarization images are known to be able to capture polarized reflected lights that preserve rich geometric cues of an object, which has motivated their recent application in reconstructing the surface normal of the objects of interest. Inspired by the recent advances in human shape estimation from single color images, in this paper, we attempt to estimate human body shapes by leveraging the geometric cues from single polarization images. A dedicated two-stage deep learning approach, SfP, is proposed: given a polarization image, stage one aims at inferring the fine-detailed body surface normal; stage two reconstructs the 3D body shape with clothing details. Empirical evaluations on a synthetic dataset (SURREAL) as well as a real-world dataset (PHSPD) demonstrate the qualitative and quantitative performance of our approach in estimating human poses and shapes. This indicates that a polarization camera is a promising alternative to the more conventional color or depth imaging for human shape estimation. Further, normal maps inferred from polarization imaging play a significant role in accurately recovering the body shapes of clothed people."



Paperid:618
Authors:Pirazh Khorramshahi, Neehar Peri, Jun-cheng Chen, Rama Chellappa
Title: The Devil is in the Details: Self-Supervised Attention for Vehicle Re-Identification
Abstract:
In recent years, the research community has approached the problem of vehicle re-identification (re-id) with attention-based models, specifically focusing on regions of a vehicle containing discriminative information. These re-id methods rely on expensive key-point labels, part annotations, and additional attributes including vehicle make, model, and color. Given the large number of vehicle re-id datasets with various levels of annotations, strongly-supervised methods are unable to scale across different domains. In this paper, we present Self-supervised Attention for Vehicle Re-identification (SAVER), a novel approach to effectively learn vehicle-specific discriminative features. Through extensive experimentation, we show that SAVER improves upon the state-of-the-art on challenging VeRi, VehicleID, Vehicle-1M and VERI-Wild datasets. "



Paperid:619
Authors:Zhengyuan Yang, Tianlang Chen, Liwei Wang, Jiebo Luo
Title: Improving One-stage Visual Grounding by Recursive Sub-query Construction
Abstract:
We improve one-stage visual grounding by addressing current limitations on grounding long and complex queries. Existing one-stage methods encode the entire language query as a single sentence embedding vector, e.g., taking the embedding from BERT or the hidden state from LSTM. This single vector representation is prone to overlooking the detailed descriptions in the query. To address this query modeling deficiency, we propose a recursive sub-query construction framework, which reasons between image and query for multiple rounds and reduces the referring ambiguity step by step. We show our new one-stage method obtains 5.0%, 4.5%, 7.5%, 12.8% absolute improvements over the state-of-the-art one-stage approach on ReferItGame, RefCOCO, RefCOCO+, and RefCOCOg, respectively. In particular, superior performances on longer and more complex queries validates the effectiveness of our query modeling. Code is available at https://github.com/zyang-ur/ReSC ."



Paperid:620
Authors:Jianyi Wang, Xin Deng, Mai Xu, Congyong Chen, Yuhang Song
Title: Multi-level Wavelet-based Generative Adversarial Network for Perceptual Quality Enhancement of Compressed Video
Abstract:
The past few years have witnessed fast development in video quality enhancement via deep learning. Existing methods mainly focus on enhancing the objective quality of compressed videos while ignoring its perceptual quality. In this paper, we focus on enhancing the perceptual quality of compressed videos. Our main observation is that enhancing the perceptual quality mostly relies on recovering the high-frequency sub-bands in wavelet domain. Accordingly, we propose a novel generative adversarial network (GAN) based on multi-level wavelet packet transform (WPT) to enhance the perceptual quality of compressed videos, which is called multi-level wavelet-based GAN (MW-GAN). In the MW-GAN, we first apply motion compensation with a pyramid architecture to obtain temporal information. Then, we propose a wavelet reconstruction network with wavelet-dense residual blocks (WDRB) to recover the high-frequency details. In addition, the adversarial loss of MW-GAN is added via WPT to further encourage high-frequency details recovery for video frames. Experimental results demonstrate the superiority of our method over state-of-the-art methods in enhancing the perceptual quality of compressed videos."



Paperid:621
Authors:Haitian Zheng, Haofu Liao, Lele Chen, Wei Xiong, Tianlang Chen, Jiebo Luo
Title: Example-Guided Image Synthesis using Masked Spatial-Channel Attention and Self-Supervision
Abstract:
Example-guided image synthesis has recently been attempted to synthesize an image from a semantic label map and an exemplary image. In the task, the additional exemplar image provides the style guidance that controls the appearance of the synthesized output. Despite the controllability advantage, the existing models are designed on datasets with specific and roughly aligned objects. In this paper, we tackle a more challenging and general task, where the exemplar is a scene image that is semantically different from the given label map. To this end, we first propose a Masked Spatial-Channel Attention (MSCA) module which models the correspondence between two scenes via efficient decoupled attention. Next, we propose an end-to-end network for joint global and local feature alignment and synthesis. Finally, we propose a novel self-supervision task to enable training. Experiments on the large-scale and more diverse COCO-stuff dataset show significant improvements over the existing methods. Moreover, our approach provides interpretability and can be readily extended to other content manipulation tasks including style and spatial interpolation or extrapolation."



Paperid:622
Authors:Guangrui Li, Guoliang Kang, Wu Liu, Yunchao Wei, Yi Yang
Title: Content-Consistent Matching for Domain Adaptive Semantic Segmentation
Abstract:
This paper considers the adaptation of semantic segmentation from the synthetic source domain to the real target domain. Different from most previous explorations that often aim at developing adversarial-based domain alignment solutions, we tackle this challenging task from a new perspective, i.e., content-consistent matching (CCM). The target of CCM is to acquire those synthetic images that share a similar distribution with the real ones in the target domain, so that the domain gap can be naturally alleviated by employing the content-consistent synthetic images for training. To be specific, we facilitate the CCM from two aspects, i.e., semantic layout matching and pixel-wise similarity matching. First, we use all the synthetic images from the source domain to train an initial segmentation model, which is then employed to produce coarse pixel-level labels for the unlabeled images in the target domain. With the coarse/accurate label maps for real/synthetic images, we construct their semantic layout matrices from both horizontal and vertical directions and perform matrix matching to find the synthetic images with a semantic layout similar to real images. Second, we choose those predicted labels with high confidence to generate feature embeddings for all classes in the target domain, and further perform the pixel-wise matching on the mined layout-consistent synthetic images to harvest the appearance-consistent pixels. With the proposed CCM, only those content-consistent synthetic images are taken into account for learning the segmentation model, which can effectively alleviate the domain bias caused by those content-irrelevant synthetic images. Extensive experiments are conducted on two popular domain adaptation tasks, i.e., GTA5$\xrightarrow{}$Cityscapes and SYNTHIA$\xrightarrow{}$Cityscapes. Our CCM yields consistent improvements over the baselines and performs favorably against previous state-of-the-arts."



Paperid:623
Authors:Wenhai Wang, Xuebo Liu, Xiaozhong Ji, Enze Xie, Ding Liang, ZhiBo Yang, Tong Lu, Chunhua Shen, Ping Luo
Title: AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting
Abstract:
Scene text spotting aims to detect and recognize the entire word or sentence with multiple characters in natural images. It is still challenging because ambiguity often occurs when the spacing between characters is large or the characters are evenly spread in multiple rows and columns, making many visually plausible groupings of the characters (e.g. ``BERLIN'' is incorrectly detected as ``BERL'' and ``IN'' in Fig. 1(c)). Unlike previous works that merely employed visual features for text detection, this work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both visual and linguistic features to significantly reduce ambiguity in text detection. The proposed AE TextSpotter has three important benefits. 1) A carefully designed language module is utilized to reduce the detection confidence of incorrect text lines, making them easily pruned in the detection stage. 2) The linguistic representation is learned together with the visual representation in a framework. To our knowledge, this is the first time text detection has been improved by using a language model. 3) Extensive experiments show that AE TextSpotter outperforms other state-of-the-art methods by a large margin. For example, we carefully select a set of extremely ambiguous samples from the IC19-ReCTS dataset, where our approach surpasses other methods by more than 4%."



Paperid:624
Authors:Wei Mao, Miaomiao Liu, Mathieu Salzmann
Title: History Repeats Itself: Human Motion Prediction via Motion Attention
Abstract:
Human motion prediction aims to forecast future human poses given a past motion. Whether based on recurrent or feed-forward neural networks, existing methods fail to model the observation that human motion tends to repeat itself, even for complex sports actions and cooking activities. Here, we introduce an attention-based feed-forward network that explicitly leverages this observation. In particular, instead of modeling frame-wise attention via pose similarity, we propose to extract motion attention to capture the similarity between the current motion context and the historical motion sub-sequences. Aggregating the relevant past motions and processing the result with a graph convolutional network allows us to effectively exploit motion patterns from the long-term history to predict the future poses. Our experiments on Human3.6M, AMASS and 3DPW evidence the benefits of our approach for both periodical and non-periodical actions. Thanks to our attention model, it yields state-of-the-art results on all three datasets. Our code is available at https://github.com/wei-mao-2019/HisRepItself."
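The attention described above, between the current motion context and historical sub-sequences, can be sketched generically as follows: recent frames form the query, past windows form keys and values, and attention weights aggregate the values. Window lengths, the flattened-pose representation, and the dot-product scoring are assumptions; the paper's actual encoding and downstream processing are not reproduced.

```python
# Sketch of attention over historical motion sub-sequences: the recent context
# is the query, past sub-sequences give keys/values. Window sizes are assumed.
import torch
import torch.nn.functional as F

def motion_attention(history, context_len=10, window=10):
    """history: (T, D) flattened poses; returns (window, D) aggregated motion."""
    query = history[-context_len:].reshape(1, -1)              # current context
    keys, values = [], []
    for start in range(history.size(0) - context_len - window + 1):
        keys.append(history[start:start + context_len].reshape(-1))
        values.append(history[start + context_len:start + context_len + window])
    keys, values = torch.stack(keys), torch.stack(values)      # (N, ...), (N, window, D)
    attn = F.softmax(query @ keys.t() / keys.size(1) ** 0.5, dim=1)   # (1, N)
    return (attn[:, :, None, None] * values[None]).sum(dim=1).squeeze(0)

agg = motion_attention(torch.randn(100, 48))                   # -> (10, 48)
```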



Paperid:625
Authors:Lu Zhang, Jianming Zhang, Zhe Lin, Radomír Měch, Huchuan Lu, You He
Title: Unsupervised Video Object Segmentation with Joint Hotspot Tracking
Abstract:
Object tracking is a well-studied problem in computer vision while identifying salient spots of objects in a video is a less explored direction in the literature. Video eye gaze estimation methods aim to tackle a related task but salient spots in those methods are not bounded by objects and tend to produce very scattered, unstable predictions due to the noisy ground truth data. We reformulate the problem of detecting and tracking of salient object spots as a new task called object hotspot tracking. In this paper, we propose to tackle this task jointly with unsupervised video object segmentation, in real-time, with a unified framework to exploit the synergy between the two. Specifically, we propose a Weighted Correlation Siamese Network (WCS-Net) which employs a Weighted Correlation Block (WCB) for encoding the pixel-wise correspondence between a template frame and the search frame. In addition, WCB takes the initial mask / hotspot as guidance to enhance the influence of salient regions for robust tracking. Our system can operate online during inference and jointly produce the object mask and hotspot track-lets at 33 FPS. Experimental results validate the effectiveness of our network design, and show the benefits of jointly solving the hotspot tracking and object segmentation problems. In particular, our method performs favorably against state-of-the-art video eye gaze models in object hotspot tracking, and outperforms existing methods on three benchmark datasets for unsupervised video object segmentation."



Paperid:626
Authors:Ailing Zeng, Xiao Sun, Fuyang Huang, Minhao Liu, Qiang Xu, Stephen Lin
Title: SRNet: Improving Generalization in 3D Human Pose Estimation with a Split-and-Recombine Approach
Abstract:
Human poses that are rare or unseen in a training set are challenging for a network to predict. Similar to the long-tailed distribution problem in visual recognition, the small number of examples for such poses limits the ability of networks to model them. Interestingly, local pose distributions suffer less from the long-tail problem, i.e., local joint configurations within a rare pose may appear within other poses in the training set, making them less rare. We propose to take advantage of this fact for better generalization to rare and unseen poses. To be specific, our method splits the body into local regions and processes them in separate network branches, utilizing the property that a joint’s position depends mainly on the joints within its local body region. Global coherence is maintained by recombining the global context from the rest of the body into each branch as a low-dimensional vector. With the reduced dimensionality of less relevant body areas, the training set distribution within network branches more closely reflects the statistics of local poses instead of global body poses, without sacrificing information important for joint inference. The proposed split-and-recombine approach, called SRNet, can be easily adapted to both single-image and temporal models, and it leads to appreciable improvements in the prediction of rare and unseen poses."



Paperid:627
Authors:Jeong gi Kwak, David K. Han, Hanseok Ko
Title: CAFE-GAN: Arbitrary Face Attribute Editing with Complementary Attention Feature
Abstract:
The goal of face attribute editing is altering a facial image according to given target attributes such as hair color, mustache, gender, etc. It belongs to the image-to-image domain transfer problem with a set of attributes considered as a distinctive domain. There have been some works in multi-domain transfer problem focusing on facial attribute editing employing Generative Adversarial Network (GAN). These methods have reported some successes but they also result in unintended changes in facial regions - meaning the generator alters regions unrelated to the specified attributes. To address this unintended altering problem, we propose a novel GAN model which is designed to edit only the parts of a face pertinent to the target attributes by the concept of Complementary Attention Feature (CAFE). CAFE identifies the facial regions to be transformed by considering both target attributes as well as ``complementary attributes'', which we define as those attributes absent in the input facial image. In addition, we introduce a complementary feature matching to help in training the generator for utilizing the spatial information of attributes. Effectiveness of the proposed method is demonstrated by analysis and comparison study with state-of-the-art methods."



Paperid:628
Authors:Xin Lu, Quanquan Li, Buyu Li, Junjie Yan
Title: MimicDet: Bridging the Gap Between One-Stage and Two-Stage Object Detection
Abstract:
Modern object detection methods can be divided into one-stage approaches and two-stage ones. One-stage detectors are more efficient owing to their straightforward architectures, but two-stage detectors still take the lead in accuracy. Although recent works try to improve one-stage detectors by imitating the structural design of two-stage ones, the accuracy gap is still significant. In this paper, we propose MimicDet, a novel and efficient framework to train a one-stage detector by directly mimicking the two-stage features, aiming to bridge the accuracy gap between one-stage and two-stage detectors. Unlike conventional mimic methods, MimicDet has a shared backbone for the one-stage and two-stage detectors, then it branches into two heads which are well designed to have compatible features for mimicking. Thus MimicDet can be trained end-to-end without pre-training the teacher network. Moreover, the memory cost does not increase much, which makes it practical to adopt large networks as backbones. We also make several specialized designs such as dual-path mimicking and a staggered feature pyramid to facilitate the mimicking process. Experiments on the challenging COCO detection benchmark demonstrate the effectiveness of MimicDet. It achieves 46.1 mAP with a ResNeXt-101 backbone on the COCO test-dev set, which significantly surpasses current state-of-the-art methods."



Paperid:629
Authors:Jianghong Ma, Yang Liu
Title: Latent Topic-aware Multi-Label Classification
Abstract:
In real-world applications, data are often associated with different labels. Although most extant multi-label learning algorithms consider the label correlations, they rarely consider the topic information hidden in the labels, where each topic is a group of related labels and different topics have different groups of labels. In our study, we assume that there exists a common feature representation for the labels in each topic. Then, feature-label correlation can be exploited in the latent topic space. This paper shows that sample and feature extraction, two important procedures for removing noisy and redundant information encoded in training samples from both the sample and feature perspectives, can be effectively and efficiently performed in the latent topic space by considering topic-based feature-label correlation. Empirical studies on several benchmarks demonstrate the effectiveness and efficiency of the proposed topic-aware framework."



Paperid:630
Authors:Xiangxi Shi, Xu Yang, Jiuxiang Gu, Shafiq Joty, Jianfei Cai
Title: Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning
Abstract:
Change Captioning is a task that aims to describe the difference between images with natural language. Most existing methods treat this problem as a difference judgment, assuming no distractors such as viewpoint changes are present. However, in practice, viewpoint changes happen often and can overwhelm the semantic difference to be described. In this paper, we propose a novel visual encoder to explicitly distinguish viewpoint changes from semantic changes in the change captioning task. Moreover, we further simulate the attention preference of humans and propose a novel reinforcement learning process to fine-tune the attention directly with the language evaluation rewards. Extensive experimental results show that our method outperforms the state-of-the-art approaches by a large margin on both the Spot-the-Diff and CLEVR-Change datasets."



Paperid:631
Authors:Taekyung Kim, Changick Kim
Title: Attract, Perturb, and Explore: Learning a Feature Alignment Network for Semi-supervised Domain Adaptation
Abstract:
Although unsupervised domain adaptation methods have been widely adopted across several computer vision tasks, it is more desirable if we can exploit a few labeled data from new domains encountered in a real application. The novel setting of the semi-supervised domain adaptation (SSDA) problem shares the challenges with the domain adaptation problem and the semi-supervised learning problem. However, a recent study shows that conventional domain adaptation and semi-supervised learning methods often result in less effective or negative transfer in the SSDA problem. In order to interpret the observation and address the SSDA problem, in this paper, we raise the intra-domain discrepancy issue within the target domain, which has never been discussed so far. Then, we demonstrate that addressing the intra-domain discrepancy leads to the ultimate goal of the SSDA problem. We propose an SSDA framework that aims to align features via alleviation of the intra-domain discrepancy. Our framework mainly consists of three schemes, i.e., attraction, perturbation, and exploration. First, the attraction scheme globally minimizes the intra-domain discrepancy within the target domain. Second, we demonstrate the incompatibility of the conventional adversarial perturbation methods with SSDA. Then, we present a domain adaptive adversarial perturbation scheme, which perturbs the given target samples in a way that reduces the intra-domain discrepancy. Finally, the exploration scheme locally aligns features in a class-wise manner complementary to the attraction scheme by selectively aligning unlabeled target features complementary to the perturbation scheme. We conduct extensive experiments on domain adaptation benchmark datasets such as DomainNet, Office-Home, and Office. Our method achieves state-of-the-art performances on all datasets."



Paperid:632
Authors:Luyu Yang, Yogesh Balaji, Ser-Nam Lim, Abhinav Shrivastava
Title: Curriculum Manager for Source Selection in Multi-Source Domain Adaptation
Abstract:
The performance of Multi-Source Unsupervised Domain Adaptation (MS-UDA) depends significantly on the effectiveness of transfer from labeled source domain samples. In this paper, we propose an adversarial agent that learns a dynamic curriculum for source samples, called Curriculum Manager for Source Selection (CMSS). The curriculum manager, an independent network module, constantly updates the curriculum during training and iteratively learns which domains or samples are best suited for aligning to the target. The intuition behind this is to force the curriculum manager to constantly re-measure the transferability of latent domains over time to adversarially raise the error rate of the domain discriminator. CMSS does not require any knowledge of the domain labels, yet it outperforms other methods on four well-known benchmarks by significant margins. We also provide interpretable results that shed light on the proposed method."
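As a rough illustration of the curriculum-weighting idea described above, the sketch below weights the source term of a standard domain-discriminator loss by per-sample scores produced by a hypothetical curriculum network; the softmax normalization and the exact loss form are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch: curriculum-weighted adversarial domain alignment (illustrative, not CMSS's code).
import torch
import torch.nn.functional as F

def curriculum_weighted_domain_loss(src_logits, tgt_logits, curriculum_scores):
    """src_logits/tgt_logits: (N, 1) domain-discriminator outputs; curriculum_scores: (N,) raw scores."""
    # Convert raw curriculum scores into per-sample weights that sum to the batch size (assumption).
    w = torch.softmax(curriculum_scores, dim=0) * curriculum_scores.numel()
    src_loss = F.binary_cross_entropy_with_logits(
        src_logits.squeeze(1), torch.ones(src_logits.size(0)), weight=w)
    tgt_loss = F.binary_cross_entropy_with_logits(
        tgt_logits.squeeze(1), torch.zeros(tgt_logits.size(0)))
    # The discriminator minimizes this loss; a curriculum manager trained adversarially to raise it
    # would down-weight source samples that are hard to align with the target.
    return src_loss + tgt_loss
```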



Paperid:633
Authors:Ronghao Guo, Chen Lin, Chuming Li, Keyu Tian, Ming Sun, Lu Sheng, Junjie Yan
Title: Powering One-shot Topological NAS with Stabilized Share-parameter Proxy
Abstract:
One-shot NAS methods have attracted much interest from the research community due to their remarkable training efficiency and capacity to discover high-performance models. However, the search spaces of previous one-shot based works usually relied on hand-crafted design and were short on flexibility in network topology. In this work, we try to enhance one-shot NAS by exploring high-performing network architectures in our large-scale Topology Augmented Search Space (i.e., over 3.4×10^10 different topological structures). Specifically, the difficulties of architecture search in such a complex space have been eliminated by the proposed stabilized share-parameter proxy, which employs Stochastic Gradient Langevin Dynamics to enable fast shared-parameter sampling, so as to achieve stabilized measurement of architecture performance even in search spaces with complex topological structures. The proposed method, namely Stabilized Topological Neural Architecture Search (ST-NAS), achieves state-of-the-art performance under Multiply-Adds (MAdds) constraints on ImageNet. Our lite model ST-NAS-A achieves 76.4% top-1 accuracy with only 326M MAdds. Our moderate model ST-NAS-B achieves 77.9% top-1 accuracy with only 503M MAdds. Both of our models offer superior performance in comparison to other concurrent works on one-shot NAS."



Paperid:634
Authors:Haoran Wang, Tong Shen, Wei Zhang, Ling-Yu Duan, Tao Mei
Title: Classes Matter: A Fine-grained Adversarial Approach to Cross-domain Semantic Segmentation
Abstract:
Despite great progress in supervised semantic segmentation, a large performance drop is usually observed when deploying the model in the wild. Domain adaptation methods tackle the issue by aligning the source domain and the target domain. However, most existing methods attempt to perform the alignment from a holistic view, ignoring the underlying class-level data structure in the target domain. To fully exploit the supervision in the source domain, we propose a fine-grained adversarial learning strategy for class-level feature alignment while preserving the internal structure of semantics across domains. We adopt a fine-grained domain discriminator that not only plays as a domain distinguisher, but also differentiates domains at class level. The traditional binary domain labels are also generalized to domain encodings as the supervision signal to guide a fine-grained feature alignment. An analysis with Class Center Distance (CCD) validates that our fine-grained adversarial strategy achieves better class-level alignment compared to other state-of-the-art methods. Our method is easy to implement and its effectiveness is evaluated on three classical domain adaptation tasks, i.e., GTA5 $\to$ Cityscapes, SYNTHIA $\to$ Cityscapes, Cityscapes $\to$ Cross-City. Large performance gains show that our method outperforms other global feature alignment based and class-wise alignment based counterparts. The code is publicly available at https://github.com/JDAI-CV/FADA."



Paperid:635
Authors:Tianheng Cheng, Xinggang Wang, Lichao Huang, Wenyu Liu
Title: Boundary-preserving Mask R-CNN
Abstract:
Tremendous efforts have been made to improve mask localization accuracy in instance segmentation. Modern instance segmentation methods relying on fully convolutional networks perform pixel-wise classification, which ignores object boundaries and shapes, leading to coarse and indistinct mask predictions and imprecise localization. To remedy this, we propose a conceptually simple yet effective Boundary-preserving Mask R-CNN (BMask R-CNN) to leverage object boundary information to improve mask localization accuracy. BMask R-CNN contains a boundary-preserving mask head in which object boundary and mask are mutually learned via feature fusion blocks. As a result, the mask predictions are better aligned with object boundaries. Without bells and whistles, BMask R-CNN outperforms Mask R-CNN by a considerable margin on the COCO dataset; on the Cityscapes dataset, where more accurate boundary ground truths are available, BMask R-CNN obtains remarkable improvements over Mask R-CNN. Besides, it is not surprising to observe that BMask R-CNN obtains more obvious improvements when the evaluation criterion requires better localization (e.g., AP75) as shown in Fig. 1. Code and models are available at \url{https://github.com/hustvl/BMaskR-CNN}."



Paperid:636
Authors:Xueting Li, Sifei Liu, Kihwan Kim, Shalini De Mello, Varun Jampani, Ming-Hsuan Yang, Jan Kautz
Title: Self-supervised Single-view 3D Reconstruction via Semantic Consistency
Abstract:
We learn a self-supervised, single-view 3D reconstruction model that predicts the 3D mesh shape, texture and camera pose of a target object from a collection of 2D images and silhouettes. The proposed method does not necessitate 3D supervision, manually annotated keypoints, multi-view images of an object or a prior 3D template. The key insight of our work is that objects can be represented as a collection of deformable parts, and each part is semantically coherent across different instances of the same category (e.g., wings on birds and wheels on cars). Therefore, by leveraging part segmentation learned in a self-supervised manner from a large collection of category-specific images, we can effectively enforce semantic consistency between the reconstructed meshes and the original images. This significantly reduces ambiguities during joint prediction of shape and camera pose of an object, along with texture. To the best of our knowledge, we are the first to try and solve the single-view reconstruction problem without a category-specific template mesh or semantic keypoints. Thus our model can easily generalize to various object categories without such labels, e.g., horses, penguins, etc. Through a variety of experiments on several categories of deformable and rigid objects, we demonstrate that our unsupervised method performs comparably to, if not better than, existing category-specific reconstruction methods learned with supervision."



Paperid:637
Authors:Benlin Liu, Yongming Rao, Jiwen Lu, Jie Zhou, Cho-Jui Hsieh
Title: MetaDistiller: Network Self-Boosting via Meta-Learned Top-Down Distillation
Abstract:
Knowledge Distillation (KD) has been one of the most popular methods for learning a compact model. However, it still suffers from a high demand in time and computational resources caused by the sequential training pipeline. Furthermore, the soft targets from deeper models do not often serve as good cues for shallower models due to the compatibility gap. In this work, we consider these two problems at the same time. Specifically, we propose that better soft targets with higher compatibility can be generated by using a label generator to fuse the feature maps from deeper stages in a top-down manner, and we employ the meta-learning technique to optimize this label generator. Utilizing the soft targets learned from the intermediate feature maps of the model, we can achieve better self-boosting of the network in comparison with the state-of-the-art. The experiments are conducted on two standard classification benchmarks, namely CIFAR-100 and ILSVRC2012. We test various network architectures to show the generalizability of our MetaDistiller. The experimental results on both datasets strongly demonstrate the effectiveness of our method."



Paperid:638
Authors:Yuliang Zou, Pan Ji, Quoc-Huy Tran, Jia-Bin Huang, Manmohan Chandraker
Title: Learning Monocular Visual Odometry via Self-Supervised Long-Term Modeling
Abstract:
Monocular visual odometry (VO) suffers severely from error accumulation during frame-to-frame pose estimation. In this paper, we present a self-supervised learning method for VO with special consideration for consistency over longer sequences. To this end, we model the long-term dependency in pose prediction using a pose network that features a two-layer convolutional LSTM module. We train the networks with purely self-supervised losses, including a cycle consistency loss that mimics the loop closure module in geometric VO. Inspired by prior geometric systems, we allow the networks to see beyond a small temporal window during training, through a novel loss that incorporates temporally distant (e.g., $O(100)$) frames. Given GPU memory constraints, we propose a stage-wise training mechanism, where the first stage operates in a local time window and the second stage refines the poses with a ``global'' loss given the first stage features. We demonstrate competitive results on several standard VO datasets, including KITTI and TUM RGB-D."



Paperid:639
Authors:Tao Wang, Yu Li, Bingyi Kang, Junnan Li, Junhao Liew, Sheng Tang, Steven Hoi, Jiashi Feng
Title: The Devil is in Classification: A Simple Framework for Long-tail Instance Segmentation
Abstract:
Most existing object instance detection and segmentation models only work well on fairly balanced benchmarks where per-category training sample numbers are comparable, such as COCO. They tend to suffer performance drop on realistic datasets that are usually long-tailed. This work aims to study and address such open challenges. Specifically, we systematically investigate performance drop of the state-of-the-art two-stage instance segmentation model Mask R-CNN on the recent long-tail LVIS dataset, and unveil that a major cause is the inaccurate classification of object proposals. Based on such an observation, we first consider various techniques for improving long-tail classification performance which indeed enhance instance segmentation results. We then propose a simple calibration framework to more effectively alleviate classification head bias with a bi-level class balanced sampling approach. Without bells and whistles, it significantly boosts the performance of instance segmentation for tail classes on the recent LVIS dataset and our sampled COCO-LT dataset. Our analysis provides useful insights for solving long-tail instance detection and segmentation problems, and the straightforward \emph{SimCal} method can serve as a simple but strong baseline. With the method we have won the 2019 LVIS challenge. Codes and models are available at \url{https://github.com/twangnh/SimCal}."



Paperid:640
Authors:Guanying Chen, Michael Waechter, Boxin Shi, Kwan-Yee K. Wong, Yasuyuki Matsushita
Title: What is Learned in Deep Uncalibrated Photometric Stereo?
Abstract:
This paper aims to discover what a deep uncalibrated photometric stereo network learns to resolve the problem’s inherent ambiguity, and to design an effective network architecture based on this new insight to improve performance. The recently proposed deep uncalibrated photometric stereo method achieved promising results in estimating directional lightings. However, what specifically inside the network contributes to its success remains a mystery. In this paper, we analyze the features learned by this method and find that they strikingly resemble attached shadows, shadings, and specular highlights, which are known to provide useful clues in resolving the generalized bas-relief (GBR) ambiguity. Based on this insight, we propose a guided calibration network, named GCNet, that explicitly leverages object shape and shading information for improved lighting estimation. Experiments on synthetic and real datasets show that GCNet achieves improved results in lighting estimation for photometric stereo, which echoes the findings of our analysis. We further demonstrate that GCNet can be directly integrated with existing calibrated methods to achieve improved results on surface normal estimation."



Paperid:641
Authors:Vishwanath A. Sindagi, Poojan Oza, Rajeev Yasarla, Vishal M. Patel
Title: Prior-based Domain Adaptive Object Detection for Hazy and Rainy Conditions
Abstract:
Adverse weather conditions such as haze and rain corrupt the quality of captured images, which causes detection networks trained on clean images to perform poorly on these corrupted images. To address this issue, we propose an unsupervised prior-based domain adversarial object detection framework for adapting the detectors to hazy and rainy conditions. In particular, we use weather-specific prior knowledge obtained using the principles of image formation to define a novel prior-adversarial loss. The prior-adversarial loss, which we use to supervise the adaptation process, aims to reduce the weather-specific information in the features, thereby mitigating the effects of weather on the detection performance. Additionally, we introduce a set of residual feature recovery blocks in the object detection pipeline to de-distort the feature space, resulting in further improvements. Evaluations performed on various datasets (Foggy-Cityscapes, Rainy-Cityscapes, RTTS and UFDD) for rainy and hazy conditions demonstrate the effectiveness of the proposed approach."



Paperid:642
Authors:Mo Zhou, Zhenxing Niu, Le Wang, Qilin Zhang, Gang Hua
Title: Adversarial Ranking Attack and Defense
Abstract:
Deep Neural Network (DNN) classifiers are vulnerable to adversarial attack, where an imperceptible perturbation could result in misclassification. However, the vulnerability of DNN-based image ranking systems remains under-explored. In this paper, we propose two attacks against deep ranking systems, i.e., Candidate Attack and Query Attack, that can raise or lower the rank of chosen candidates by adversarial perturbations. Specifically, the expected ranking order is first represented as a set of inequalities, and then a triplet-like objective function is designed to obtain the optimal perturbation. Conversely, a defense method is also proposed to improve the ranking system robustness, which can mitigate all the proposed attacks simultaneously. Our adversarial ranking attacks and defense are evaluated on datasets including MNIST, Fashion-MNIST, and Stanford-Online-Products. Experimental results demonstrate that a typical DNN-based ranking system can be effectively compromised by our attacks. Meanwhile, the system robustness can be moderately improved with our defense. Furthermore, the transferable and universal properties of our adversary illustrate the possibility of a realistic black-box attack."



Paperid:643
Authors:Saimunur Rahman, Lei Wang, Changming Sun, Luping Zhou
Title: ReDro: Efficiently Learning Large-sized SPD Visual Representation
Abstract:
Symmetric positive definite (SPD) matrix has recently been used as an effective visual representation. When learning this representation in deep networks, eigen-decomposition of covariance matrix is usually needed for a key step called matrix normalisation. This could result in significant computational cost, especially when facing the increasing number of channels in recent advanced deep networks. This work proposes a novel scheme called Relation Dropout (ReDro). It is inspired by the fact that eigen-decomposition of a \textit{block diagonal} matrix can be efficiently obtained by decomposing each of its diagonal square matrices, which are of smaller sizes. Instead of using a full covariance matrix as in the literature, we generate a block diagonal one by randomly grouping the channels and only considering the covariance within the same group. We insert ReDro as an additional layer before the step of matrix normalisation and make its random grouping transparent to all subsequent layers. Additionally, we can view the ReDro scheme as a dropout-like regularisation, which drops the channel relationship across groups. As experimentally demonstrated, for the SPD methods typically involving the matrix normalisation step, ReDro can effectively help them reduce computational cost in learning large-sized SPD visual representation and also help to improve image recognition performance."
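A minimal sketch of the random channel grouping idea behind ReDro, assuming a flattened feature map of shape (C, N) and a group count that divides C; it fills only the within-group covariance blocks, leaving cross-group entries at zero. This is an illustration of the idea, not the authors' implementation.

```python
# Hedged sketch: block-diagonal covariance from randomly grouped channels.
import torch

def block_diagonal_covariance(X, num_groups):
    """X: (C, N) channel-by-position features; num_groups must divide C."""
    C, N = X.shape
    perm = torch.randperm(C)                          # random channel grouping
    groups = perm.view(num_groups, C // num_groups)
    Xc = X - X.mean(dim=1, keepdim=True)              # center features
    cov = X.new_zeros(C, C)
    for g in groups:
        Xg = Xc[g]                                    # (C/G, N) channels of one group
        # write the within-group covariance block; cross-group relations are dropped
        cov[g.unsqueeze(1), g.unsqueeze(0)] = Xg @ Xg.t() / (N - 1)
    return cov, groups
```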



Paperid:644
Authors:Wanhua Li, Yueqi Duan, Jiwen Lu, Jianjiang Feng, Jie Zhou
Title: Graph-Based Social Relation Reasoning
Abstract:
Human beings are fundamentally sociable --- we generally organize our social lives in terms of relations with other people. Understanding social relations from an image has great potential for intelligent systems such as social chatbots and personal assistants. In this paper, we propose a simpler, faster, and more accurate method named graph relational reasoning network (GR$^2$N) for social relation recognition. Different from existing methods which process all social relations on an image independently, our method considers the paradigm of jointly inferring the relations by constructing a social relation graph. Furthermore, the proposed GR$^2$N constructs several virtual relation graphs to explicitly grasp the strong logical constraints among different types of social relations. Experimental results illustrate that our method generates a reasonable and consistent social relation graph and improves the performance in both accuracy and efficiency."



Paperid:645
Authors:Tengteng Huang, Zhe Liu, Xiwu Chen, Xiang Bai
Title: EPNet: Enhancing Point Features with Image Semantics for 3D Object Detection
Abstract:
In this paper, we aim at addressing two critical issues in the 3D detection task: the exploitation of multiple sensors (namely LiDAR point cloud and camera image), and the inconsistency between the localization and classification confidence. To this end, we propose a novel fusion module to enhance the point features with semantic image features in a point-wise manner. In addition, a consistency enforcing loss is employed to explicitly encourage the consistency of both the localization and classification confidence. We design an end-to-end learnable framework named EPNet to integrate these two components. Extensive experiments on the KITTI and SUN-RGBD datasets demonstrate the superiority of EPNet over the state-of-the-art methods."



Paperid:646
Authors:Jiaxiang Shang, Tianwei Shen, Shiwei li, Lei Zhou, Mingmin Zhen, Tian Fang, Long Quan
Title: Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency
Abstract:
Recent learning-based approaches, in which models are trained on single-view images, have shown promising results for monocular 3D face reconstruction, but they suffer from the ill-posed face pose and depth ambiguity. In contrast to previous works that only enforce 2D feature constraints, we propose a self-supervised training architecture that leverages multi-view geometry consistency, which provides reliable constraints on face pose and depth estimation. We first propose an occlusion-aware view synthesis method to apply multi-view geometry consistency to self-supervised learning. Then we design three novel loss functions for multi-view consistency, including the pixel consistency loss, the depth consistency loss, and the facial landmark-based epipolar loss. Our method is accurate and robust, especially under large variations of expressions, poses, and illumination conditions. Comprehensive experiments on the face alignment and 3D face reconstruction benchmarks have demonstrated superiority over state-of-the-art methods."



Paperid:647
Authors:Jiajun Tang, Jin Xia, Xinzhi Mu, Bo Pang, Cewu Lu
Title: Asynchronous Interaction Aggregation for Action Detection
Abstract:
Understanding interaction is an essential part of video action detection. We propose the Asynchronous Interaction Aggregation network (AIA) that leverages different interactions to boost action detection. There are two key designs in it: one is the Interaction Aggregation structure (IA), which adopts a uniform paradigm to model and integrate multiple types of interaction; the other is the Asynchronous Memory Update algorithm (AMU), which enables us to achieve better performance by dynamically modeling very long-term interaction without huge computation cost. We provide empirical evidence to show that our network can gain notable accuracy from the integrative interactions and is easy to train end-to-end. Our method reports new state-of-the-art performance on the AVA dataset, with a 3.7 mAP gain (12.6% relative improvement) on the validation split compared to our strong baseline. The results on the UCF101-24 and EPIC-Kitchens datasets further illustrate the effectiveness of our approach. Source code will be made public at: https://github.com/MVIG-SJTU/AlphAction ."



Paperid:648
Authors:Shubham Goel, Angjoo Kanazawa, Jitendra Malik
Title: Shape and Viewpoint without Keypoints
Abstract:
We present a learning framework that learns to recover the 3D shape, pose and texture from a single image, trained on an image collection without any ground truth 3D shape, multi-view, camera viewpoint or keypoint supervision. We approach this highly under-constrained problem in an ``analysis by synthesis'' framework where the goal is to predict the likely shape, texture and camera viewpoint that could produce the image with various learned category-specific priors. Our particular contribution in this paper is a representation of the distribution over cameras, which we call ``camera-multiplex''. Instead of picking a point estimate, we maintain a set of camera hypotheses that are optimized during training to best explain the image given the current shape and texture. We call our approach Unsupervised Category-Specific Mesh Reconstruction (U-CMR), and present qualitative and quantitative results on CUB, Pascal 3D and new web-scraped datasets. We obtain state-of-the-art camera prediction results and show that we can learn to predict diverse shapes and textures across objects using an image collection without any keypoint annotations or 3D ground truth. Project page: https://shubham-goel.github.io/ucmr"



Paperid:649
Authors:Jiaxin Chen, Jie Qin, Yuming Shen, Li Liu, Fan Zhu, Ling Shao
Title: Learning Attentive and Hierarchical Representations for 3D Shape Recognition
Abstract:
This paper proposes a novel method for 3D shape representation learning, namely Hyperbolic Embedded Attentive Representation (HEAR). Different from existing multi-view based methods, HEAR develops a unified framework to address both multi-view redundancy and single-view incompleteness. Specifically, HEAR firstly employs a hybrid attention (HA) module, which consists of a view-agnostic attention (VAA) block and a view-specific attention (VSA) block. These two blocks jointly explore distinct but complementary spatial saliency of local features for each single-view image. Subsequently, a multi-granular view pooling (MVP) module is introduced to aggregate the multi-view features with different granularities in a coarse-to-fine manner. The resulting feature set implicitly has hierarchical relations, which are therefore projected into a Hyperbolic space by adopting the Hyperbolic embedding. A hierarchical representation is learned by Hyperbolic multi-class logistic regression based on the Hyperbolic geometry. Experimental results clearly show that HEAR outperforms the state-of-the-art approaches on three 3D shape recognition tasks including generic 3D shape retrieval, 3D shape classification and sketch-based 3D shape retrieval. "



Paperid:650
Authors:Yibo Hu, Xiang Wu, Ran He
Title: TF-NAS: Rethinking Three Search Freedoms of Latency-Constrained Differentiable Neural Architecture Search
Abstract:
With the flourishing of differentiable neural architecture search (NAS), automatically searching for latency-constrained architectures offers a new perspective for reducing human labor and expertise. However, the searched architectures are usually suboptimal in accuracy and may have large jitters around the target latency. In this paper, we rethink three freedoms of differentiable NAS, i.e. operation-level, depth-level and width-level, and propose a novel method, named Three-Freedom NAS (TF-NAS), to achieve both good classification accuracy and precise latency constraint. For the operation-level, we present a bi-sampling search algorithm to moderate the operation collapse. For the depth-level, we introduce a sink-connecting search space to ensure the mutual exclusion between skip and other candidate operations, as well as eliminate the architecture redundancy. For the width-level, we propose an elasticity-scaling strategy that achieves precise latency constraint in a progressively fine-grained manner. Experiments on ImageNet demonstrate the effectiveness of TF-NAS. Particularly, our searched TF-NAS-A obtains 76.9% top-1 accuracy, achieving state-of-the-art results with less latency. Code is available at https://github.com/AberHu/TF-NAS."



Paperid:651
Authors:Shengyi Qian, Linyi Jin, David F. Fouhey
Title: Associative3D: Volumetric Reconstruction from Sparse Views
Abstract:
This paper studies the problem of 3D volumetric reconstruction from two views of a scene with an unknown camera. While seemingly easy for humans, this problem poses many challenges for computers since it requires simultaneously reconstructing objects in the two views while also figuring out their relationship. We propose a new approach that estimates reconstructions, distributions over the camera/object and camera/camera transformations, as well as an inter-view object affinity matrix. This information is then jointly reasoned over to produce the most likely explanation of the scene. We train and test our approach on a dataset of indoor scenes, and rigorously evaluate the merits of our joint reasoning approach. Our experiments show that it is able to recover reasonable scenes from sparse views, while the problem is still challenging. Project site: https://jasonqsy.github.io/Associative3D"



Paperid:652
Authors:Yongqiang Mou, Lei Tan, Hui Yang, Jingying Chen, Leyuan Liu, Rui Yan, Yaohong Huang
Title: PlugNet: Degradation Aware Scene Text Recognition Supervised by a Pluggable Super-Resolution Unit
Abstract:
In this paper, we address the problem of recognizing degraded images that suffer from heavy blur or low resolution. We propose a novel degradation-aware scene text recognizer with a pluggable super-resolution unit (PlugNet) to recognize low-quality scene text and solve this task at the feature level. The whole network can be trained end-to-end with a pluggable super-resolution unit (PSU), and the PSU is removed after training so that it brings no extra computation. The PSU aims to obtain a more robust feature representation for recognizing low-quality text images. Moreover, to further improve the feature quality, we introduce two types of feature enhancement strategies: the Feature Squeeze Module (FSM), which aims to reduce the loss of spatial acuity, and the Feature Enhance Module (FEM), which combines feature maps from low to high levels to provide diverse semantics. As a consequence, PlugNet achieves state-of-the-art performance on various widely used text recognition benchmarks such as IIIT5K, SVT, SVTP, and ICDAR15."



Paperid:653
Authors:Ruizheng Wu, Huaijia Lin, Xiaojuan Qi, Jiaya Jia
Title: Memory Selection Network for Video Propagation
Abstract:
Video propagation is a fundamental problem in video processing where guidance frame predictions are propagated to guide predictions of the target frame. Previous research mainly treats the previous adjacent frame as guidance, which, however, could make the propagation vulnerable to occlusion, large motion, and inaccurate information in the previous adjacent frame. To tackle this challenge, we propose a memory selection network, which learns to select suitable guidance from all previous frames for effective and robust propagation. Experimental results on video object segmentation and video colorization tasks show that our method consistently improves performance and can robustly handle challenging scenarios in video propagation."



Paperid:654
Authors:Minghao Yin, Zhuliang Yao, Yue Cao, Xiu Li, Zheng Zhang, Stephen Lin, Han Hu
Title: Disentangled Non-local Neural Networks
Abstract:
The non-local block is a popular module for strengthening the context modeling ability of a regular convolutional neural network. This paper first studies the non-local block in depth, where we find that its attention computation can be split into two terms, a whitened pairwise term accounting for the relationship between two pixels and a unary term representing the saliency of every pixel. We also observe that the two terms trained alone tend to model different visual clues, e.g. the whitened pairwise term learns within-region relationships while the unary term learns salient boundaries. However, the two terms are tightly coupled in the non-local block, which hinders the learning of each. Based on these findings, we present the disentangled non-local block, where the two terms are decoupled to facilitate learning for both terms. We demonstrate the effectiveness of the decoupled design on various tasks, including semantic segmentation, object detection and action recognition. The code will be made publicly available."
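For readers who want the decomposition made concrete, the sketch below computes the whitened pairwise logits and the unary (per-pixel saliency) logits for a set of pixel embeddings; shapes and names are illustrative, and this is not the released Disentangled Non-Local implementation.

```python
# Hedged sketch: splitting non-local attention logits into a whitened pairwise term and a unary term.
import torch

def split_attention_logits(q, k):
    """q, k: (N, d) pixel embeddings; returns whitened pairwise (N, N) and unary (N,) logits."""
    mu_q = q.mean(dim=0, keepdim=True)
    mu_k = k.mean(dim=0, keepdim=True)
    pairwise = (q - mu_q) @ (k - mu_k).t()   # relationship between pixel pairs (whitened term)
    unary = (mu_q @ k.t()).squeeze(0)        # per-pixel saliency, independent of the query pixel
    return pairwise, unary
```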



Paperid:655
Authors:Seonguk Seo, Joon-Young Lee, Bohyung Han
Title: URVOS: Unified Referring Video Object Segmentation Network with a Large-Scale Benchmark
Abstract:
We propose a unified referring video object segmentation network (URVOS). URVOS takes a video and a referring expression as inputs, and estimates the {object masks} referred by the given language expression in the whole video frames. Our algorithm addresses the challenging problem by performing language-based object segmentation and mask propagation jointly using a single deep neural network with a proper combination of two attention models. In addition, we construct the first large-scale referring video object segmentation dataset called Refer-Youtube-VOS. We evaluate our model on two benchmark datasets including ours and demonstrate the effectiveness of the proposed approach. The dataset is released at \url{https://github.com/skynbe/Refer-Youtube-VOS}."



Paperid:656
Authors:Chuanchen Luo, Chunfeng Song, Zhaoxiang Zhang
Title: Generalizing Person Re-Identification by Camera-Aware Invariance Learning and Cross-Domain Mixup
Abstract:
Despite the impressive performance under the single-domain setup, current fully-supervised models for person re-identification (re-ID) degrade significantly when deployed to an unseen domain. According to the characteristics of cross-domain re-ID, such degradation is mainly attributed to the dramatic variation within the target domain and the severe shift between the source and target domain. To achieve a model that generalizes well to the target domain, it is desirable to take both issues into account. In terms of the former issue, one of the most successful solutions is to enforce consistency between nearest-neighbors in the embedding space. However, we find that the search of neighbors is highly biased due to the discrepancy across cameras. To this end, we improve the vanilla neighborhood invariance approach by imposing the constraint in a camera-aware manner. As for the latter issue, we propose a novel cross-domain mixup scheme. It alleviates the abrupt transfer by introducing the interpolation between the two domains as a transition state. Extensive experiments on three public benchmarks demonstrate the superiority of our method. Without any auxiliary data or models, it outperforms existing state-of-the-art methods by a large margin. The code is available at https://github.com/LuckyDC/generalizing-reid."
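A minimal sketch of the cross-domain interpolation idea, using a standard mixup coefficient drawn from a Beta distribution; the hyperparameter value is an illustrative assumption, and how the mixed samples are labeled is omitted for brevity.

```python
# Hedged sketch: interpolate source and target batches as a transition state between domains.
import torch

def cross_domain_mixup(x_src, x_tgt, alpha=0.5):
    """x_src, x_tgt: batches of the same shape drawn from the source and target domains."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # mixing coefficient in (0, 1)
    x_mix = lam * x_src + (1.0 - lam) * x_tgt
    return x_mix, lam
```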



Paperid:657
Authors:Yan Liu, Lingqiao Liu, Peng Wang, Pingping Zhang, Yinjie Lei
Title: Semi-Supervised Crowd Counting via Self-Training on Surrogate Tasks
Abstract:
Most existing crowd counting systems rely on the availability of object location annotations, which can be expensive to obtain. To reduce the annotation cost, one attractive solution is to leverage a large number of unlabeled images to build a crowd counting model in a semi-supervised fashion. This paper tackles the semi-supervised crowd counting problem from the perspective of feature learning. Our key idea is to leverage the unlabeled images to train a generic feature extractor rather than the entire network of a crowd counter. The rationale of this design is that learning the feature extractor can be more reliable and robust to the inevitable noisy supervision generated from the unlabeled data. Also, on top of a good feature extractor, it is possible to build a density map regressor with far fewer density map annotations. Specifically, we propose a novel semi-supervised crowd counting method built upon two innovative components: (1) a set of inter-related binary segmentation tasks is derived from the original density map regression task as the surrogate prediction target; (2) the surrogate target predictors are learned from both labeled and unlabeled data via a proposed self-training scheme that fully exploits the underlying constraints of these binary segmentation tasks. Through experiments, we show that the proposed method is superior to the existing semi-supervised crowd counting method and other representative baselines."
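One way to picture the surrogate binary segmentation tasks is to threshold a density map at several levels, as sketched below; the threshold values are illustrative assumptions, not the paper's exact construction.

```python
# Hedged sketch: derive a stack of inter-related binary targets from a crowd density map.
import numpy as np

def density_to_binary_targets(density_map, thresholds=(0.01, 0.05, 0.1)):
    """density_map: (H, W) crowd density; returns (T, H, W) binary masks, one per threshold."""
    return np.stack([(density_map > t).astype(np.float32) for t in thresholds], axis=0)
```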



Paperid:658
Authors:Hongkai Zhang, Hong Chang, Bingpeng Ma, Naiyan Wang, Xilin Chen
Title: Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training
Abstract:
Although two-stage object detectors have continuously advanced the state-of-the-art performance in recent years, their training process is far from crystal clear. In this work, we first point out the inconsistency between the fixed network settings and the dynamic training procedure, which greatly affects performance. For example, the fixed label assignment strategy and regression loss function cannot fit the changing distribution of proposals and are thus harmful to training high-quality detectors. Consequently, we propose Dynamic R-CNN to automatically adjust the label assignment criterion (IoU threshold) and the shape of the regression loss function (parameters of SmoothL1 Loss) based on the statistics of proposals during training. This dynamic design makes better use of the training samples and pushes the detector to fit more high-quality samples. Specifically, our method improves upon the ResNet-50-FPN baseline by 1.9% AP and 5.5% AP$_{90}$ on the MS COCO dataset with no extra overhead. Codes and models are available at https://github.com/hkzhang95/DynamicRCNN."
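A hedged sketch of the dynamic hyperparameter update: the IoU threshold and the SmoothL1 beta are recomputed from recent proposal statistics, so both track the improving quality of proposals during training. The percentile choices below are placeholders, not the paper's exact settings.

```python
# Hedged sketch: update label-assignment IoU threshold and SmoothL1 beta from proposal statistics.
import numpy as np

def update_dynamic_hyperparams(proposal_ious, regression_errors, iou_percentile=75, beta_percentile=10):
    """proposal_ious / regression_errors: arrays collected over recent training iterations."""
    iou_threshold = float(np.percentile(proposal_ious, iou_percentile))        # rises as proposals improve
    smoothl1_beta = float(np.percentile(np.abs(regression_errors), beta_percentile))  # shrinks as errors drop
    return iou_threshold, smoothl1_beta
```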



Paperid:659
Authors:Weilun Chen, Zhaoxiang Zhang, Xiaolin Hu, Baoyuan Wu
Title: Boosting Decision-based Black-box Adversarial Attacks with Random Sign Flip
Abstract:
Decision-based black-box adversarial attacks (decision-based attacks) pose a severe threat to current deep neural networks, as they only need the predicted label of the target model to craft adversarial examples. However, existing decision-based attacks perform poorly in the $ l_\infty $ setting, and the enormous number of required queries casts a shadow over their practicality. In this paper, we show that just randomly flipping the signs of a small number of entries in adversarial perturbations can significantly boost the attack performance. We name this simple and highly efficient decision-based $ l_\infty $ attack the Sign Flip Attack. Extensive experiments on CIFAR-10 and ImageNet show that the proposed method outperforms existing decision-based attacks by large margins and can serve as a strong baseline to evaluate the robustness of defensive models. We further demonstrate the applicability of the proposed method on real-world systems."
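The core random sign-flip step is simple enough to sketch directly; the flip probability below is an illustrative assumption, and the full attack would additionally check, via queries to the target model, whether the flipped perturbation still fools it.

```python
# Hedged sketch: randomly flip the signs of a small fraction of perturbation entries.
import torch

def random_sign_flip(delta, flip_prob=0.01):
    """delta: current l_inf adversarial perturbation; flip_prob: fraction of entries to flip."""
    mask = (torch.rand_like(delta) < flip_prob).float()
    return delta * (1.0 - 2.0 * mask)   # selected entries change sign, the rest are untouched
```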



Paperid:660
Authors:Anbang Yao, Dawei Sun
Title: Knowledge Transfer via Dense Cross-Layer Mutual-Distillation
Abstract:
Knowledge Distillation (KD) based methods adopt the one-way Knowledge Transfer (KT) scheme in which training a lower-capacity student network is guided by a pre-trained high-capacity teacher network. Recently, Deep Mutual Learning (DML) presented a two-way KT strategy, showing that the student network can also be helpful to improve the teacher network. In this paper, we propose Dense Cross-layer Mutual-distillation (DCM), an improved two-way KT method in which the teacher and student networks are trained collaboratively from scratch. To augment knowledge representation learning, well-designed auxiliary classifiers are added to certain hidden layers of both teacher and student networks. To boost KT performance, we introduce dense bidirectional KD operations between the layers appended with classifiers. After training, all auxiliary classifiers are discarded, and thus there are no extra parameters introduced to final models. We test our method on a variety of KT tasks, showing its superiority over related methods. Code is available at https://github.com/sundw2014/DCM"



Paperid:661
Authors:Kaiyu Yue, Jiangfan Deng, Feng Zhou
Title: Matching Guided Distillation
Abstract:
Feature distillation is an effective way to improve the performance for a smaller student model, which has fewer parameters and lower computation cost compared to the larger teacher model. Unfortunately, there is a common obstacle $-$ the gap in semantic feature structure between the intermediate features of teacher and student. The classic scheme prefers to transform intermediate features by adding an adaptation module, such as a naive convolutional, attention-based or more complicated one. However, this introduces two problems: a) The adaptation module brings more parameters into training. b) The adaptation module with random initialization or special transformation isn't friendly for distilling a pre-trained student. In this paper, we present Matching Guided Distillation (MGD) as an efficient and parameter-free manner to solve these problems. The key idea of MGD is to pose matching the teacher channels with students' as an assignment problem. We compare three solutions of the assignment problem to reduce channels from teacher features with partial distillation loss. The overall training takes a coordinate-descent approach between two optimization objects $-$ assignments update and parameters update. Since MGD only contains normalization or pooling operations with negligible computation cost, it is flexible to plug into networks together with other distillation methods. The project site is http://kaiyuyue.com/mgd ."
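To make the assignment view concrete, the sketch below matches student channels to teacher channels with the Hungarian algorithm over a correlation-based cost; the cost function is an assumption for illustration, and the paper itself compares several solutions to this assignment problem.

```python
# Hedged sketch: teacher-to-student channel matching posed as an assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_channels(teacher_feats, student_feats):
    """teacher_feats: (Ct, N), student_feats: (Cs, N) flattened feature maps, with Ct >= Cs."""
    t = teacher_feats - teacher_feats.mean(axis=1, keepdims=True)
    s = student_feats - student_feats.mean(axis=1, keepdims=True)
    corr = (s @ t.T) / (np.linalg.norm(s, axis=1, keepdims=True) * np.linalg.norm(t, axis=1) + 1e-8)
    cost = -np.abs(corr)                       # (Cs, Ct): high correlation => low assignment cost
    rows, cols = linear_sum_assignment(cost)   # student channel rows[i] is paired with teacher channel cols[i]
    return cols
```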



Paperid:662
Authors:Yunpeng Chang, Zhigang Tu, Wei Xie, Junsong Yuan
Title: Clustering Driven Deep Autoencoder for Video Anomaly Detection
Abstract:
Because of the ambiguous definition of anomaly and the complexity of real data, anomaly detection in videos is one of the most challenging problems in intelligent video surveillance. Since abnormal events usually differ from normal events in appearance and/or motion behavior, we address this issue by designing a novel convolutional autoencoder architecture to separately capture spatial and temporal informative representations. The spatial part reconstructs the last individual frame (LIF), and the temporal part generates the RGB difference between the rest of the video frames and the LIF, where the efficiently obtained RGB difference cue helps learn useful motion features. The two sub-modules independently learn the regularity from the appearance and motion feature spaces; abnormal events that are irregular in appearance or motion behavior lead to a large reconstruction error. Besides, we design a deep k-means cluster constraint to force both the appearance encoder and the motion encoder to extract common factors of variation within the dataset by penalizing the distance of each data representation to the cluster centers. Experiments on publicly available datasets demonstrate the effectiveness of our method, which detects abnormal events in videos with competitive performance."



Paperid:663
Authors:Juhong Min, Jongmin Lee, Jean Ponce, Minsu Cho
Title: Learning to Compose Hypercolumns for Visual Correspondence
Abstract:
Feature representation plays a crucial role in visual correspondence, and recent methods for image matching resort to deeply stacked convolutional layers. These models, however, are both monolithic and static in the sense that they typically use a specific level of features, e.g., the output of the last layer, and adhere to it regardless of the images to match. In this work, we introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match. Inspired by both multi-layer feature composition in object detection and adaptive inference architectures in classification, the proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network. We demonstrate the effectiveness on the task of semantic correspondence, i.e., establishing correspondences between images depicting different instances of the same object or scene category. Experiments on standard benchmarks show that the proposed method greatly improves matching performance over the state of the art in an adaptive and efficient manner."



Paperid:664
Authors:Lei Zhou, Zixin Luo, Mingmin Zhen, Tianwei Shen, Shiwei Li, Zhuofei Huang, Tian Fang, Long Quan
Title: Stochastic Bundle Adjustment for Efficient and Scalable 3D Reconstruction
Abstract:
Current bundle adjustment solvers such as the Levenberg-Marquardt (LM) algorithm are limited by the bottleneck in solving the Reduced Camera System (RCS) whose dimension is proportional to the camera number. When the problem is scaled up, this step is neither efficient in computation nor manageable for a single compute node. In this work, we propose a stochastic bundle adjustment algorithm which seeks to decompose the RCS approximately inside the LM iterations to improve the efficiency and scalability. It first reformulates the quadratic programming problem of an LM iteration based on the clustering of the visibility graph by introducing the equality constraints across clusters. Then, we propose to relax it into a chance constrained problem and solve it through sampled convex program. The relaxation is intended to eliminate the interdependence between clusters embodied by the constraints, so that a large RCS can be decomposed into independent linear sub-problems. Numerical experiments on unordered Internet image sets and sequential SLAM image sets, as well as distributed experiments on large-scale datasets, have demonstrated the high efficiency and scalability of the proposed approach."



Paperid:665
Authors:Xin Wei, Guojun Chen, Yue Dong, Stephen Lin, Xin Tong
Title: Object-based Illumination Estimation with Rendering-aware Neural Networks
Abstract:
We present a scheme for fast environment light estimation from the RGBD appearance of individual objects and their local image areas. Conventional inverse rendering is too computationally demanding for real-time applications, and the performance of purely learning-based techniques may be limited by the meager input data available from individual objects. To address these issues, we propose an approach that takes advantage of physical principles from inverse rendering to constrain the solution, while also utilizing neural networks to expedite the more computationally expensive portions of its processing, to increase robustness to noisy input data as well as to improve temporal and spatial stability. This results in a rendering-aware system that estimates the local illumination distribution at an object with high accuracy and in real time. With the estimated lighting, virtual objects can be rendered in AR scenarios with shading that is consistent to the real scene, leading to improved realism."



Paperid:666
Authors:Le Hui, Rui Xu, Jin Xie, Jianjun Qian, Jian Yang
Title: Progressive Point Cloud Deconvolution Generation Network
Abstract:
In this paper, we propose an effective point cloud generation method, which can generate multi-resolution point clouds of the same shape from a latent vector. Specifically, we develop a novel progressive deconvolution network with the learning-based bilateral interpolation. The learning-based bilateral interpolation is performed in the spatial and feature spaces of point clouds so that local geometric structure information of point clouds can be exploited. Starting from the low-resolution point clouds, with the bilateral interpolation and max-pooling operations, the deconvolution network can progressively output high-resolution local and global feature maps. By concatenating different resolutions of local and global feature maps, we employ the multi-layer perceptron as the generation network to generate multi-resolution point clouds. In order to keep the shapes of different resolutions of point clouds consistent, we propose a shape-preserving adversarial loss to train the point cloud deconvolution generation network. Experimental results on ShapeNet and ModelNet datasets demonstrate that our proposed method can yield good performance. Our code is available at https://github.com/fpthink/PDGN."



Paperid:667
Authors:Wenqing Chu, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Rongrong Ji
Title: SSCGAN: Facial Attribute Editing via Style Skip Connections
Abstract:
Existing facial attribute editing methods typically employ an encoder-decoder architecture where the attribute information is expressed as a conditional one-hot vector spatially concatenated with the image or intermediate feature maps. However, such operations only learn the local semantic mapping but ignore global facial statistics. In this work, we focus on solving this issue by editing the channel-wise global information denoted as the style feature. We develop a style skip connection based generative adversarial network, referred to as SSCGAN which enables accurate facial attribute manipulation. Specifically, we inject the target attribute information into multiple style skip connection paths between the encoder and decoder. Each connection extracts the style feature of the latent feature maps in the encoder and then performs a residual learning based mapping function in the global information space guided by the target attributes. In the following, the adjusted style feature will be utilized as the conditional information for instance normalization to transform the corresponding latent feature maps in the decoder. In addition, to avoid the vanishing of spatial details (\textit{e.g.} hairstyle or pupil locations), we further introduce the skip connection based spatial information transfer module. Through the global-wise style and local-wise spatial information manipulation, the proposed method can produce better results in terms of attribute generation accuracy and image quality. Experimental results demonstrate the proposed algorithm performs favorably against the state-of-the-art methods."



Paperid:668
Authors:Hiroki Tokunaga, Brian Kenji Iwana, Yuki Teramoto, Akihiko Yoshizawa , Ryoma Bise
Title: Negative Pseudo Labeling using Class Proportion for Semantic Segmentation in Pathology
Abstract:
In pathological diagnosis, since the proportion of the adenocarcinoma subtypes is related to the recurrence rate and the survival time after surgery, the proportion of cancer subtypes for pathological images has been recorded as diagnostic information in some hospitals. In this paper, we propose a subtype segmentation method that uses such proportional labels as weakly supervised labels. If the estimated class rate is higher than the annotated class rate, we generate negative pseudo labels, which indicate that ``the input image does not belong to this negative label,'' in addition to standard pseudo labels. This forces out low-confidence samples and mitigates the problem of positive pseudo-label learning, which cannot assign labels to low-confidence unlabeled samples. Our method outperformed state-of-the-art semi-supervised learning (SSL) methods."
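A rough sketch of the negative pseudo-labeling rule: classes whose estimated proportion exceeds the recorded proportion contribute negative labels for their low-confidence predictions. The confidence threshold is an illustrative assumption, not the paper's exact value.

```python
# Hedged sketch: negative pseudo labels from class-proportion labels.
import numpy as np

def negative_pseudo_labels(pred_probs, annotated_ratio, conf_threshold=0.5):
    """pred_probs: (N, K) patch-wise class probabilities; annotated_ratio: (K,) recorded proportions."""
    pred_labels = pred_probs.argmax(axis=1)
    estimated_ratio = np.bincount(pred_labels, minlength=pred_probs.shape[1]) / len(pred_labels)
    over_assigned = estimated_ratio > annotated_ratio          # classes predicted more often than recorded
    # mark low-confidence patches of over-assigned classes as "does not belong to this class"
    negative_mask = over_assigned[pred_labels] & (pred_probs.max(axis=1) < conf_threshold)
    return negative_mask, pred_labels
```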



Paperid:669
Authors:Lei Yang, Qingqiu Huang, Huaiyi Huang, Linning Xu, Dahua Lin
Title: Learn to Propagate Reliably on Noisy Affinity Graphs
Abstract:
Recent works have shown that exploiting unlabeled data through label propagation can substantially reduce the labeling cost, which has been a critical issue in developing visual recognition models. Yet, how to propagate labels reliably, especially on a dataset with unknown outliers, remains an open question. Conventional methods such as linear diffusion lack the capability of handling complex graph structures and may perform poorly when the seeds are sparse. The latest methods based on graph neural networks suffer performance drops as they scale out to noisy graphs. To overcome these difficulties, we propose a new framework that allows labels to be propagated reliably on large-scale real-world data. This framework incorporates (1) a local graph neural network to predict accurately on varying local structures while maintaining high scalability, and (2) a confidence-based path scheduler that identifies outliers and moves the propagation frontier forward in a prudent way. Experiments on both ImageNet and Ms-Celeb-1M show that our confidence-guided framework can significantly improve the overall accuracy of the propagated labels, especially when the graph is very noisy."



Paperid:670
Authors:Xiangxiang Chu, Tianbao Zhou, Bo Zhang, Jixiang Li
Title: Fair DARTS: Eliminating Unfair Advantages in Differentiable Architecture Search
Abstract:
Differentiable Architecture Search (DARTS) is now a widely disseminated weight-sharing neural architecture search method. However, it suffers from well-known performance collapse due to an inevitable aggregation of skip connections. In this paper, we first disclose that its root cause lies in an unfair advantage in exclusive competition. Through experiments, we show that if either of two conditions is broken, the collapse disappears. Thereby, we present a novel approach called Fair DARTS where the exclusive competition is relaxed to be collaborative. Specifically, we let each operation's architectural weight be independent of others. Yet there is still an important issue of discretization discrepancy. We then propose a zero-one loss to push architectural weights towards zero or one, which approximates an expected multi-hot solution. Our experiments are performed on two mainstream search spaces, and we derive new state-of-the-art results on CIFAR-10 and ImageNet."
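The zero-one loss described above can be sketched in a few lines: with each operation's weight an independent sigmoid, the regularizer is minimized when the sigmoids saturate at 0 or 1. The scaling is illustrative and may differ from the released code.

```python
# Hedged sketch: zero-one regularizer pushing sigmoid architecture weights toward 0 or 1.
import torch

def zero_one_loss(arch_logits):
    """arch_logits: raw architecture parameters; each operation's weight is an independent sigmoid."""
    sig = torch.sigmoid(arch_logits)
    return -torch.mean((sig - 0.5) ** 2)   # minimized when every sigmoid saturates at 0 or 1
```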



Paperid:671
Authors:Guodong Wei, Zhiming Cui, Yumeng Liu, Nenglun Chen, Runnan Chen, Guiqing Li, Wenping Wang
Title: TANet: Towards Fully Automatic Tooth Arrangement
Abstract:
Determining optimal target tooth arrangements is a key step of treatment planning in digital orthodontics. Existing practice for specifying the target tooth arrangement involves tedious manual operations, with the outcome quality depending heavily on the experience of individual specialists, leading to inefficiency and undesirable variations in treatment results. In this work, we propose a learning-based method for fast and automatic tooth arrangement. To achieve this, we formulate the tooth arrangement task as a novel structured 6-DOF pose prediction problem and solve it by proposing a new neural network architecture that learns from a large set of clinical data encoding successful orthodontic treatment cases. Our method has been validated with extensive experiments and shows promising results both qualitatively and quantitatively."



Paperid:672
Authors:Bumsoo Kim, Taeho Choi, Jaewoo Kang, Hyunwoo J. Kim
Title: UnionDet: Union-Level Detector Towards Real-Time Human-Object Interaction Detection
Abstract:
Recent advances in deep neural networks have achieved significant progress in detecting individual objects from an image. However, object detection is not sufficient to fully understand a visual scene. Towards a deeper visual understanding, the interactions between objects, especially humans and objects are essential. Most prior works have obtained this information with a bottom-up approach, where the objects are first detected and the interactions are predicted sequentially by pairing the objects. This is a major bottleneck in HOI detection inference time. To tackle this problem, we propose UnionDet, a one-stage meta-architecture for HOI detection powered by a novel union-level detector that eliminates this additional inference stage by directly capturing the region of interaction. Our first, fastest and best performing one-stage detector for human-object interaction shows a significant reduction in interaction prediction time ($4\times \sim 14\times$) while outperforming state-of-the-art methods on two public datasets: V-COCO and HICO-DET."



Paperid:673
Authors:Lei Ke, Shichao Li, Yanan Sun, Yu-Wing Tai, Chi-Keung Tang
Title: GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-aware Supervision
Abstract:
We present a novel end-to-end framework named GSNet (\textbf{\underline{G}}eometric and \textbf{\underline{S}}cene-aware \underline{\textbf{Net}}work), which jointly estimates 6DoF poses and reconstructs detailed 3D car shapes from a single urban street view. GSNet utilizes a unique four-way feature extraction and fusion scheme and directly regresses 6DoF poses and shapes in a single forward pass. Extensive experiments show that our diverse feature extraction and fusion scheme can greatly improve model performance. Based on a divide-and-conquer 3D shape representation strategy, GSNet reconstructs 3D vehicle shape with great detail (1352 vertices and 2700 faces). This dense mesh representation further leads us to consider geometrical consistency and scene context, and inspires a new multi-objective loss function to regularize network training, which in turn improves the accuracy of 6D pose estimation and validates the merit of jointly performing both tasks. We evaluate GSNet on the largest multi-task ApolloCar3D benchmark and achieve state-of-the-art performance both quantitatively and qualitatively."



Paperid:674
Authors:Yikai Wang, Fuchun Sun, Duo Li, Anbang Yao
Title: Resolution Switchable Networks for Runtime Efficient Image Recognition
Abstract:
We propose a general method to train a single convolutional neural network which is capable of switching image resolutions at inference. Thus the running speed can be selected to meet various computational resource limits. Networks trained with the proposed method are named Resolution Switchable Networks (RS-Nets). The basic training framework shares network parameters for handling images which differ in resolution, yet keeps separate batch normalization layers. Though it is parameter-efficient in design, it leads to inconsistent accuracy variations at different resolutions, for which we provide a detailed analysis from the aspect of the train-test recognition discrepancy. A multi-resolution ensemble distillation is further designed, where a teacher is learnt on the fly as a weighted ensemble over resolutions. Thanks to the ensemble and knowledge distillation, RS-Nets enjoy accuracy improvements at a wide range of resolutions compared with individually trained models. Extensive experiments on the ImageNet dataset are provided, and we additionally consider quantization problems. Code and models are available at https://github.com/yikaiw/RS-Nets"
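A minimal sketch of the parameter-sharing scheme: convolution weights are shared across resolutions while each resolution keeps its own BatchNorm statistics. The module below is illustrative, not the released RS-Nets code.

```python
# Hedged sketch: a conv layer shared across resolutions with resolution-specific BatchNorm.
import torch
import torch.nn as nn

class SharedConvSwitchableBN(nn.Module):
    def __init__(self, in_ch, out_ch, num_resolutions):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)   # shared across resolutions
        self.bns = nn.ModuleList([nn.BatchNorm2d(out_ch) for _ in range(num_resolutions)])

    def forward(self, x, res_idx):
        # pick the BatchNorm whose statistics match the current input resolution
        return torch.relu(self.bns[res_idx](self.conv(x)))
```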



Paperid:675
Authors:Jianan Zhen, Qi Fang, Jiaming Sun, Wentao Liu, Wei Jiang, Hujun Bao , Xiaowei Zhou
Title: SMAP: Single-Shot Multi-Person Absolute 3D Pose Estimation
Abstract:
Recovering multi-person 3D poses with absolute scales from a single RGB image is a challenging problem due to the inherent depth and scale ambiguity from a single view. Addressing this ambiguity requires to aggregate various cues over the entire image, such as body sizes, scene layouts, and inter-person relationships. However, most previous methods adopt a top-down scheme that first performs 2D pose detection and then regresses the 3D pose and scale for each detected person individually, ignoring global contextual cues. In this paper, we propose a novel system that first regresses a set of 2.5D representations of body parts and then reconstructs the 3D absolute poses based on these 2.5D representations with a depth-aware part association algorithm. Such a single-shot bottom-up scheme allows the system to better learn and reason about the inter-person depth relationship, improving both 3D and 2D pose estimation. The experiments demonstrate that the proposed approach achieves the state-of-the-art performance on the CMU Panoptic and MuPoTS-3D datasets and is applicable to in-the-wild videos."



Paperid:676
Authors:Bo Fu, Zhangjie Cao, Mingsheng Long, Jianmin Wang
Title: Learning to Detect Open Classes for Universal Domain Adaptation
Abstract:
Universal domain adaptation (UDA) transfers knowledge between domains without any constraint on the label sets, extending the applicability of domain adaptation in the wild. In UDA, both the source and target label sets may hold individual labels not shared by the other domain. A de facto challenge of UDA is to classify the target examples in the shared classes against the domain shift. Another more prominent challenge of UDA is to mark the target examples in the target-individual label set (open classes) as ``unknown''. These two entangled challenges make UDA a highly under-explored problem. Previous work on UDA focuses on the classification of data in the shared classes and uses per-class accuracy as the evaluation metric, which is badly biased toward the accuracy of the shared classes. However, accurately detecting open classes is the mission-critical task to enable real universal domain adaptation. It further turns the UDA problem into a well-established closed-set domain adaptation problem. Towards accurate open class detection, we propose Calibrated Multiple Uncertainties (CMU) with a novel transferability measure estimated by a mixture of complementary uncertainty quantities: entropy, confidence, and consistency, defined on conditional probabilities calibrated by a multi-classifier ensemble model. The new transferability measure accurately quantifies the inclination of a target example toward the open classes. We also propose a novel evaluation metric called H-score, which emphasizes the importance of both accuracies of the shared classes and the ``unknown'' class. Empirical results under the UDA setting show that CMU outperforms the state-of-the-art domain adaptation methods on all the evaluation metrics, especially by a large margin on the H-score."
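A rough sketch of the two quantities named in the abstract, under our own reading: the H-score is taken here as the harmonic mean of shared-class and unknown-class accuracy, and the transferability score as an equal-weight mixture of entropy, confidence, and ensemble consistency. Both formulas are illustrative assumptions; consult the paper for the exact definitions.

```python
import numpy as np

def h_score(acc_shared, acc_unknown):
    """H-score sketched as the harmonic mean of the accuracy on shared
    classes and on the 'unknown' class (our reading of the abstract)."""
    if acc_shared + acc_unknown == 0:
        return 0.0
    return 2 * acc_shared * acc_unknown / (acc_shared + acc_unknown)

def transferability(probs_ensemble):
    """Illustrative mixture of the three uncertainty cues (entropy, confidence,
    consistency) from an ensemble of calibrated probability vectors of shape
    (n_classifiers, n_classes). The equal-weight average is our simplification."""
    mean_p = probs_ensemble.mean(axis=0)
    entropy = -(mean_p * np.log(mean_p + 1e-12)).sum()
    confidence = mean_p.max()
    consistency = 1.0 - probs_ensemble.std(axis=0).mean()
    # low entropy, high confidence and high consistency -> more 'shared-class-like'
    return (1.0 - entropy / np.log(len(mean_p))) / 3 + confidence / 3 + consistency / 3

print(h_score(0.8, 0.6))                                   # 0.6857...
print(transferability(np.full((3, 10), 0.1)))              # maximally uncertain example
```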



Paperid:677
Authors:Zhi Hou, Xiaojiang Peng, Yu Qiao, Dacheng Tao
Title: Visual Compositional Learning for Human-Object Interaction Detection
Abstract:
Human-Object interaction (HOI) detection aims to localize and infer relationships between humans and objects in an image. It is challenging because the enormous number of possible combinations of object and verb types forms a long-tail distribution. We devise a deep Visual Compositional Learning (VCL) framework, a simple yet efficient approach to effectively address this problem. VCL first decomposes an HOI representation into object and verb specific features, and then composes new interaction samples in the feature space via stitching the decomposed features. The integration of decomposition and composition enables VCL to share object and verb features among different HOI samples and images, and to generate new interaction samples and new types of HOI, and thus largely alleviates the long-tail distribution problem and benefits low-shot or zero-shot HOI detection. Extensive experiments demonstrate that the proposed VCL can effectively improve the generalization of HOI detection on HICO-DET and V-COCO and outperforms the recent state-of-the-art methods on HICO-DET."
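The composition step can be pictured as cross-pairing decomposed features. The snippet below is our simplified illustration of that idea (plain concatenation of all verb/object pairs), not the VCL implementation.

```python
import torch

def compose_hoi_features(verb_feats, obj_feats):
    """Stitch verb-specific and object-specific features from different HOI
    samples to create new interaction samples in feature space. Concatenating
    every cross pair is our simplification of the composition described above."""
    n_v, d_v = verb_feats.shape
    n_o, d_o = obj_feats.shape
    v = verb_feats.unsqueeze(1).expand(n_v, n_o, d_v)
    o = obj_feats.unsqueeze(0).expand(n_v, n_o, d_o)
    return torch.cat([v, o], dim=-1).reshape(n_v * n_o, d_v + d_o)

new_samples = compose_hoi_features(torch.randn(4, 128), torch.randn(6, 128))  # (24, 256)
```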



Paperid:678
Authors:Shuai Yang, Zhangyang Wang, Jiaying Liu, Zongming Guo
Title: Deep Plastic Surgery: Robust and Controllable Image Editing with Human-Drawn Sketches
Abstract:
Sketch-based image editing aims to synthesize and modify photos based on the structural information provided by human-drawn sketches. Since sketches are difficult to collect, previous methods mainly use edge maps instead of sketches to train models (referred to as edge-based models). However, human-drawn sketches display great structural discrepancy with edge maps, thus failing edge-based models. Moreover, sketches often demonstrate huge variety among different users, demanding even higher generalizability and robustness for the editing model to work. In this paper, we propose Deep Plastic Surgery, a novel, robust and controllable image editing framework that allows users to interactively edit images using hand-drawn sketch inputs. We present a sketch refinement strategy, as inspired by the coarse-to-fine drawing process of the artists, which we show can help our model well adapt to casual and varied sketches without the need for real sketch training data. Our model further provides a refinement level control parameter that enables users to flexibly define how ``reliable'' the input sketch should be considered for the final output, balancing between sketch faithfulness and output verisimilitude (as the two goals might contradict if the input sketch is drawn poorly). To achieve the multi-level refinement, we introduce a style-based module for level conditioning, which allows adaptive feature representations for different levels in a single network. Extensive experimental results demonstrate the superiority of our approach in improving the visual quality and user controllability of image editing over the state-of-the-art methods. "



Paperid:679
Authors:Wonho Bae, Junhyug Noh, Gunhee Kim
Title: Rethinking Class Activation Mapping for Weakly Supervised Object Localization
Abstract:
Weakly supervised object localization (WSOL) is a task of localizing an object in an image only using image-level labels. To tackle the WSOL problem, most previous studies have followed the conventional class activation mapping (CAM) pipeline: (i) training CNNs for a classification objective, (ii) generating a class activation map via global average pooling (GAP) on feature maps, and (iii) extracting bounding boxes by thresholding based on the maximum value of the class activation map. In this work, we reveal that the current CAM approach suffers from three fundamental issues: (i) the bias of GAP that assigns a higher weight to a channel with a small activation area, (ii) negatively weighted activations inside the object regions and (iii) instability from the use of the maximum value of a class activation map as a thresholding reference. Collectively, they cause the localization to be highly limited to small regions of an object. We propose three simple but robust techniques that alleviate the problems, including thresholded average pooling, negative weight clamping, and percentile as a standard for thresholding. Our solutions are universally applicable to any WSOL methods using CAM and improve their performance drastically. As a result, we achieve the new state-of-the-art performance on three benchmark datasets of CUB-200-2011, ImageNet-1K, and OpenImages30K."
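The three fixes are simple enough to sketch directly. The following NumPy snippet is our own toy rendition of thresholded average pooling, negative weight clamping, and percentile-based thresholding on a single class activation map; thresholds and percentiles are placeholders, not the paper's values.

```python
import numpy as np

def localize_from_cam(cam, fc_weight, percentile=70, act_thresh=0.1):
    """cam: (H, W, C) feature maps; fc_weight: (C,) classifier weights.
    Returns an illustrative class score and a binary localization mask."""
    # (i) thresholded average pooling: average only over activated positions
    mask = cam > act_thresh
    denom = np.maximum(mask.sum(axis=(0, 1)), 1)
    pooled = (cam * mask).sum(axis=(0, 1)) / denom          # replaces plain GAP
    # (ii) negative weight clamping: drop channels with negative weights
    w = np.clip(fc_weight, 0, None)
    score = float(pooled @ w)
    class_map = (cam * w).sum(axis=2)
    # (iii) percentile thresholding instead of max-based thresholding
    thresh = np.percentile(class_map, percentile)
    return score, class_map >= thresh

score, box_mask = localize_from_cam(np.random.rand(14, 14, 512), np.random.randn(512))
```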



Paperid:680
Authors:Anton Osokin, Denis Sumin, Vasily Lomakin
Title: OS2D: One-Stage One-Shot Object Detection by Matching Anchor Features
Abstract:
In this paper, we consider the task of one-shot object detection, which consists in detecting objects defined by a single demonstration. Differently from the standard object detection, the classes of objects used for training and testing do not overlap. We build the one-stage system that performs localization and recognition jointly. We use dense correlation matching of learned local features to find correspondences, a feed-forward geometric transformation model to align features and bilinear resampling of the correlation tensor to compute the detection score of the aligned features. All the components are differentiable, which allows end-to-end training. Experimental evaluation on several challenging domains (retail products, 3D objects, buildings and logos) shows that our method can detect unseen classes (e.g., toothpaste when trained on groceries) and outperforms several baselines by a significant margin. Our code is available online: https://github.com/aosokin/os2d"



Paperid:681
Authors:Yuchao Li, Rongrong Ji, Shaohui Lin, Baochang Zhang, Chenqian Yan, Yongjian Wu, Feiyue Huang, Ling Shao
Title: Interpretable Neural Network Decoupling
Abstract:
The remarkable performance of convolutional neural networks (CNNs) is entangled with their huge number of uninterpretable parameters, which has become the bottleneck limiting the exploitation of their full potential. Towards network interpretation, previous endeavors mainly resort to single-filter analysis, which however ignores the relationship between filters. In this paper, we propose a novel architecture decoupling method to interpret the network from a perspective of investigating its calculation paths. More specifically, we introduce a novel architecture controlling module in each layer to encode the network architecture by a vector. By maximizing the mutual information between the vectors and input images, the module is trained to select specific filters to distill a unique calculation path for each input. Furthermore, to improve the interpretability and compactness of the decoupled network, the output of each layer is encoded to align the architecture encoding vector with the constraint of sparsity regularization. Unlike conventional pixel-level or filter-level network interpretation methods, we propose a path-level analysis to explore the relationship between filter combinations and semantic concepts, which is more suitable to interpret the working rationale of the network. Extensive experiments show that the decoupled network enables several applications, i.e., network interpretation, network acceleration, and adversarial sample detection."



Paperid:682
Authors:Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, Dahua Lin
Title: Omni-sourced Webly-supervised Learning for Video Recognition
Abstract:
We introduce OmniSource, a novel framework for leveraging web data to train video recognition models. OmniSource overcomes the barriers between data formats, such as images, short videos, and long untrimmed videos for webly-supervised learning. First, data samples with multiple formats, curated by task-specific data collection and automatically filtered by a teacher model, are transformed into a unified form. Then a joint-training strategy is proposed to deal with the domain gaps between multiple data sources and formats in webly-supervised learning. Several good practices, including data balancing, resampling, and cross-dataset mixup are adopted in joint training. Experiments show that by utilizing data from multiple sources and formats, OmniSource is more data-efficient in training. With only 3.5M images and 800K minutes of video crawled from the internet without human labeling (less than 2% of prior works), our models learned with OmniSource improve the Top-1 accuracy of 2D- and 3D-ConvNet baseline models by 3.0% and 3.9%, respectively, on the Kinetics-400 benchmark. With OmniSource, we establish new records with different pretraining strategies for video recognition. Our best models achieve 80.4%, 80.5%, and 83.6% Top-1 accuracies on the Kinetics-400 benchmark respectively for training-from-scratch, ImageNet pre-training and IG-65M pre-training."



Paperid:683
Authors:Hang Xu, Shaoju Wang, Xinyue Cai, Wei Zhang, Xiaodan Liang, Zhenguo Li
Title: CurveLane-NAS: Unifying Lane-Sensitive Architecture Search and Adaptive Point Blending
Abstract:
We address the curve lane detection problem which poses more real-world challenges than conventional lane detection for better facilitating modern assisted/autonomous driving systems. Current hand-designed lane detection methods are not robust enough to capture curve lanes, especially remote ones, due to the lack of modeling of both long-range contextual information and detailed curve trajectories. In this paper, we propose a novel lane-sensitive architecture search framework named CurveLane-NAS to automatically capture both long-range coherent and accurate short-range curve information while unifying both architecture search and post-processing on curve lane predictions via point blending. It consists of three search modules: a) a feature fusion search module to find a better fusion of the local and global context for multi-level hierarchy features; b) an elastic backbone search module to explore an efficient feature extractor with good semantics and latency; c) an adaptive point blending module to search a multi-level post-processing refinement strategy to combine multi-level head predictions. The unified framework ensures lane-sensitive predictions by the mutual guidance between NAS and adaptive point blending. Furthermore, we also steer forward to release a more challenging benchmark named CurveLanes for addressing the most difficult curve lanes. It consists of 150K images with 680K labels. Experiments on the new CurveLanes show that the SOTA lane detection methods suffer a substantial performance drop while our model can still reach an 80+% F1-score. Extensive experiments on traditional lane benchmarks such as CULane also demonstrate the superiority of our CurveLane-NAS, e.g. achieving a new SOTA result of 74.8% F1-score on the CULane dataset."



Paperid:684
Authors:Jiaxing Huang, Shijian Lu, Dayan Guan, Xiaobing Zhang
Title: Contextual-Relation Consistent Domain Adaptation for Semantic Segmentation
Abstract:
Recent advances in unsupervised domain adaptation for semantic segmentation have shown great potential to relieve the demand for expensive per-pixel annotations. However, most existing works address the domain discrepancy by aligning the data distributions of two domains at a global image level whereas the local consistencies are largely neglected. This paper presents an innovative local contextual-relation consistent domain adaptation (CrCDA) technique that aims to achieve local-level consistencies during the global-level alignment. The idea is to take a closer look at region-wise feature representations and align them for local-level consistencies. Specifically, CrCDA learns and enforces the prototypical local contextual-relations explicitly in the feature space of a labelled source domain while transferring them to an unlabelled target domain via backpropagation-based adversarial learning. An adaptive entropy max-min adversarial learning scheme is designed to optimally align these hundreds of local contextual-relations across domains without requiring any discriminator or extra computational overhead. The proposed CrCDA has been evaluated extensively over two challenging domain adaptive segmentation tasks (e.g., GTA5 to Cityscapes and SYNTHIA to Cityscapes), and experiments demonstrate its superior segmentation performance as compared with state-of-the-art methods."



Paperid:685
Authors:Weizhe Liu, Mathieu Salzmann, Pascal Fua
Title: Estimating People Flows to Better Count Them in Crowded Scenes
Abstract:
Modern methods for counting people in crowded scenes rely on deep networks to estimate people densities in individual images. As such, only very few take advantage of temporal consistency in video sequences, and those that do only impose weak smoothness constraints across consecutive frames. In this paper, we advocate estimating people flows across image locations between consecutive images and inferring the people densities from these flows instead of directly regressing them. This enables us to impose much stronger constraints encoding the conservation of the number of people. As a result, it significantly boosts performance without requiring a more complex architecture. Furthermore, it also enables us to exploit the correlation between people flow and optical flow to further improve the results. We demonstrate that we consistently outperform state-of-the-art methods on five benchmark datasets. "
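The people-conservation constraint can be made concrete with a toy example: the density at a location is the sum of the flows arriving there. The sketch below uses a 1-D grid of locations with left/stay/right moves; the grid layout and boundary handling are our simplifications, not the paper's formulation.

```python
import torch

def density_from_flows(flows):
    """flows: (n_locations, 3) tensor of non-negative people counts moving
    left / staying / moving right between two consecutive frames.
    Returns the implied density at each location in the later frame."""
    n = flows.shape[0]
    density = torch.zeros(n)
    density += flows[:, 1]          # people who stay put
    density[1:] += flows[:-1, 2]    # people moving right into location j
    density[:-1] += flows[1:, 0]    # people moving left into location j
    return density

d = density_from_flows(torch.rand(10, 3))   # densities derived from flows, not regressed directly
```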



Paperid:686
Authors:Han Fang, Weihong Deng, Yaoyao Zhong, Jiani Hu
Title: Generate to Adapt: Resolution Adaption Network for Surveillance Face Recognition
Abstract:
Although deep learning techniques have largely improved face recognition, unconstrained surveillance face recognition (FR) is still an unsolved challenge, due to the limited training data and the gap of domain distribution. Previous methods mostly match low-resolution and high-resolution faces in different domains, which tend to deteriorate the original feature space in the common recognition scenarios. To avoid this problem, we propose a novel resolution adaption network (RAN) which contains Multi-Resolution Generative Adversarial Networks (MR-GAN) followed by a feature adaption network. MR-GAN learns multi-resolution representations and randomly selects one resolution to generate realistic low-resolution (LR) faces that can avoid the artifacts of down-sampled faces. A novel feature adaption network with translation gate is developed to fuse the discriminative information of the generated LR faces into backbone network, while preserving the discrimination ability of original face representations. The experimental results on the IJB-C TinyFace, SCface, and QMUL-SurvFace datasets have demonstrated the superiority of our proposed method compared with state-of-the-art surveillance face recognition methods, while showing stable performance on the common recognition scenarios."



Paperid:687
Authors:Linyu Zheng, Ming Tang, Yingying Chen, Jinqiao Wang, Hanqing Lu
Title: Learning Feature Embeddings for Discriminant Model based Tracking
Abstract:
After observing that the features used in most online discriminatively trained trackers are not optimal, in this paper, we propose a novel and effective architecture to learn optimal feature embeddings for online discriminative tracking. Our method, called DCFST, integrates the solver of a discriminant model that is differentiable and has a closed-form solution into convolutional neural networks. Then, the resulting network can be trained in an end-to-end way, obtaining optimal feature embeddings for the discriminant model-based tracker. As an instance, we apply the popular ridge regression model in this work to demonstrate the power of DCFST. Extensive experiments on six public benchmarks, OTB2015, NFS, GOT10k, TrackingNet, VOT2018, and VOT2019, show that our approach is efficient and generalizes well to class-agnostic target objects in online tracking, thus achieving state-of-the-art accuracy while running beyond real-time speed. Code will be made available."
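The key ingredient, a differentiable closed-form solver embedded in the network, can be sketched with ridge regression. The standalone function below is our illustration of this idea, not the authors' code; gradients flow through the solver back into the features.

```python
import torch

def ridge_regression_layer(X, y, lam=1.0):
    """Differentiable closed-form ridge regression: w = (X^T X + lam*I)^{-1} X^T y.
    X: (n, d) feature matrix, y: (n, 1) targets."""
    d = X.shape[1]
    A = X.t() @ X + lam * torch.eye(d, dtype=X.dtype, device=X.device)
    b = X.t() @ y
    return torch.linalg.solve(A, b)        # closed-form solution, differentiable

X = torch.randn(64, 32, requires_grad=True)   # features produced by a CNN backbone
y = torch.randn(64, 1)                        # regression targets (e.g., target labels)
w = ridge_regression_layer(X, y)
loss = (X @ w - y).pow(2).mean()
loss.backward()                               # gradients reach the feature extractor
```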



Paperid:688
Authors:Ningning Ma, Xiangyu Zhang, Jiawei Huang, Jian Sun
Title: WeightNet: Revisiting the Design Space of Weight Networks
Abstract:
We present a conceptually simple, flexible and effective framework for weight generating networks. Our approach is general: it unifies two distinct and extremely effective modules, SENet and CondConv, into the same framework on the weight space. The method, called WeightNet, generalizes the two methods by simply adding one more grouped fully-connected layer to the attention activation layer. We use the WeightNet, composed entirely of (grouped) fully-connected layers, to directly output the convolutional weight. WeightNet is easy and memory-efficient to train, operating on the kernel space instead of the feature space. Because of the flexibility, our method outperforms existing approaches on both ImageNet and COCO detection tasks, achieving better Accuracy-FLOPs and Accuracy-Parameter trade-offs. The framework on the flexible weight space has the potential to further improve the performance."
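To make the weight-generating idea concrete, here is a minimal sketch in which a fully-connected branch maps a globally pooled feature vector to a per-sample convolution kernel; the group sizes and layer choices are ours and do not reproduce the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightGeneratingConv(nn.Module):
    """Sketch of a weight-generating convolution: an attention-like FC branch
    produces the convolution kernel for each sample, which is then applied
    via a grouped convolution trick."""
    def __init__(self, in_ch, out_ch, k=3, reduction=4):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.fc1 = nn.Linear(in_ch, in_ch // reduction)
        self.fc2 = nn.Linear(in_ch // reduction, out_ch * in_ch * k * k)

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = F.adaptive_avg_pool2d(x, 1).flatten(1)              # (b, c) global context
        kernels = self.fc2(torch.sigmoid(self.fc1(ctx)))          # per-sample kernels
        kernels = kernels.view(b * self.out_ch, self.in_ch, self.k, self.k)
        # apply a different kernel to each sample with a single grouped conv
        out = F.conv2d(x.view(1, b * c, h, w), kernels, padding=self.k // 2, groups=b)
        return out.view(b, self.out_ch, h, w)

y = WeightGeneratingConv(16, 32)(torch.randn(4, 16, 20, 20))      # -> (4, 32, 20, 20)
```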



Paperid:689
Authors:Ryuhei Takahashi, Atsushi Hashimoto, Motoharu Sonogashira, Masaaki Iiyama
Title: Partially-Shared Variational Auto-encoders for Unsupervised Domain Adaptation with Target Shift
Abstract:
This paper discusses unsupervised domain adaptation (UDA) with target shift, i.e., UDA with non-identical label distributions in the source and target domains. In practice, this is an important problem; as we do not know labels in target domain datasets, we do not know whether or not its distribution is identical to that in the source domain dataset. Despite the inaccessibility to the shape of the label distribution in the target domain, a common approach of modern UDA methods reduces the gap between the feature distributions of source and target domains, which implicitly assumes that the label distributions are identical, resulting in an unsatisfactory performance upon target shift. To overcome this problem, the proposed method, partially shared variational autoencoders (PS-VAEs), uses a pair-wise feature alignment instead of feature distribution matching. PS-VAEs inter-convert the domain of each sample by a CycleGAN-based architecture while preserving its label-related content by sharing weights of two domain conversion branches as much as possible. To evaluate the performance of PS-VAEs, we carried out two experiments: UDA from synthesized data to real observation in human-pose estimation (regression) and UDA under controlled target shift intensities with digits datasets (classification). The proposed method outperformed the other methods in the regression task by a large margin while demonstrating its robustness under various levels of target shift."



Paperid:690
Authors:Zhengkai Jiang, Yu Liu, Ceyuan Yang, Jihao Liu, Peng Gao, Qian Zhang, Shiming Xiang, Chunhong Pan
Title: Learning Where to Focus for Efficient Video Object Detection
Abstract:
Transferring existing image-based detectors to the video is non-trivial since the quality of frames is always deteriorated by part occlusion, rare pose, and motion blur. Previous approaches propagate and aggregate features across video frames using optical-flow warping. However, directly applying image-level optical flow onto the high-level features might not establish accurate spatial correspondences. Therefore, a novel module called Learnable Spatio-Temporal Sampling (LSTS) has been proposed to learn semantic-level correspondences among frame features accurately. The sampled locations are first randomly initialized, then updated iteratively to find better spatial correspondences guided by detection supervision progressively. Besides, a Sparsely Recursive Feature Updating (SRFU) module and a Dense Feature Aggregation (DFA) module are also introduced to model temporal relations and enhance per-frame features, respectively. Without bells and whistles, the proposed method achieves state-of-the-art performance on the ImageNet VID dataset with less computational complexity and real-time speed. Code will be made available."



Paperid:691
Authors:Aviv Shamsian, Ofri Kleinfeld, Amir Globerson, Gal Chechik
Title: Learning Object Permanence from Video
Abstract:
Object Permanence allows people to reason about the location of non-visible objects, by understanding that they continue to exist even when not perceived directly. Object Permanence is critical for building a model of the world, since objects in natural visual scenes dynamically occlude and contain each other. Intensive studies in developmental psychology suggest that object permanence is a challenging task that is learned through extensive experience. Here we introduce the setup of learning Object Permanence from data. We explain why the learning problem should be dissected into four components, where objects are (1) visible, (2) occluded, (3) contained by another object and (4) carried by a containing object. The fourth subtask, where a target object is carried by a containing object, is particularly challenging because it requires a system to reason about a moving location of an invisible object. We then present a unified deep architecture that learns to predict object location under these four scenarios. We evaluate the architecture and system on a new dataset based on CATER, and find that it outperforms previous methods and various baselines. "



Paperid:692
Authors:Chuhan Zhang, Ankush Gupta, Andrew Zisserman
Title: Adaptive Text Recognition through Visual Matching
Abstract:
This work addresses the problems of generalization and flexibility for text recognition in documents. We introduce a new model that exploits the repetitive nature of characters in languages, and decouples the visual decoding and linguistic modelling stages through intermediate representations in the form of similarity maps. By doing this, we turn text recognition into a visual matching problem, thereby achieving generalization in appearance and flexibility in classes. We evaluate the model on both synthetic and real datasets across different languages and alphabets, and show that it can handle challenges that traditional architectures are unable to solve without expensive re-training, including: (i) it can change the number of classes simply by changing the exemplars; and (ii) it can generalize to novel languages and characters (not in the training data) simply by providing a new glyph exemplar set. In essence, it is able to carry out one-shot sequence recognition. We also demonstrate that the model can generalize to unseen fonts without requiring new exemplars from them."



Paperid:693
Authors:Yixuan Li, Zixu Wang, Limin Wang, Gangshan Wu
Title: Actions as Moving Points
Abstract:
The existing action tubelet detectors often depend on heuristic anchor design and placement, which might be computationally expensive and sub-optimal for precise localization. In this paper, we present a conceptually simple, computationally efficient, and more precise action tubelet detection framework, termed as MovingCenter Detector (MOC-detector), by treating an action instance as a trajectory of moving points. Based on the insight that movement information could simplify and assist action tubelet detection, our MOC-detector is composed of three crucial head branches: (1) Center Branch for instance center detection and action recognition, (2) Movement Branch for movement estimation at adjacent frames to form trajectories of moving points, (3) Box Branch for spatial extent detection by directly regressing bounding box size at each estimated center. These three branches work together to generate the tubelet detection results, which could be further linked to yield video-level tubes with a matching strategy. Our MOC-detector outperforms the existing state-of-the-art methods for both metrics of frame-mAP and video-mAP on the JHMDB and UCF101-24 datasets. The performance gap is more evident for higher video IoU, demonstrating that our MOC-detector is particularly effective for more precise action detection. We provide the code at https://github.com/MCG-NJU/MOC-Detector . "



Paperid:694
Authors:Yuhuang Hu, Tobi Delbruck, Shih-Chii Liu
Title: Learning to Exploit Multiple Vision Modalities by Using Grafted Networks
Abstract:
Novel vision sensors such as thermal, hyperspectral, polarization, and event cameras provide information that is not available from conventional intensity cameras. An obstacle to using these sensors with current powerful deep neural networks is the lack of large labeled training datasets. This paper proposes a Network Grafting Algorithm (NGA), where a new front end network driven by unconventional visual inputs replaces the front end network of a pretrained deep network that processes intensity frames. The self-supervised training uses only synchronously-recorded intensity frames and novel sensor data to maximize feature similarity between the pretrained network and the grafted network. We show that the enhanced grafted network reaches competitive average precision (AP50) scores to the pretrained network on an object detection task using thermal and event camera datasets, with no increase in inference costs. Particularly, the grafted network driven by thermal frames showed a relative improvement of 49.11% over the use of intensity frames. The grafted front end has only 5--8% of the total parameters and can be trained in a few hours on a single GPU equivalent to 5% of the time that would be needed to train the entire object detector from labeled data. NGA allows new vision sensors to capitalize on previously pretrained powerful deep models, saving on training cost and widening a range of applications for novel sensors."
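The self-supervised objective described above boils down to matching features between the frozen pretrained front end and the trainable grafted one. The loss below is our own sketch of this idea; the plain MSE is an assumption, as the abstract specifies feature similarity without committing to a particular distance.

```python
import torch
import torch.nn.functional as F

def grafting_loss(grafted_frontend, pretrained_frontend, new_sensor_frames, intensity_frames):
    """Train the grafted front end (driven by the unconventional sensor) to
    reproduce the features that the frozen pretrained front end extracts from
    synchronously recorded intensity frames. No labels are required."""
    with torch.no_grad():
        target = pretrained_frontend(intensity_frames)   # frozen teacher features
    pred = grafted_frontend(new_sensor_frames)           # trainable grafted features
    return F.mse_loss(pred, target)                      # assumed similarity measure
```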



Paperid:695
Authors:Alexander Grabner, Yaming Wang, Peizhao Zhang, Peihong Guo, Tong Xiao, Peter Vajda, Peter M. Roth, Vincent Lepetit
Title: Geometric Correspondence Fields: Learned Differentiable Rendering for 3D Pose Refinement in the Wild
Abstract:
We present a novel 3D pose refinement approach based on differentiable rendering for objects of arbitrary categories in the wild. In contrast to previous methods, we make two main contributions: First, instead of comparing real-world images and synthetic renderings in the RGB or mask space, we compare them in a feature space optimized for 3D pose refinement. Second, we introduce a novel differentiable renderer that learns to approximate the rasterization backward pass from data instead of relying on a hand-crafted algorithm. For this purpose, we predict deep cross-domain correspondences between RGB images and 3D model renderings in the form of what we call geometric correspondence fields. These correspondence fields serve as pixel-level gradients which are analytically propagated backward through the rendering pipeline to perform a gradient-based optimization directly on the 3D pose. In this way, we precisely align 3D models to objects in RGB images which results in significantly improved 3D pose estimates. We evaluate our approach on the challenging Pix3D dataset and achieve up to 55% relative improvement compared to state-of-the-art refinement methods in multiple metrics."



Paperid:696
Authors:Zhong Li, Yu Ji, Jingyi Yu, Jinwei Ye
Title: 3D Fluid Flow Reconstruction Using Compact Light Field PIV
Abstract:
Particle Imaging Velocimetry (PIV) estimates the fluid flow by analyzing the motion of injected particles. The problem is challenging as the particles lie at different depths but have similar appearances. Tracking a large number of moving particles is particularly difficult due to the heavy occlusion. In this paper, we present a PIV solution that uses a compact lenslet-based light field camera to track dense particles floating in the fluid and reconstruct the 3D fluid flow. We exploit the focal symmetry property in the light field focal stacks for recovering the depths of similar-looking particles. We further develop a motion-constrained optical flow estimation algorithm by enforcing the local motion rigidity and the Navier-Stokes fluid constraint. Finally, the estimated particle motion trajectory is used to visualize the 3D fluid flow. Comprehensive experiments on both synthetic and real data show that using a compact light field camera, our technique can recover dense and accurate 3D fluid flow. "



Paperid:697
Authors:Sharat Agarwal, Himanshu Arora, Saket Anand, Chetan Arora
Title: Contextual Diversity for Active Learning
Abstract:
The requirement of large annotated datasets restricts the use of deep convolutional neural networks (CNNs) for many practical applications. The problem can be mitigated by using active learning (AL) techniques which, under a given annotation budget, allow selecting a subset of data that yields maximum accuracy upon fine-tuning. State-of-the-art AL approaches typically rely on measures of visual diversity or prediction uncertainty, which are unable to effectively capture the variations in the spatial context. On the other hand, modern CNN architectures make heavy use of spatial context for achieving highly accurate predictions. Since the context is difficult to evaluate in the absence of ground-truth labels, we introduce the notion of contextual diversity that captures the confusion associated with spatially co-occurring classes. Contextual Diversity (CD) hinges on a crucial observation that the probability vector predicted by a CNN for a region of interest typically contains information from a larger receptive field. Exploiting this observation, we use the proposed CD measure within two AL frameworks: (1) a core-set based strategy and (2) a reinforcement learning based policy, for the active frame selection. Our extensive empirical evaluation establishes state of the art results for active learning on benchmark datasets of Semantic Segmentation, Object Detection and Image classification. Our ablation studies show clear advantages of using contextual diversity for active learning."



Paperid:698
Authors:Fadime Sener, Dipika Singhania, Angela Yao
Title: Temporal Aggregate Representations for Long-Range Video Understanding
Abstract:
Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state of the art in both next action and dense anticipation with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended for video segmentation and action recognition."



Paperid:699
Authors:Zhe Niu, Brian Mak
Title: Stochastic Fine-grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition
Abstract:
In this paper, we propose novel stochastic modeling of various components of a continuous sign language recognition (CSLR) system that is based on the transformer encoder and connectionist temporal classification (CTC). Most importantly, we model each sign gloss with multiple states, and the number of states is a categorical random variable that follows a learned probability distribution, providing stochastic fine-grained labels for training the CTC decoder. We further propose a stochastic frame dropping mechanism and a gradient stopping method to deal with the severe overfitting problem in training the transformer model with CTC loss. These two methods also help reduce the training computation, both in terms of time and space, significantly. We evaluated our model on popular CSLR datasets, and show its effectiveness compared to the state-of-the-art methods."



Paperid:700
Authors:Sinisa Stekovic, Shreyas Hampali, Mahdi Rad, Sayan Deb Sarkar, Friedrich Fraundorfer, Vincent Lepetit
Title: General 3D Room Layout from a Single View by Render-and-Compare
Abstract:
We present a novel method to reconstruct the 3D layout of a room—walls, floors, ceilings—from a single perspective view in challenging conditions, by contrast with previous single-view methods restricted to cuboid-shaped layouts. This input view can consist of a color image only, but considering a depth map results in a more accurate reconstruction. Our approach is formalized as solving a constrained discrete optimization problem to find the set of 3D polygons that constitute the layout. In order to deal with occlusions between components of the layout, which is a problem ignored by previous works, we introduce an analysis-by-synthesis method to iteratively refine the 3D layout estimate. As no dataset was available to evaluate our method quantitatively, we created one together with several appropriate metrics. Our dataset consists of 293 images from ScanNet, which we annotated with precise 3D layouts. It offers three times more samples than the popular NYUv2 303 benchmark, and a much larger variety of layouts."



Paperid:701
Authors:Vikramjit Sidhu, Edgar Tretschk, Vladislav Golyanik, Antonio Agudo, Christian Theobalt
Title: Neural Dense Non-Rigid Structure from Motion with Latent Space Constraints
Abstract:
We introduce the first dense neural non-rigid structure from motion (N-NRSfM) approach, which can be trained end-to-end in an unsupervised manner from 2D point tracks. Compared to the competing methods, our combination of loss functions is fully-differentiable and can be readily integrated into deep-learning systems. We formulate the deformation model by an auto-decoder and impose subspace constraints on the recovered latent space function in a frequency domain. Thanks to the state recurrence cue, we classify the reconstructed non-rigid surfaces based on their similarity and recover the period of the input sequence. Our N-NRSfM approach achieves competitive accuracy on widely-used benchmark sequences and high visual quality on various real videos. Apart from being a standalone technique, our method enables multiple applications including shape compression, completion and interpolation, among others. Combined with an encoder trained directly on 2D images, we perform scenario-specific monocular 3D shape reconstruction at interactive frame rates. To facilitate the reproducibility of the results and boost the new research direction, we open-source our code and provide trained models for research purposes. "



Paperid:702
Authors:Anelise Newman, Camilo Fosco, Vincent Casser, Allen Lee, Barry McNamara, Aude Oliva
Title: Multimodal Memorability: Modeling Effects of Semantics and Decay on Video Memorability
Abstract:
A key capability of an intelligent system is deciding when events from past experience must be remembered and when they can be forgotten. Towards this goal, we develop a predictive model of human visual event memory and how those memories decay over time. We introduce Memento10k, a new, dynamic video memorability dataset containing human annotations at different viewing delays. Based on our findings, we propose a new mathematical formulation of memorability decay, resulting in a model that is able to produce the first quantitative estimation of how a video decays in memory over time. In contrast with previous work, our model can predict the probability that a video will be remembered at an arbitrary delay. Importantly, our approach combines visual and semantic information (in the form of textual captions) to fully represent the meaning of events. Our experiments on two video memorability benchmarks, including Memento10k, show that our model significantly improves upon the best prior approach (by 12% on average). "



Paperid:703
Authors:Qizhang Li, Yiwen Guo, Hao Chen
Title: Yet Another Intermediate-Level Attack
Abstract:
The transferability of adversarial examples across deep neural network (DNN) models is the crux of a spectrum of black-box attacks. In this paper, we propose a novel method to enhance the black-box transferability of baseline adversarial examples. By establishing a linear mapping of the intermediate-level discrepancies (between a set of adversarial inputs and their benign counterparts) for predicting their evoked adversarial loss, we manage to take full advantage of the optimization procedure of baseline attacks. We conduct extensive experiments to verify the effectiveness of our method on CIFAR-100 and ImageNet. Experimental results demonstrate that it outperforms previous state-of-the-arts by large margins. Our code will be made publicly available."



Paperid:704
Authors:Chao Li, Xiaohu Guo
Title: Topology-Change-Aware Volumetric Fusion for Dynamic Scene Reconstruction
Abstract:
Topology change is a challenging problem for 4D reconstruction of dynamic scenes. In the classic volumetric fusion-based framework, a mesh is usually extracted from the TSDF volume as the canonical surface representation to help estimate the deformation field. However, the surface and Embedded Deformation Graph (EDG) representations bring conflicts under topology changes since the surface mesh has fixed connectivity but the deformation field can be discontinuous. In this paper, the classic framework is re-designed to enable 4D reconstruction of dynamic scenes under topology changes, by introducing a novel structure of Non-manifold Volumetric Grid to the re-design of both TSDF and EDG, which allows connectivity updates by cell splitting and replication. Experiments show convincing reconstruction results for dynamic scenes with topology changes, as compared to the state-of-the-art methods."



Paperid:705
Authors:Qunliang Xing, Mai Xu, Tianyi Li, Zhenyu Guan
Title: Early Exit Or Not: Resource-Efficient Blind Quality Enhancement for Compressed Images
Abstract:
Lossy image compression is pervasively conducted to save communication bandwidth, resulting in undesirable compression artifacts. Recently, extensive approaches have been proposed to reduce image compression artifacts at the decoder side; however, they require a series of architecture-identical models to process images with different quality, which are inefficient and resource-consuming. Besides, it is common in practice that compressed images are with unknown quality and it is intractable for existing approaches to select a suitable model for blind quality enhancement. In this paper, we propose a resource-efficient blind quality enhancement (RBQE) approach for compressed images. Specifically, our approach blindly and progressively enhances the quality of compressed images through a dynamic deep neural network (DNN), in which an early-exit strategy is embedded. Then, our approach can automatically decide to terminate or continue enhancement according to the assessed quality of enhanced images. Consequently, slight artifacts can be removed in a simpler and faster process, while the severe artifacts can be further removed in a more elaborate process. Extensive experiments demonstrate that our RBQE approach achieves state-of-the-art performance in terms of both blind quality enhancement and resource efficiency."
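The progressive early-exit idea can be sketched as a chain of enhancement stages, each followed by a lightweight quality check that decides whether to stop. The toy module below is our illustration only; the stage and assessor designs and the threshold are placeholders, not the RBQE architecture.

```python
import torch
import torch.nn as nn

class EarlyExitEnhancer(nn.Module):
    """Chain of residual enhancement stages with a per-stage quality assessor.
    Lightly compressed images exit early; heavily compressed ones pass through
    more stages."""
    def __init__(self, n_stages=5, ch=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ch, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, ch, 3, padding=1)) for _ in range(n_stages))
        self.assessors = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                          nn.Linear(ch, 1), nn.Sigmoid()) for _ in range(n_stages))

    def forward(self, x, quality_threshold=0.9):
        for stage, assess in zip(self.stages, self.assessors):
            x = x + stage(x)                       # residual enhancement step
            if assess(x).mean() > quality_threshold:
                break                              # early exit: quality judged sufficient
        return x

out = EarlyExitEnhancer()(torch.randn(1, 3, 64, 64))
```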



Paperid:706
Authors:Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Carsten Stoll, Christian Theobalt
Title: PatchNets: Patch-Based Generalizable Deep Implicit 3D Shape Representations
Abstract:
Implicit surface representation combined with deep learning has led to impressive models which can represent detailed shapes of objects. Implicit surface representations, such as signed-distance functions, allow to represent shapes of arbitrary topologies. Since a continuous function is learned, the reconstructions can also be extracted at any arbitrary resolution. However, large shape datasets such as ShapeNet are required to train such models. In this paper, we present a mid-level patch-based surface representation. At the level of patches, objects across different categories share similarities, which leads to more generalizable models. We introduce a novel method to learn these patches in a canonical space, such that they are as object-agnostic as possible. We show that patches trained on one category of objects from ShapeNet can also represent detailed shapes from any other category. In addition, our patches can be trained using much fewer shapes, compared to existing approaches. We show several applications of our new representation, including shape interpolation and partial point cloud completion. Due to explicit control over patch extrinsics, our representation is also more controllable compared to object-level representations, which we demonstrate by non-rigidly deforming shapes. "



Paperid:707
Authors:Yipeng Qin, Niloy Mitra, Peter Wonka
Title: How does Lipschitz Regularization Influence GAN Training?
Abstract:
Despite the success of Lipschitz regularization in stabilizing GAN training, the exact reason for its effectiveness remains poorly understood. The direct effect of $K$-Lipschitz regularization is to restrict the $L_2$-norm of the neural network gradient to be smaller than a threshold $K$ (e.g., $K=1$) such that $\|\nabla f\| \leq K$. In this work, we show that Lipschitz regularization ensures that all loss functions effectively work in the same way. Empirically, we verify our proposition on the MNIST, CIFAR10 and CelebA datasets."
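For readers who want to see the constraint $\|\nabla f\| \leq K$ in code, one standard way to impose (approximate) $K$-Lipschitzness is a gradient penalty in the style of WGAN-GP. The sketch below is a generic illustration of the kind of regularization the paper analyzes, not the paper's own experimental setup.

```python
import torch

def gradient_penalty(critic, real, fake, K=1.0):
    """Penalize deviations of the critic's input-gradient norm from K at points
    interpolated between real and fake samples (WGAN-GP style)."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    out = critic(x_hat)
    grads, = torch.autograd.grad(out.sum(), x_hat, create_graph=True)
    grad_norm = grads.flatten(1).norm(2, dim=1)
    return ((grad_norm - K) ** 2).mean()   # added to the GAN loss with some weight
```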



Paperid:708
Authors:Yukai Lin, Viktor Larsson, Marcel Geppert, Zuzana Kukelova, Marc Pollefeys, Torsten Sattler
Title: Infrastructure-based Multi-Camera Calibration using Radial Projections
Abstract:
Multi-camera systems are an important sensor platform for intelligent systems such as self-driving cars. Pattern-based calibration techniques can be used to calibrate the intrinsics of the cameras individually. However, extrinsic calibration of systems with little to no visual overlap between the cameras is a challenge. Given the camera intrinsics, infrastructure-based calibration techniques are able to estimate the extrinsics using 3D maps pre-built via SLAM or Structure-from-Motion. In this paper, we propose to fully calibrate a multi-camera system from scratch using an infrastructure-based approach. Assuming that the distortion is mainly radial, we introduce a two-stage approach. We first estimate the camera-rig extrinsics up to a single unknown translation component per camera. Next, we solve for both the intrinsic parameters and the missing translation components. Extensive experiments on multiple indoor and outdoor scenes with multiple multi-camera systems show that our calibration method achieves high accuracy and robustness. In particular, our approach is more robust than the naive approach of first estimating intrinsic parameters and pose per camera before refining the extrinsic parameters of the system. The implementation is available at https://github.com/youkely/InfrasCal."



Paperid:709
Authors:Heeseung Kwon, Manjin Kim, Suha Kwak, Minsu Cho
Title: MotionSqueeze: Neural Motion Feature Learning for Video Understanding
Abstract:
Motion plays a crucial role in understanding videos and most state-of-the-art neural models for video classification incorporate motion information typically using optical flows extracted by a separate off-the-shelf method. As the frame-by-frame optical flows require heavy computation, incorporating motion information has remained a major computational bottleneck for video understanding. In this work, we replace external and heavy computation of optical flows with internal and light-weight learning of motion features. We propose a trainable neural module, dubbed MotionSqueeze, for effective motion feature extraction. Inserted in the middle of any neural network, it learns to establish correspondences across frames and convert them into motion features, which are readily fed to the next downstream layer for better prediction. We demonstrate that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost, outperforming the state of the art on Something-Something-V1&V2 datasets."



Paperid:710
Authors:Masada Tzabari, Yoav Y. Schechner
Title: Polarized Optical-Flow Gyroscope
Abstract:
We merge by generalization two principles of passive optical sensing of motion. One is common spatially resolved imaging, where motion induces temporal readout changes at high-contrast spatial features, as used in traditional optical-flow. The other is the polarization compass, where axial rotation induces temporal readout changes due to the change of incoming polarization angle, relative to the camera frame. The latter has traditionally been modeled for uniform objects. This merger generalizes the brightness constancy assumption and optical-flow, to handle polarization. It also generalizes the polarization compass concept to handle arbitrarily textured objects. This way, scene regions having partial polarization contribute to motion estimation, irrespective of their texture and non-uniformity. As an application, we derive and demonstrate passive sensing of differential ego-rotation around the camera optical axis. "



Paperid:711
Authors:Da Li, Timothy Hospedales
Title: Online Meta-Learning for Multi-Source and Semi-Supervised Domain Adaptation
Abstract:
Domain adaptation (DA) is the topical problem of adapting models from labelled source datasets so that they perform well on target datasets where only unlabelled or partially labelled data is available. Many methods have been proposed to address this problem through different ways to minimise the domain shift between source and target datasets. In this paper we take an orthogonal perspective and propose a framework to further enhance performance by meta-learning the initial conditions of existing DA algorithms. This is challenging compared to the more widely considered setting of few-shot meta-learning, due to the length of the computation graph involved. Therefore we propose an online shortest-path meta-learning framework that is both computationally tractable and practically effective for improving DA performance. We present variants for both multi-source unsupervised domain adaptation (MSDA), and semi-supervised domain adaptation (SSDA). Importantly, our approach is agnostic to the base adaptation algorithm, and can be applied to improve many techniques. Experimentally, we demonstrate improvements on classic (DANN) and recent (MCD and MME) techniques for MSDA and SSDA, and ultimately achieve state of the art results on several DA benchmarks including the largest scale DomainNet."



Paperid:712
Authors:Yaoyao Liu, Bernt Schiele, Qianru Sun
Title: An Ensemble of Epoch-wise Empirical Bayes for Few-shot Learning
Abstract:
Few-shot learning aims to train efficient predictive models with a few examples. The lack of training data leads to poor models that perform high-variance or low-confidence predictions. In this paper, we propose to meta-learn the ensemble of epoch-wise empirical Bayes models (E3BM) to achieve robust predictions. "Epoch-wise" means that each training epoch has a Bayes model whose parameters are specifically learned and deployed. "Empirical" means that the hyperparameters, e.g., used for learning and ensembling the epoch-wise models, are generated by hyperprior learners conditional on task-specific data. We introduce four kinds of hyperprior learners by considering inductive vs. transductive, and epoch-dependent vs. epoch-independent, in the paradigm of meta-learning. We conduct extensive experiments for five-class few-shot learning tasks on three challenging benchmarks: miniImageNet, tieredImageNet, and FC100, and achieve top performance using the epoch-dependent transductive hyperprior learner, which captures the richest information. Our ablation study shows that both "epoch-wise ensemble" and "empirical" encourage high efficiency and robustness in the model performance."



Paperid:713
Authors:Silvia Bucci, Mohammad Reza Loghmani, Tatiana Tommasi
Title: On the Effectiveness of Image Rotation for Open Set Domain Adaptation
Abstract:
Open Set Domain Adaptation (OSDA) bridges the domain gap between a labeled source domain and an unlabeled target domain, while also rejecting target classes that are not present in the source. To avoid negative transfer, OSDA can be tackled by first separating the known/unknown target samples and then aligning known target samples with the source data. We propose a novel method to address both of these problems using the self-supervised task of rotation recognition. Moreover, we assess the performance with a new open set metric that properly balances the contribution of recognizing the known classes and rejecting the unknown samples. Comparative experiments with existing OSDA methods on the standard Office-31 and Office-Home benchmarks show that: (i) our method outperforms its competitors, (ii) reproducibility for this field is a crucial issue to tackle, (iii) our metric provides a reliable tool to allow fair open set evaluation."



Paperid:714
Authors:Kwang In Kim, Christian Richardt, Hyung Jin Chang
Title: Combining Task Predictors via Enhancing Joint Predictability
Abstract:
Predictor combination aims to improve a (target) predictor of a learning task based on the (reference) predictors of potentially relevant tasks, without having access to the internals of individual predictors. We present a new predictor combination algorithm that improves the target by i) measuring the relevance of references based on their capabilities in predicting the target, and ii) strengthening such estimated relevance. Unlike existing predictor combination approaches that only exploit pairwise relationships between the target and each reference, and thereby ignore potentially useful dependence among references, our algorithm jointly assesses the relevance of all references by adopting a Bayesian framework. This also offers a rigorous way to automatically select only relevant references. Based on experiments on six real-world datasets from visual attribute ranking and multi-class classification scenarios, we demonstrate that our algorithm offers a significant performance gain and broadens the application range of existing predictor combination approaches."



Paperid:715
Authors:Jiaxi Wu, Songtao Liu, Di Huang, Yunhong Wang
Title: Multi-Scale Positive Sample Refinement for Few-Shot Object Detection
Abstract:
Few-shot object detection (FSOD) helps detectors adapt to unseen classes with few training instances, and is useful when manual annotation is time-consuming or data acquisition is limited. Unlike previous attempts that exploit few-shot classification techniques to facilitate FSOD, this work highlights the necessity of handling the problem of scale variations, which is challenging due to the unique sample distribution. To this end, we propose a Multi-scale Positive Sample Refinement (MPSR) approach to enrich object scales in FSOD. It generates multi-scale positive samples as object pyramids and refines the prediction at various scales. We demonstrate its advantage by integrating it as an auxiliary branch to the popular architecture of Faster R-CNN with FPN, delivering a strong FSOD solution. Several experiments are conducted on PASCAL VOC and MS COCO, and the proposed approach achieves state of the art results and significantly outperforms other counterparts, which shows its effectiveness."



Paperid:716
Authors:Carl Toft, Daniyar Turmukhambetov, Torsten Sattler, Fredrik Kahl, Gabriel J. Brostow
Title: Single-Image Depth Prediction Makes Feature Matching Easier
Abstract:
Good local features improve the robustness of many 3D re-localization and multi-view reconstruction pipelines. The problem is that viewing angle and distance severely impact the recognizability of a local feature. Attempts to improve appearance invariance by choosing better local feature points or by leveraging outside information, have come with pre-requisites that made some of them impractical. In this paper, we propose a surprisingly effective enhancement to local feature extraction, which improves matching. We show that CNN-based depths inferred from single RGB images, are quite helpful, despite their flaws. They allow us to pre-warp images and rectify perspective distortions, to significantly enhance SIFT and BRISK features, enabling more good matches, even when cameras are looking at the same scene but in opposite directions. "



Paperid:717
Authors:Duo Li, Qifeng Chen
Title: Deep Reinforced Attention Learning for Quality-Aware Visual Recognition
Abstract:
In this paper, we build upon the weakly-supervised generation mechanism of intermediate attention maps in any convolutional neural networks and disclose the effectiveness of attention modules more straightforwardly to fully exploit their potential. Given an existing neural network equipped with arbitrary attention modules, we introduce a meta critic network to evaluate the quality of attention maps in the main network. Due to the discreteness of our designed reward, the proposed learning method is arranged in a reinforcement learning setting, where the attention actors and recurrent critics are alternately optimized to provide instant critique and revision for the temporary attention representation, hence coined as Deep REinforced Attention Learning (DREAL). It could be applied universally to network architectures with different types of attention modules and promotes their expressive ability by maximizing the relative gain of the final recognition performance arising from each individual attention module, as demonstrated by extensive experiments on both category and instance recognition benchmarks."



Paperid:718
Authors:Yuxi Li, Weiyao Lin, John See, Ning Xu, Shugong Xu, Ke Yan, Cong Yang
Title: CFAD: Coarse-to-Fine Action Detector for Spatiotemporal Action Localization
Abstract:
Most current pipelines for spatiotemporal action localization connect frame-wise or clip-wise detection results to generate action proposals. In this paper, we propose Coarse-to-Fine Action Detector (CFAD), an original end-to-end trainable framework for efficient spatiotemporal action localization. The CFAD introduces a new paradigm that first estimates coarse spatiotemporal action tubes from video streams, and then refines the tubes’ location based on key timestamps. This concept is implemented by two key components, the Coarse and Refine Modules in our framework. The parameterized modeling of long temporal information in the Coarse Module helps obtain accurate classification and initial tube estimation, while the Refine Module selectively adjusts the tube location under the guidance of key timestamps. Against other methods, the proposed CFAD achieves state-of-the-art results on action detection benchmarks of UCF101-24, UCFSports and JHMDB-21 with inference speed that is 3.3× faster than the nearest competitor."



Paperid:719
Authors:Yanhong Zeng, Jianlong Fu, Hongyang Chao
Title: Learning Joint Spatial-Temporal Transformations for Video Inpainting
Abstract:
High-quality video inpainting that completes missing regions in video frames is a promising yet challenging task. State-of-the-art approaches adopt attention models to complete a frame by searching missing contents from reference frames, and further complete whole videos frame by frame. However, these approaches can suffer from inconsistent attention results along spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in videos. In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, we simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss. To show the superiority of the proposed model, we conduct both quantitative and qualitative evaluations by using standard stationary masks and more realistic moving object masks. Demo videos are available at https://github.com/researchmm/STTN."



Paperid:720
Authors:Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, Jian Sun
Title: Single Path One-Shot Neural Architecture Search with Uniform Sampling
Abstract:
We revisit the one-shot Neural Architecture Search (NAS) paradigm and analyze its advantages over existing NAS approaches. Existing one-shot methods, however, are hard to train and not yet effective on large-scale datasets like ImageNet. This work proposes a Single Path One-Shot model to address the challenges in training. Our central idea is to construct a simplified supernet, where all architectures are single paths, so that the weight co-adaptation problem is alleviated. Training is performed by uniform path sampling. All architectures (and their weights) are trained fully and equally. Comprehensive experiments verify that our approach is flexible and effective. It is easy to train and fast to search. It effortlessly supports complex search spaces (e.g., building blocks, channels, mixed-precision quantization) and different search constraints (e.g., FLOPs, latency). It is thus convenient to use for various needs. It achieves state-of-the-art performance on the large-scale ImageNet dataset."
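A minimal sketch of uniform single-path sampling in a supernet (not the authors' implementation; the choice blocks and sizes are placeholders): each layer holds several candidate operations and, at every training step, one operation per layer is drawn uniformly at random so that only the sampled path receives gradients.

import random
import torch
import torch.nn as nn

class ChoiceLayer(nn.Module):
    """One supernet layer holding several candidate blocks (placeholder convs)."""
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in kernel_sizes
        ])

    def forward(self, x, idx):
        return self.ops[idx](x)          # only the sampled block runs

class SinglePathSupernet(nn.Module):
    def __init__(self, channels=16, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(ChoiceLayer(channels) for _ in range(depth))

    def forward(self, x, path=None):
        if path is None:                 # uniform path sampling
            path = [random.randrange(len(layer.ops)) for layer in self.layers]
        for layer, idx in zip(self.layers, path):
            x = layer(x, idx)
        return x

# one training step: sample a path, back-propagate only through its weights
net = SinglePathSupernet()
x = torch.randn(2, 16, 32, 32)
loss = net(x).mean()
loss.backward()

After supernet training, the search phase evaluates candidate paths with the inherited weights (the paper uses evolutionary search under FLOPs or latency constraints); that step is omitted here.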



Paperid:721
Authors:Kaiyang Zhou, Yongxin Yang, Timothy Hospedales, Tao Xiang
Title: Learning to Generate Novel Domains for Domain Generalization
Abstract:
This paper focuses on domain generalization (DG), the task of learning from multiple source domains a model that generalizes well to unseen domains. A main challenge for DG is that the available source domains often exhibit limited diversity, hampering the model's ability to learn to generalize. We therefore employ a data generator to synthesize data from pseudo-novel domains to augment the source domains. This explicitly increases the diversity of available training domains and leads to a more generalizable model. To train the generator, we model the distribution divergence between source and synthesized pseudo-novel domains using optimal transport, and maximize the divergence. To ensure that semantics are preserved in the synthesized data, we further impose cycle-consistency and classification losses on the generator. Our method, L2A-OT (Learning to Augment by Optimal Transport), outperforms current state-of-the-art DG methods on four benchmark datasets."



Paperid:722
Authors:Theodora Kontogianni, Michael Gygli, Jasper Uijlings, Vittorio Ferrari
Title: Continuous Adaptation for Interactive Object Segmentation by Learning from Corrections
Abstract:
In interactive object segmentation a user collaborates with a computer vision model to segment an object. Recent works employ convolutional neural networks for this task: Given an image and a set of corrections made by the user as input, they output a segmentation mask. These approaches achieve strong performance by training on large datasets but they keep the model parameters unchanged at test time. Instead, we recognize that user corrections can serve as sparse training examples and we propose a method that capitalizes on that idea to update the model parameters on-the-fly to the data at hand. Our approach enables the adaptation to a particular object and its background, to distribution shifts in a test set, to specific object classes, and even to large domain changes, where the imaging modality changes between training and testing. We perform extensive experiments on 8 diverse datasets and show: Compared to a model with frozen parameters, our method reduces the required corrections (i) by 9%-30% when distribution shifts are small between training and testing; (ii) by 12%-44% when specializing to a specific class; (iii) and by 60% and 77% when we completely change domain between training and testing."
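A minimal sketch of the on-the-fly adaptation idea (hypothetical model interface; not the paper's training code): user clicks are treated as sparse labels and a few gradient steps specialize the segmentation model to the object and image at hand.

import torch
import torch.nn.functional as F

def adapt_to_corrections(model, image, clicks, steps=5, lr=1e-4):
    """Update `model` at test time from sparse user corrections.

    image  : (1, 3, H, W) tensor
    clicks : list of (y, x, label) with label 0 (background) or 1 (object)
    model  : assumed to return (1, 2, H, W) per-pixel logits
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    ys = torch.tensor([c[0] for c in clicks])
    xs = torch.tensor([c[1] for c in clicks])
    labels = torch.tensor([c[2] for c in clicks])
    for _ in range(steps):
        logits = model(image)
        # cross-entropy evaluated only at the clicked pixels (sparse supervision)
        loss = F.cross_entropy(logits[0, :, ys, xs].t(), labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model(image).argmax(1)        # refined segmentation mask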



Paperid:723
Authors:Othman Sbai, Camille Couprie, Mathieu Aubry
Title: Impact of base dataset design on few-shot image classification
Abstract:
The quality and generality of deep image features is crucially determined by the data they have been trained on, but little is known about this often overlooked effect. In this paper, we systematically study the effect of variations in the training data by evaluating deep features trained on different image sets in a few-shot classification setting. The experimental protocol we define allows us to explore key practical questions. What is the influence of the similarity between base and test classes? Given a fixed annotation budget, what is the optimal trade-off between the number of images per class and the number of classes? Given a fixed dataset, can features be improved by splitting or combining different classes? Should simple or diverse classes be annotated? In a wide range of experiments, we provide clear answers to these questions on the miniImageNet, ImageNet and CUB-200 datasets. We also show how the base dataset design can improve performance in few-shot classification more drastically than replacing a simple baseline by an advanced state-of-the-art algorithm."



Paperid:724
Authors:Yuming Shen, Jie Qin, Lei Huang, Li Liu, Fan Zhu, Ling Shao
Title: Invertible Zero-Shot Recognition Flows
Abstract:
Deep generative models have been successfully applied to Zero-Shot Learning (ZSL) recently. However, the underlying drawbacks of GANs and VAEs (e.g., the hardness of training with ZSL-oriented regularizers and the limited generation quality) hinder the existing generative ZSL models from fully bypassing the seen-unseen bias. To tackle the above limitations, for the first time, this work incorporates a new family of generative models (i.e., flow-based models) into ZSL. The proposed Invertible Zero-shot Flow (IZF) learns factorized data embeddings (i.e., the semantic factors and the non-semantic ones) with the forward pass of an invertible flow network, while the reverse pass generates data samples. This procedure theoretically extends conventional generative flows to a conditional scheme. To explicitly solve the bias problem, our model enlarges the seen-unseen distributional discrepancy based on negative sample-based distance measurement. Notably, IZF works flexibly with either a naive Bayesian classifier or a held-out trainable one for zero-shot recognition. Experiments on widely-adopted ZSL benchmarks demonstrate the significant performance gain of IZF over existing methods, in terms of both classic and generalized settings."



Paperid:725
Authors:Weidong Zhang, Wei Zhang, Yinda Zhang
Title: GeoLayout: Geometry Driven Room Layout Estimation Based on Depth Maps of Planes
Abstract:
The task of room layout estimation is to locate the wall-floor, wall-ceiling, and wall-wall boundaries. Most recent methods solve this problem based on edge/keypoint detection or semantic segmentation. However, these approaches have paid limited attention to the geometry of the dominant planes and the intersections between them, which have a significant impact on room layout. In this work, we propose to incorporate geometric reasoning into deep learning for layout estimation. Our approach learns to infer the depth maps of the dominant planes in the scene by predicting the pixel-level surface parameters, and the layout can be generated by the intersection of the depth maps. Moreover, we present a new dataset with pixel-level depth annotation of dominant planes. It is larger than the existing datasets and contains both cuboid and non-cuboid rooms. Experimental results show that our approach produces considerable performance gains on both 2D and 3D datasets."



Paperid:726
Authors:Raul Gomez, Jaume Gibert, Lluis Gomez, Dimosthenis Karatzas
Title: Location Sensitive Image Retrieval and Tagging
Abstract:
People from different parts of the globe describe objects and concepts in distinct manners. Visual appearance can thus vary across different geographic locations, which makes location a relevant contextual information when analysing visual data. In this work, we address the task of image retrieval related to a given tag conditioned on a certain location on Earth. We present LocSens, a model that learns to rank triplets of images, tags and coordinates by plausibility, and two training strategies to balance the location influence in the final ranking. LocSens learns to fuse textual and location information of multimodal queries to retrieve related images at different levels of location granularity, and successfully utilizes location information to improve image tagging. Code and models will be available to ensure reproducibility."
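A minimal sketch of ranking (image, tag, location) triplets by plausibility (feature dimensions and the fusion MLP are assumptions, not the LocSens architecture): a scorer fuses the three inputs, and a margin ranking loss pushes matching triplets above mismatched ones.

import torch
import torch.nn as nn

class TripletScorer(nn.Module):
    def __init__(self, img_dim=2048, vocab=10000, tag_dim=300):
        super().__init__()
        self.tag_emb = nn.Embedding(vocab, tag_dim)
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + tag_dim + 2, 512), nn.ReLU(),
            nn.Linear(512, 1),
        )

    def forward(self, img_feat, tag_id, latlon):
        # latlon assumed normalized, e.g. to [-1, 1]
        x = torch.cat([img_feat, self.tag_emb(tag_id), latlon], dim=1)
        return self.mlp(x).squeeze(1)                 # plausibility score

scorer = TripletScorer()
pos = scorer(torch.randn(8, 2048), torch.randint(0, 10000, (8,)), torch.rand(8, 2))
neg = scorer(torch.randn(8, 2048), torch.randint(0, 10000, (8,)), torch.rand(8, 2))
loss = nn.functional.margin_ranking_loss(pos, neg, torch.ones(8), margin=0.1)

A real system would mine hard negatives by swapping tags or locations of genuine triplets; random tensors are used here only to keep the example self-contained.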



Paperid:727
Authors:Wei Zeng, Sezer Karaoglu, Theo Gevers
Title: Joint 3D Layout and Depth Prediction from a Single Indoor Panorama Image
Abstract:
In this paper, we propose a method which jointly learns layout prediction and depth estimation from a single indoor panorama image. Previous methods have considered layout prediction and depth estimation from a single panorama image separately. However, these two tasks are tightly intertwined. Leveraging the layout depth map as an intermediate representation, our proposed method outperforms existing methods for both panorama layout prediction and depth estimation. Experiments on the challenging real-world dataset of Stanford 2D-3D demonstrate that our approach obtains superior performance for both the layout prediction tasks (3D IoU: 85.81% vs. 79.79%) and the depth estimation (Abs Rel: 0.068 vs. 0.079)."



Paperid:728
Authors:Wei Pang, Xiaojie Wang
Title: Guessing State Tracking for Visual Dialogue
Abstract:
The Guesser is a visual grounding task in GuessWhat?!-style visual dialogue: it locates the target object in an image that the Oracle has in mind, based on a question-answer dialogue between a Questioner and the Oracle. Most existing guessers make one and only one guess after receiving all question-answer pairs in a dialogue with a predefined number of rounds. This paper introduces a guessing state for the Guesser and regards guessing as a process in which the guessing state changes over the course of a dialogue. A guessing state tracking based guess model is therefore proposed. The guessing state is defined as a distribution over the objects in the image. Based on it, two loss functions are defined to supervise model training: early supervision guides the Guesser at early rounds, and incremental supervision enforces monotonicity of the guessing state. Experimental results on the GuessWhat?! dataset show that our model significantly outperforms previous models and achieves a new state of the art; in particular, its guessing success rate of 83.3% approaches the human-level accuracy of 84.4%."



Paperid:729
Authors:Ahmet Iscen, Jeffrey Zhang, Svetlana Lazebnik, Cordelia Schmid
Title: Memory-Efficient Incremental Learning Through Feature Adaptation
Abstract:
We introduce an approach for incremental learning that preserves feature descriptors of training images from previously learned classes, instead of the images themselves, unlike most existing work. Keeping the much lower-dimensional feature embeddings of images reduces the memory footprint significantly. We assume that the model is updated incrementally for new classes as new data becomes available sequentially. This requires adapting the previously stored feature vectors to the updated feature space without having access to the corresponding original training images. Feature adaptation is learned with a multi-layer perceptron, which is trained on feature pairs corresponding to the outputs of the original and updated network on a training image. We validate experimentally that such a transformation generalizes well to the features of the previous set of classes, and maps features to a discriminative subspace in the feature space. As a result, the classifier is optimized jointly over new and old classes without requiring old class images. Experimental results show that our method achieves state-of-the-art classification accuracy in incremental learning benchmarks, while having at least an order of magnitude lower memory footprint compared to image-preserving strategies."
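A minimal sketch of the feature-adaptation step (dimensions and the two-layer MLP are assumptions): the adapter is trained on pairs of features that the old and updated backbones produce for the same current images, and is then applied to the stored old-class feature vectors so they live in the new feature space.

import torch
import torch.nn as nn

adapter = nn.Sequential(                 # maps old feature space -> new one
    nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
opt = torch.optim.Adam(adapter.parameters(), lr=1e-3)

def train_adapter(old_feats, new_feats, epochs=10):
    """old_feats / new_feats: (N, 512) features of the same images extracted
    by the frozen old backbone and by the updated backbone."""
    for _ in range(epochs):
        pred = adapter(old_feats)
        loss = nn.functional.mse_loss(pred, new_feats)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stored exemplar features of previous classes can now be mapped into the
# updated space without re-reading the original images, e.g.:
#   adapted = adapter(stored_old_class_features)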



Paperid:730
Authors:Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, Matthias Nießner
Title: Neural Voice Puppetry: Audio-driven Facial Reenactment
Abstract:
We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence of a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with the audio of the source input. This audio-driven facial reenactment is driven by a deep neural network that employs a latent 3D face model space. Through the underlying 3D representation, the model inherently learns temporal stability while we leverage neural rendering to generate photo-realistic output frames. Our approach generalizes across different people, allowing us to synthesize videos of a target actor with the voice of any unknown source actor or even synthetic voices that can be generated utilizing standard text-to-speech approaches. Neural Voice Puppetry has a variety of use-cases, including audio-driven video avatars, video dubbing, and text-driven video synthesis of a talking head. We demonstrate the capabilities of our method in a series of audio- and text-based puppetry examples, including comparisons to state-of-the-art techniques and a user study."



Paperid:731
Authors:Antonio D’Innocente, Francesco Cappio Borlino, Silvia Bucci, Barbara Caputo, Tatiana Tommasi
Title: One-Shot Unsupervised Cross-Domain Detection
Abstract:
Despite impressive progress in object detection over the last years, it is still an open challenge to reliably detect objects across visual domains. Although the topic has attracted attention recently, current approaches all rely on the ability to access a sizable amount of target data for use at training time. This is a heavy assumption, as often it is not possible to anticipate the domain where a detector will be used, nor to access it in advance for data acquisition. Consider for instance the task of monitoring a feed of images from social media: as each image is created and uploaded by a different user, each belongs to a different target domain that is impossible to foresee during training. This paper addresses this setting, presenting an object detection algorithm able to perform unsupervised adaptation across domains by using only one target sample, seen at test time. We achieve this by introducing a multi-task architecture that one-shot adapts to any incoming sample by iteratively solving a self-supervised task on it. We further enhance this auxiliary adaptation with cross-task pseudo-labeling. A thorough benchmark analysis against the most recent cross-domain detection methods and a detailed ablation study show the advantage of our method, which sets the state-of-the-art in the defined one-shot scenario."



Paperid:732
Authors:Majed El Helou, Ruofan Zhou, Sabine Süsstrunk
Title: Stochastic Frequency Masking to Improve Super-Resolution and Denoising Networks
Abstract:
Super-resolution and denoising are ill-posed yet fundamental image restoration tasks. In blind settings, the degradation kernel or the noise level are unknown. This makes restoration even more challenging, notably for learning-based methods, as they tend to overfit to the degradation seen during training. We present an analysis, in the frequency domain, of degradation-kernel overfitting in super-resolution and introduce a conditional learning perspective that extends to both super-resolution and denoising. Building on our formulation, we propose a stochastic frequency masking of images used in training to regularize the networks and address the overfitting problem. Our technique improves state-of-the-art methods on blind super-resolution with different synthetic kernels, real super-resolution, blind Gaussian denoising, and real-image denoising."
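A minimal sketch of stochastic frequency masking (the radial-band masking policy below is an assumption made for illustration): an image is transformed to the Fourier domain, a randomly chosen frequency band is zeroed, and the result is transformed back to produce a regularized training input.

import numpy as np

def stochastic_frequency_mask(img, rng=np.random):
    """Zero a random radial band of frequencies of a grayscale image (H, W)."""
    h, w = img.shape
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.mgrid[:h, :w]
    # normalized distance from the spectrum center (0 = DC, 1 = corner)
    radius = np.hypot(yy - h / 2, xx - w / 2) / np.hypot(h / 2, w / 2)
    lo = rng.uniform(0.0, 0.8)                 # random band [lo, hi]
    hi = lo + rng.uniform(0.05, 0.2)
    spectrum[(radius >= lo) & (radius < hi)] = 0
    masked = np.fft.ifft2(np.fft.ifftshift(spectrum)).real
    return masked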



Paperid:733
Authors:Anthony Hu, Fergal Cotter, Nikhil Mohan, Corina Gurau, Alex Kendall
Title: Probabilistic Future Prediction for Video Scene Understanding
Abstract:
We present a novel deep learning architecture for probabilistic future prediction from video. We predict the future semantics, geometry and motion of complex real-world urban scenes and use this representation to control an autonomous vehicle. This work is the first to jointly predict ego-motion, static scene, and the motion of dynamic agents in a probabilistic manner, which allows sampling consistent, highly probable futures from a compact latent space. Our model learns a representation from RGB video with a spatio-temporal convolutional module. The learned representation can be explicitly decoded to future semantic segmentation, depth, and optical flow, in addition to being an input to a learnt driving policy. To model the stochasticity of the future, we introduce a conditional variational approach which minimises the divergence between the present distribution (what could happen given what we have seen) and the future distribution (what we observe actually happens). During inference, diverse futures are generated by sampling from the present distribution."
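A minimal sketch of the present/future distribution matching described above (dimensions, the network heads, and the direction of the KL term are assumptions): two small heads predict diagonal Gaussians, a KL divergence pulls them together during training, and diverse futures are sampled from the present distribution at inference time.

import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

class GaussianHead(nn.Module):
    def __init__(self, in_dim=256, latent=32):
        super().__init__()
        self.fc = nn.Linear(in_dim, 2 * latent)

    def forward(self, x):
        mu, log_sigma = self.fc(x).chunk(2, dim=-1)
        return Normal(mu, log_sigma.exp())

present_head, future_head = GaussianHead(), GaussianHead()
present = present_head(torch.randn(4, 256))   # computed from observed frames only
future = future_head(torch.randn(4, 256))     # computed from observed + future frames
kl = kl_divergence(future, present).sum(-1).mean()   # divergence term of the loss
sample = present.rsample()                    # a sampled future at inference time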



Paperid:734
Authors:Xiaojiang Peng, Kai Wang, Zhaoyang Zeng, Qing Li, Jianfei Yang, Yu Qiao
Title: Suppressing Mislabeled Data via Grouping and Self-Attention
Abstract:
Deep networks achieve excellent results on large-scale clean data but degrade significantly when learning from noisy labels. To suppress the impact of mislabeled data, this paper proposes a conceptually simple yet efficient training block, termed Attentive Feature Mixup (AFM), which allows paying more attention to clean samples and less to mislabeled ones via sample interactions in small groups. Specifically, this plug-and-play AFM first leverages a group-to-attend module to construct groups and assign attention weights for group-wise samples, and then uses a mixup module with the attention weights to interpolate massive noise-suppressed samples. AFM has several appealing benefits for noise-robust deep learning. (i) It does not rely on any assumptions or an extra clean subset. (ii) With massive interpolations, the ratio of useless samples is reduced dramatically compared to the original noisy ratio. (iii) It jointly optimizes the interpolation weights with the classifiers, suppressing the influence of mislabeled data via low attention weights. (iv) It partially inherits the vicinal risk minimization of mixup to alleviate over-fitting, while improving it by sampling fewer feature-target vectors around mislabeled data from the mixup vicinal distribution. Extensive experiments demonstrate that AFM yields state-of-the-art results on two challenging real-world noisy datasets: Food101N and Clothing1M."
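A minimal sketch of attention-weighted mixup within small groups (the group size and the attention head are placeholders, not the paper's group-to-attend module): features and one-hot targets are interpolated with the same learned weights, so low-attention samples contribute little to the mixed training pair.

import torch
import torch.nn as nn

class AttentiveMixup(nn.Module):
    def __init__(self, feat_dim=512, group_size=2):
        super().__init__()
        self.group_size = group_size
        self.attend = nn.Linear(feat_dim, 1)     # scores each sample in a group

    def forward(self, feats, onehot_targets):
        """feats: (B, D), onehot_targets: (B, C); B divisible by group_size."""
        g = self.group_size
        feats = feats.view(-1, g, feats.size(-1))                  # (B/g, g, D)
        targets = onehot_targets.view(-1, g, onehot_targets.size(-1))
        weights = torch.softmax(self.attend(feats).squeeze(-1), dim=1)  # (B/g, g)
        mixed_x = (weights.unsqueeze(-1) * feats).sum(1)           # weighted mixup
        mixed_y = (weights.unsqueeze(-1) * targets).sum(1)
        return mixed_x, mixed_y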



Paperid:735
Authors:Hanzhe Hu, Deyi Ji, Weihao Gan, Shuai Bai, Wei Wu, Junjie Yan
Title: Class-wise Dynamic Graph Convolution for Semantic Segmentation
Abstract:
Recent works have made great progress in semantic segmentation by exploiting contextual information in a local or global manner with dilated convolutions, pyramid pooling or self-attention mechanism. In order to avoid potential misleading contextual information aggregation in previous work, we propose a class-wise dynamic graph convolution (CDGC) module to adaptively propagate information. The graph reasoning is performed among pixels in the same class. Based on the proposed CDGC module, we further introduce the Class-wise Dynamic Graph Convolution Network (CDGCNet), which consists of two main parts including the CDGC module and a basic segmentation network, forming a coarse-to-fine paradigm. Specifically, the CDGC module takes the coarse segmentation result as class mask to extract node features for graph construction and performs dynamic graph convolutions on the constructed graph to learn the feature aggregation and weight allocation. Then the refined feature and the original feature are fused to get the final prediction. We conduct extensive experiments on three popular semantic segmentation benchmarks including Cityscapes, PASCAL VOC 2012 and COCO Stuff, and achieve state-of-the-art performance on all three benchmarks."



Paperid:736
Authors:Yun-Zhu Song, Zhi Rui Tam, Hung-Jen Chen, Huiao-Han Lu, Hong-Han Shuai
Title: Character-Preserving Coherent Story Visualization
Abstract:
Story visualization aims at generating a sequence of images to narrate each sentence in a multi-sentence story. Different from video generation that focuses on maintaining the continuity of generated images (frames), story visualization emphasizes preserving the global consistency of characters and scenes across different story pictures, which is very challenging since story sentences only provide sparse signals for generating images. Therefore, we propose a new framework named Character-Preserving Coherent Story Visualization (CP-CSV) to tackle the challenges. CP-CSV effectively learns to visualize the story by three critical modules: story and context encoder (story and sentence representation learning), figure-ground segmentation (auxiliary task to provide information for preserving character and story consistency), and figure-ground aware generation (image sequence generation by incorporating figure-ground information). Moreover, we propose a metric named Fréchet Story Distance (FSD) to evaluate the performance of story visualization. Extensive experiments demonstrate that CP-CSV maintains the details of character information and achieves high consistency among different frames, while FSD better measures the performance of story visualization."



Paperid:737
Authors:Tianyi Wu, Yu Lu, Yu Zhu, Chuang Zhang, Ming Wu, Zhanyu Ma, Guodong Guo
Title: GINet: Graph Interaction Network for Scene Parsing
Abstract:
Recently, context reasoning using image regions beyond local convolution has shown great potential for scene parsing. In this work, we explore how to incorporate linguistic knowledge to promote context reasoning over image regions by proposing a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss). The GI unit is capable of enhancing feature representations of convolution networks over high-level semantics and learning the semantic coherency adaptively for each sample. Specifically, dataset-based linguistic knowledge is first incorporated in the GI unit to promote context reasoning over the visual graph, then the evolved representations of the visual graph are mapped to each local representation to enhance the discriminative capability for scene parsing. The GI unit is further improved by the SC-loss to enhance the semantic representations over the exemplar-based semantic graph. We perform full ablation studies to demonstrate the effectiveness of each component in our approach. In particular, the proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff."



Paperid:738
Authors:Wanli Chen, Xinge Zhu, Ruoqi Sun, Junjun He, Ruiyu Li, Xiaoyong Shen, Bei Yu
Title: Tensor Low-Rank Reconstruction for Semantic Segmentation
Abstract:
Context information plays an indispensable role in the success of semantic segmentation. Recently, non-local self-attention based methods have proven effective for context information collection. Since the desired context consists of spatial-wise and channel-wise attentions, a 3D representation is an appropriate formulation. However, these non-local methods describe 3D context information based on a 2D similarity matrix, where space compression may lead to channel-wise attention missing. An alternative is to model the contextual information directly without compression. However, this effort confronts a fundamental difficulty, namely the high-rank property of context information. In this paper, we propose a new approach to model the 3D context representations, which not only avoids the space compression but also tackles the high-rank difficulty. Here, inspired by tensor canonical-polyadic decomposition theory (i.e., a high-rank tensor can be expressed as a combination of rank-1 tensors), we design a low-rank-to-high-rank context reconstruction framework (i.e., RecoNet). Specifically, we first introduce the tensor generation module (TGM), which generates a number of rank-1 tensors to capture fragments of context feature. Then we use these rank-1 tensors to recover the high-rank context features through our proposed tensor reconstruction module (TRM). Extensive experiments show that our method achieves state-of-the-art results on various public datasets. Additionally, our proposed method has more than 100 times less computational cost compared with conventional non-local-based methods."
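A minimal sketch of the low-rank-to-high-rank idea (the rank, the factor heads, and the fixed spatial size are assumptions): a C x H x W context tensor is reconstructed as a sum of rank-1 tensors, each the outer product of a channel, a height and a width factor, and the result is used to re-weight the input features.

import torch
import torch.nn as nn

class LowRankContext(nn.Module):
    """Reconstruct a (C, H, W) context tensor as a sum of r rank-1 tensors
    (CP form): sum_r  c_r (outer) h_r (outer) w_r.  H and W must match the
    constructor arguments; this restriction is only for the sketch."""
    def __init__(self, channels, height, width, rank=8):
        super().__init__()
        self.rank = rank
        self.c_head = nn.Linear(channels, rank * channels)
        self.h_head = nn.Linear(channels, rank * height)
        self.w_head = nn.Linear(channels, rank * width)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        g = x.mean(dim=(2, 3))                  # (B, C) global descriptor
        cf = self.c_head(g).view(b, self.rank, c)
        hf = self.h_head(g).view(b, self.rank, h)
        wf = self.w_head(g).view(b, self.rank, w)
        context = torch.einsum('brc,brh,brw->bchw', cf, hf, wf)
        return x * torch.sigmoid(context)       # re-weight features with context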



Paperid:739
Authors:Xilai Li, Wei Sun, Tianfu Wu
Title: Attentive Normalization
Abstract:
In state-of-the-art deep neural networks, both feature normalization and feature attention have become ubiquitous with significant performance improvement shown in a vast amount of tasks. They are usually studied as separate modules, however. In this paper, we propose a light-weight integration of the two schemes. We present Attentive Normalization (AN). Instead of learning a single affine transformation, AN learns a mixture of affine transformations and utilizes their weighted-sum as the final affine transformation applied to re-calibrate features in an instance-specific way. The weights are learned by leveraging channel-wise feature attention. In experiments, we test the proposed AN using four representative neural architectures (ResNets, DenseNets, MobileNets-v2 and AOGNets) in the ImageNet-1000 classification benchmark and the MS-COCO 2017 object detection and instance segmentation benchmark. AN obtains consistent performance improvement for different neural architectures in both benchmarks with absolute increase of top-1 accuracy in ImageNet-1000 between 0.5% and 2.7%, and absolute increase up to 1.8% and 2.2% for bounding box and mask AP in MS-COCO respectively. We observe that the proposed AN provides a strong alternative to the widely used Squeeze-and-Excitation (SE) module. Our reproducible source code is publicly available."
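A minimal sketch of a mixture-of-affine normalization layer (the number of components K and the attention head are assumptions): instance-specific mixture weights come from channel attention, and the weighted-sum gamma and beta re-calibrate the standardized feature.

import torch
import torch.nn as nn

class AttentiveNorm2d(nn.Module):
    def __init__(self, channels, k=5):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)      # standardization only
        self.gammas = nn.Parameter(torch.ones(k, channels))   # K affine transforms
        self.betas = nn.Parameter(torch.zeros(k, channels))
        self.attn = nn.Sequential(                             # channel attention -> K weights
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, k), nn.Softmax(dim=1))

    def forward(self, x):
        w = self.attn(x)                                       # (B, K) mixture weights
        gamma = w @ self.gammas                                # (B, C) instance-specific affine
        beta = w @ self.betas
        return self.bn(x) * gamma[:, :, None, None] + beta[:, :, None, None]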



Paperid:740
Authors:Jin Xie, Hisham Cholakkal, Rao Muhammad Anwer, Fahad Shahbaz Khan, Yanwei Pang, Ling Shao, Mubarak Shah
Title: Count- and Similarity-aware R-CNN for Pedestrian Detection
Abstract:
Recent pedestrian detection methods generally rely on additional supervision, such as visible bounding-box annotations, to handle heavy occlusions. We propose an approach that leverages pedestrian count and proposal similarity information within a two-stage pedestrian detection framework. Both pedestrian count and proposal similarity are derived from standard full-body annotations commonly used to train pedestrian detectors. We introduce a count-weighted detection loss function that assigns higher weights to the detection errors occurring at highly overlapping pedestrians. The proposed loss function is utilized at both stages of the two-stage detector. We further introduce a count-and-similarity branch within the two-stage detection framework, which predicts pedestrian count as well as proposal similarity. Lastly, we introduce a count- and similarity-aware NMS strategy to identify distinct proposals. Our approach requires neither part information nor visible bounding-box annotations. Experiments are performed on the CityPersons and CrowdHuman datasets. Our method sets a new state-of-the-art on both datasets. Further, it achieves an absolute gain of 2.4% over the current state-of-the-art, in terms of log-average miss rate, on the heavily occluded (HO) set of the CityPersons test set. Finally, we demonstrate the applicability of our approach for the problem of human instance segmentation. Code and models are available at: https://github.com/Leotju/CaSe."



Paperid:741
Authors:Gianni Franchi, Andrei Bursuc, Emanuel Aldea, Séverine Dubuisson, Isabelle Bloch
Title: TRADI: Tracking Deep Neural network Weight Distributions
Abstract:
During training, the weights of a Deep Neural Network (DNN) are optimized from a random initialization towards a nearly optimum value minimizing a loss function. Only this final state of the weights is typically kept for testing, while the wealth of information on the geometry of the weight space, accumulated over the descent towards the minimum, is discarded. In this work we propose to make use of this knowledge and leverage it for computing the distributions of the weights of the DNN. This can be further used for estimating the epistemic uncertainty of the DNN by aggregating predictions from an ensemble of networks sampled from these distributions. To this end, we introduce a method for tracking the trajectory of the weights during optimization, which requires no change in either the architecture or the training procedure. We evaluate our method, TRADI, on standard classification and regression benchmarks, and on out-of-distribution detection for classification and semantic segmentation. We achieve competitive results, while preserving computational efficiency in comparison to ensemble approaches."
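A minimal sketch of tracking weight statistics along the optimization trajectory (exponential moving averages are used here in place of the paper's tracking scheme): per-parameter means and variances are updated after each step, and networks sampled from the resulting Gaussians form an ensemble for epistemic-uncertainty estimation.

import copy
import torch

class WeightTracker:
    """Track running mean/variance of each parameter during training."""
    def __init__(self, model, momentum=0.99):
        self.m = momentum
        self.mean = {n: p.detach().clone() for n, p in model.named_parameters()}
        self.var = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

    def update(self, model):
        # call after every optimizer step
        for n, p in model.named_parameters():
            delta = p.detach() - self.mean[n]
            self.mean[n] = self.mean[n] + (1 - self.m) * delta
            self.var[n] = self.m * self.var[n] + (1 - self.m) * delta ** 2

    def sample_network(self, model):
        """Return a copy of `model` with weights drawn from the tracked Gaussians."""
        sampled = copy.deepcopy(model)
        for n, p in sampled.named_parameters():
            p.data.copy_(self.mean[n] + self.var[n].sqrt() * torch.randn_like(p))
        return sampled

# At test time, averaging the predictions of several tracker.sample_network(model)
# copies gives an ensemble-style epistemic-uncertainty estimate.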



Paperid:742
Authors:Aishan Liu, Tairan Huang, Xianglong Liu, Yitao Xu, Yuqing Ma, Xinyun Chen, Stephen J. Maybank, Dacheng Tao
Title: Spatiotemporal Attacks for Embodied Agents
Abstract:
Adversarial attacks are valuable for providing insights into the blind-spots of deep learning models and help improve their robustness. Existing work on adversarial attacks has mainly focused on static scenes; however, it remains unclear whether such attacks are effective against embodied agents, which can navigate and interact with a dynamic environment. In this work, we take the first step to study adversarial attacks for embodied agents. In particular, we generate spatiotemporal perturbations to form 3D adversarial examples, which exploit the interaction history in both the temporal and spatial dimensions. Regarding the temporal dimension, since agents make predictions based on historical observations, we develop a trajectory attention module to explore scene view contributions, which further helps localize the 3D objects that appear with the highest stimuli. Combining these clues from the temporal dimension, along the spatial dimension we adversarially perturb the physical properties (e.g., texture and 3D shape) of the contextual objects that appear in the most important scene views. Extensive experiments on the EQA-v1 dataset for several embodied tasks, in both the white-box and black-box settings, demonstrate that our perturbations have strong attack and generalization abilities."



Paperid:743
Authors:Qingqiu Huang, Lei Yang, Huaiyi Huang, Tong Wu, Dahua Lin
Title: Caption-Supervised Face Recognition: Training a State-of-the-Art Face Model without Manual Annotation
Abstract:
The advances over the past several years have pushed the performance of face recognition to an amazing level. This great success, to a large extent, is built on top of millions of annotated samples. However, as we endeavor to take the performance to the next level, the reliance on annotated data becomes a major obstacle. We desire to explore an alternative approach, namely using captioned images for training, as an attempt to mitigate this difficulty. Captioned images are widely available on the web, while the captions often contain the names of the subjects in the images. Hence, an effective method to leverage such data would significantly reduce the need of human annotations. However, an important challenge along this way needs to be tackled: the names in the captions are often noisy and ambiguous, especially when there are multiple names in the captions or multiple people in the photos. In this work, we propose a simple yet effective method, which trains a face recognition model by progressively expanding the labeled set via both selective propagation and caption-driven expansion. We build a large-scale dataset of captioned images, which contain 6.3M faces from 305K subjects. Our experiments show that using the proposed method, we can train a state-of-the-art face recognition model without manual annotation (99.65% in LFW). This shows the great potential of caption-supervised face recognition."



Paperid:744
Authors:Liqian Ma, Zhe Lin, Connelly Barnes, Alexei A Efros, Jingwan Lu
Title: Unselfie: Translating Selfies to Neutral-pose Portraits in the Wild
Abstract:
Due to the ubiquity of smartphones, it is popular to take photos of one's self, or "selfies." Such photos are convenient to take, because they do not require specialized equipment or a third-party photographer. However, in selfies, constraints such as human arm length often make the body pose look unnatural. To address this issue, we introduce unselfie, a novel photographic transformation that automatically translates a selfie into a neutral-pose portrait. To achieve this, we first collect an unpaired dataset, and introduce a way to synthesize paired training data for self-supervised learning. Then, to unselfie a photo, we propose a new three-stage pipeline, where we first find a target neutral pose, inpaint the body texture, and finally refine and composite the person on the background. To obtain a suitable target neutral pose, we propose a novel nearest pose search module that makes the reposing task easier and enables the generation of multiple neutral-pose results among which users can choose the best one they like. Qualitative and quantitative evaluations show the superiority of our pipeline over alternatives."



Paperid:745
Authors:Xiao Yang, Fangyun Wei, Hongyang Zhang, Jun Zhu
Title: Design and Interpretation of Universal Adversarial Patches in Face Detection
Abstract:
We consider universal adversarial patches for faces --- small visual elements whose addition to a face image reliably destroys the performance of face detectors. Unlike previous work that mostly focused on the algorithmic design of adversarial examples in terms of improving the success rate as an attacker, in this work we show an interpretation of such patches that can prevent the state-of-the-art face detectors from detecting the real faces. We investigate a phenomenon: patches designed to suppress real face detection appear face-like. This phenomenon holds generally across different initialization, locations, scales of patches, backbones, and state-of-the-art face detection frameworks. We propose new optimization-based approaches to automatic design of universal adversarial patches for varying goals of the attack, including scenarios in which true positives are suppressed without introducing false positives. Our proposed algorithms perform well on real-world datasets, deceiving state-of-the-art face detectors in terms of multiple precision/recall metrics and transferability."



Paperid:746
Authors:Yang Xiao, Renaud Marlet
Title: Few-Shot Object Detection and Viewpoint Estimation for Objects in the Wild
Abstract:
Detecting objects and estimating their viewpoint in images are key tasks of 3D scene understanding. Recent approaches have achieved excellent results on very large benchmarks for object detection and viewpoint estimation. However, performances are still lagging behind for novel object categories with few samples. In this paper, we tackle the problems of few-shot object detection and few-shot viewpoint estimation. We propose a meta-learning framework that can be applied to both tasks, possibly including 3D data. Our models improve the results on objects of novel classes by leveraging rich feature information originating from base classes with many samples. A simple joint feature embedding module is proposed to make the most of this feature sharing. Despite its simplicity, our method outperforms state-of-the-art methods by a large margin on a range of datasets, including PASCAL VOC and MS COCO for few-shot object detection, and Pascal3D+ and ObjectNet3D for few-shot viewpoint estimation. And for the first time, we tackle the combination of both few-shot tasks, on ObjectNet3D, showing promising results."



Paperid:747
Authors:Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Otmar Hilliges, Jan Kautz
Title: Weakly Supervised 3D Hand Pose Estimation via Biomechanical Constraints
Abstract:
Estimating 3D hand pose from 2D images is a difficult, inverse problem due to the inherent scale and depth ambiguities. Current state-of-the-art methods train fully supervised deep neural networks with 3D ground-truth data. However, acquiring 3D annotations is expensive, typically requiring calibrated multi-view setups or labour intensive manual annotations. While annotations of 2D keypoints are much easier to obtain, how to efficiently leverage such weakly-supervised data to improve the task of 3D hand pose prediction remains an important open question. The key difficulty stems from the fact that direct application of additional 2D supervision mostly benefits the 2D proxy objective but does little to alleviate the depth and scale ambiguities. Embracing this challenge we propose a set of novel losses that constrain the prediction of a neural network to lie within the range of biomechanically feasible 3D hand configurations. We show by extensive experiments that our proposed constraints significantly reduce the depth ambiguity and allow the network to more effectively leverage additional 2D annotated images. For example, on the challenging FreiHAND dataset, using additional 2D annotation without our proposed biomechanical constraints reduces the depth error by only 15%, whereas the error is reduced significantly by 50% when the proposed biomechanical constraints are used."
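A minimal sketch of one biomechanically motivated penalty (the bone topology and length ranges below are placeholders; the paper uses a richer constraint set, e.g. joint-angle limits): predicted 3D bone lengths are pushed into plausible intervals, which counteracts the depth and scale ambiguity of 2D-only supervision.

import torch

# (parent, child) joint index pairs of a hypothetical hand skeleton
BONES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5)]
MIN_LEN = torch.tensor([0.02, 0.02, 0.015, 0.03, 0.025])   # metres, illustrative values
MAX_LEN = torch.tensor([0.05, 0.05, 0.040, 0.07, 0.060])

def bone_length_loss(joints3d):
    """joints3d: (B, J, 3) predicted 3D joints in metres."""
    parents = torch.tensor([b[0] for b in BONES])
    children = torch.tensor([b[1] for b in BONES])
    lengths = (joints3d[:, children] - joints3d[:, parents]).norm(dim=-1)  # (B, n_bones)
    too_short = torch.relu(MIN_LEN - lengths)      # penalty below the plausible range
    too_long = torch.relu(lengths - MAX_LEN)       # penalty above the plausible range
    return (too_short + too_long).mean()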



Paperid:748
Authors:Mang Ye, Jianbing Shen, David J. Crandall, Ling Shao, Jiebo Luo
Title: Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification
Abstract:
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem. Due to the large intra-class variations and cross-modality discrepancy with a large amount of sample noise, it is difficult to learn discriminative part features. Existing VI-ReID methods instead tend to learn global representations, which have limited discriminability and weak robustness to noisy images. In this paper, we propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID. We propose an intra-modality weighted-part attention module to extract discriminative part-aggregated features, by imposing the domain knowledge on the part relationship mining. To enhance robustness against noisy samples, we introduce cross-modality graph structured attention to reinforce the representation with the contextual relations across the two modalities. We also develop a parameter-free dynamic dual aggregation learning strategy to adaptively integrate the two components in a progressive joint training manner. Extensive experiments demonstrate that DDAG outperforms the state-of-the-art methods under various settings."



Paperid:749
Authors:Hai Wang, Wei-shi Zheng, Ling Yingbiao
Title: Contextual Heterogeneous Graph Network for Human-Object Interaction Detection
Abstract:
Human-object interaction (HOI) detection is an important task for understanding human activity. A graph structure is appropriate for denoting the HOIs in a scene. Since there is a subordination between human and object (the human plays the subjective role and the object plays the objective role in an HOI), the relations between homogeneous entities and between heterogeneous entities in the scene should not be treated identically. However, previous graph models regard humans and objects as the same kind of nodes and do not consider that messages between different kinds of entities should differ. In this work, we address this problem for the HOI task by proposing a heterogeneous graph network that models humans and objects as different kinds of nodes and incorporates intra-class messages between homogeneous nodes and inter-class messages between heterogeneous nodes. In addition, a graph attention mechanism based on the intra-class context and inter-class context is exploited to improve the learning. Extensive experiments on the benchmark datasets V-COCO and HICO-DET verify the effectiveness of our method and demonstrate the importance of extracting distinct intra-class and inter-class messages in HOI detection."



Paperid:750
Authors:Xi Cheng, Zhenyong Fu, Jian Yang
Title: Zero-Shot Image Super-Resolution with Depth Guided Internal Degradation Learning
Abstract:
In the past few years, we have witnessed the great progress of image super-resolution (SR) thanks to the power of deep learning. However, a major limitation of the current image SR approaches is that they assume a pre-determined degradation model or kernel, e.g. bicubic, controls the image degradation process. This makes them easily fail to generalize in a real-world or non-ideal environment since the degradation model of an unseen image may not obey the pre-determined kernel used when training the SR model. In this work, we present a simple yet effective zero-shot image super-resolution model. Our zero-shot SR model learns an image-specific super-resolution network (SRN) from a low-resolution input image alone, without relying on external training sets. To circumvent the difficulty caused by the unknown internal degradation model of an image, we propose to learn an image-specific degradation simulation network (DSN) together with our image-specific SRN. Specifically, we exploit the depth information, naturally indicating the scales of local image patches, of an image to extract the unpaired high/low-resolution patch collection to train our networks. According to the benchmark test on four datasets with depth labels or estimated depth maps, our proposed depth guided degradation model learning based image super-resolution (DGDML-SR) achieves visually pleasing results and can outperform the state of the art in perceptual metrics."



Paperid:751
Authors:Dennis Madsen, Andreas Morel-Forster, Patrick Kahr, Dana Rahbani, Thomas Vetter, Marcel Lüthi
Title: A Closest Point Proposal for MCMC-based Probabilistic Surface Registration
Abstract:
We propose to view non-rigid surface registration as a probabilistic inference problem. Given a target surface, we estimate the posterior distribution of surface registrations. We demonstrate how the posterior distribution can be used to build shape models that generalize better and show how to visualize the uncertainty in the established correspondence. Furthermore, in a reconstruction task, we show how to estimate the posterior distribution of missing data without assuming a fixed point-to-point correspondence. We introduce the closest-point proposal for the Metropolis-Hastings algorithm. Our proposal overcomes the limitation of slow convergence compared to a random-walk strategy. As the algorithm decouples inference from modeling the posterior using a propose-and-verify scheme, we show how to choose different distance measures for the likelihood model. All presented results are fully reproducible using publicly available data and our open-source implementation of the registration framework."



Paperid:752
Authors:Yuk Heo, Yeong Jun Koh, Chang-Su Kim
Title: Interactive Video Object Segmentation Using Global and Local Transfer Modules
Abstract:
An interactive video object segmentation algorithm, which takes scribble annotations on query objects as input, is proposed in this paper. We develop a deep neural network, which consists of the annotation network (A-Net) and the transfer network (T-Net). First, given user scribbles on a frame, A-Net yields a segmentation result based on the encoder-decoder architecture. Second, T-Net transfers the segmentation result bidirectionally to the other frames, by employing the global and local transfer modules. The global transfer module conveys the segmentation information in an annotated frame to a target frame, while the local transfer module propagates the segmentation information in a temporally adjacent frame to the target frame. By applying A-Net and T-Net alternately, a user can obtain desired segmentation results with minimal efforts. We train the entire network in two stages, by emulating user scribbles and employing an auxiliary loss. Experimental results demonstrate that the proposed interactive video object segmentation algorithm outperforms the state-of-the-art conventional algorithms."



Paperid:753
Authors:Thomas Eboli, Jian Sun, Jean Ponce
Title: End-to-end Interpretable Learning of Non-blind Image Deblurring
Abstract:
Non-blind image deblurring is typically formulated as a linear least-squares problem regularized by natural priors on the corresponding sharp picture's gradients, which can be solved, for example, using a half-quadratic splitting method with Richardson fixed-point iterations for its least-squares updates and a proximal operator for the auxiliary variable updates. We propose to precondition the Richardson solver using approximate inverse filters of the (known) blur and natural image prior kernels. Using convolutions instead of a generic linear preconditioner allows extremely efficient parameter sharing across the image, and leads to significant gains in accuracy and/or speed compared to classical FFT and conjugate-gradient methods. More importantly, the proposed architecture is easily adapted to learning both the preconditioner and the proximal operator using CNN embeddings. This yields a simple and efficient algorithm for non-blind image deblurring which is fully interpretable, can be learned end to end, and whose accuracy matches or exceeds the state of the art, quite significantly, in the non-uniform case."
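A minimal sketch of preconditioned Richardson iterations for non-blind deblurring (worked in the Fourier domain with circular convolution, and with a simple Tikhonov term standing in for the learned prior and proximal operator described in the abstract):

import numpy as np

def deblur_richardson(blurred, kernel, lam=1e-2, iters=50):
    """Solve (K^T K + lam I) x = K^T b with preconditioned Richardson iterations.

    blurred : (H, W) observed image b
    kernel  : (h, w) known blur kernel (circular convolution assumed)
    """
    H, W = blurred.shape
    K = np.fft.fft2(kernel, s=(H, W))            # blur operator in the Fourier domain
    B = np.fft.fft2(blurred)
    M = np.abs(K) ** 2 + lam                     # normal-equation operator
    P = 1.0 / (np.abs(K) ** 2 + 10 * lam)        # approximate inverse filter (preconditioner)
    rhs = np.conj(K) * B
    X = B.copy()                                 # initialization (arbitrary)
    for _ in range(iters):
        X = X + P * (rhs - M * X)                # x <- x + P (rhs - M x)
    return np.real(np.fft.ifft2(X))

Because 0 < P*M < 1 everywhere in the spectrum, the fixed-point iteration converges; in the paper both the preconditioner and the proximal operator are learned as CNN embeddings rather than hand-set as here.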



Paperid:754
Authors:Junsong Fan, Zhaoxiang Zhang, Tieniu Tan
Title: Employing Multi-Estimations for Weakly-Supervised Semantic Segmentation
Abstract:
Image-level label based weakly-supervised semantic segmentation (WSSS) aims to adopt image-level labels to train semantic segmentation models, saving the vast human labor required for costly pixel-level annotations. A typical pipeline for this problem is first to adopt class activation maps (CAM) with image-level labels to generate pseudo-masks (a.k.a. seeds) and then use them for training segmentation models. The main difficulty is that seeds are usually sparse and incomplete. Related works typically try to alleviate this problem by adopting many bells and whistles to enhance the seeds. Instead of struggling to refine a single seed, we propose a novel approach to alleviate the inaccurate seed problem by leveraging the segmentation model's robustness to learn from multiple seeds. We managed to generate many different seeds for each image, which are different estimates of the underlying ground truth. The segmentation model simultaneously exploits these seeds to learn and automatically decides the confidence of each seed. Extensive experiments on Pascal VOC 2012 demonstrate the advantage of this multi-seed strategy over previous state-of-the-art."



Paperid:755
Authors:Jing Zhang, Jianwen Xie, Nick Barnes
Title: Learning Noise-Aware Encoder-Decoder from Noisy Labels by Alternating Back-Propagation for Saliency Detection
Abstract:
In this paper, we propose a noise-aware encoder-decoder framework to disentangle a clean saliency predictor from noisy training examples, where the noisy labels are generated by unsupervised handcrafted feature-based methods. The proposed model consists of two sub-models parameterized by neural networks: (1) a saliency predictor that maps input images to clean saliency maps, and (2) a noise generator, which is a latent variable model that produces noises from Gaussian latent vectors. The whole model that represents noisy labels is a sum of the two sub-models. The goal of training the model is to estimate the parameters of both sub-models, and simultaneously infer the corresponding latent vector of each noisy label. We propose to train the model by using an alternating back-propagation (ABP) algorithm, which alternates the following two steps: (1) learning back-propagation for estimating the parameters of two sub-models by gradient ascent, and (2) inferential back-propagation for inferring the latent vectors of training noisy examples by Langevin Dynamics. To prevent the network from converging to trivial solutions, we utilize an edge-aware smoothness loss to regularize hidden saliency maps to have similar structures as their corresponding images. Experimental results on several benchmark datasets indicate the effectiveness of the proposed model. "
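A minimal sketch of the inferential back-propagation step (the toy generator interface and the fixed step size are assumptions): the latent vector of a training example is refined by Langevin dynamics, i.e. a gradient step on the log-posterior plus injected Gaussian noise, while the network parameters stay fixed.

import torch

def langevin_infer(z, noise_net, target_residual, steps=20, step_size=0.1):
    """Infer latent z such that noise_net(z) explains the residual (noisy label - prediction).

    z               : (B, d) latent vectors to be refined
    noise_net       : maps latents to noise maps (any differentiable module)
    target_residual : observed noisy label minus the predicted saliency map
    """
    z = z.detach().clone().requires_grad_(True)
    for _ in range(steps):
        recon = noise_net(z)
        # log-posterior up to a constant: -||residual - recon||^2 / 2 - ||z||^2 / 2
        log_post = -0.5 * ((target_residual - recon) ** 2).sum() - 0.5 * (z ** 2).sum()
        grad = torch.autograd.grad(log_post, z)[0]
        with torch.no_grad():
            z += 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(z)
    return z.detach()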



Paperid:756
Authors:Yinglong Wang, Yibing Song, Chao Ma, Bing Zeng
Title: Rethinking Image Deraining via Rain Streaks and Vapors
Abstract:
Single image deraining regards an input image as a fusion of a background image, a transmission map, rain streaks, and atmosphere light. While advanced models are proposed for image restoration (i.e., background image generation), they regard rain streaks with the same properties as background rather than transmission medium. As vapors (i.e., rain streaks accumulation or fog-like rain) are conveyed in the transmission map to model the veiling effect, the fusion of rain streaks and vapors does not naturally reflect the rain image formation. In this work, we reformulate rain streaks as transmission medium together with vapors to model rain imaging. We propose an encoder-decoder CNN named SNet to learn the transmission map of rain streaks. As rain streaks appear with various shapes and directions, we use ShuffleNet units within SNet to capture their anisotropic representations. As vapors are brought by rain streaks, we propose a VNet containing spatial pyramid pooling (SPP) to predict the transmission map of vapors in multi-scales based on that of rain streaks. Meanwhile, we use an encoder CNN named ANet to estimate atmosphere light. The SNet, VNet, and ANet are jointly trained to predict transmission maps and atmosphere light for rain image restoration. Extensive experiments on the benchmark datasets demonstrate the effectiveness of the proposed visual model to predict rain streaks and vapors. The proposed deraining method performs favorably against state-of-the-art deraining approaches."



Paperid:757
Authors:Marcelo Gennari do Nascimento, Theo W. Costain, Victor Adrian Prisacariu
Title: Finding Non-Uniform Quantization Schemes using Multi-Task Gaussian Processes
Abstract:
We propose a novel method for neural network quantization that casts the neural architecture search problem as one of hyperparameter search to find non-uniform bit distributions throughout the layers of a CNN. We perform the search assuming a Multi-Task Gaussian Process prior, which splits the problem into multiple tasks, each corresponding to a different number of training epochs, and explore the space by sampling those configurations that yield maximum information. We then show that with significantly lower precision in the last layers we achieve a minimal loss of accuracy with appreciable memory savings. We test our findings on the CIFAR10 and ImageNet datasets using the VGG, ResNet and GoogLeNet architectures."



Paperid:758
Authors:Daksh Thapar, Chetan Arora, Aditya Nigam
Title: Is Sharing of Egocentric Video Giving Away Your Biometric Signature?
Abstract:
Easy availability of wearable egocentric cameras, and the sense of privacy propagated by the fact that the wearer is never seen in the captured videos, has led to a tremendous rise in public sharing of such videos. Unlike hand-held cameras, egocentric cameras are harnessed on the wearer’s head, which makes it possible to track the wearer’s head motion by observing optical flow in the egocentric videos. In this work, we create a novel kind of privacy attack by extracting the wearer’s gait profile, a well known biometric signature, from such optical flow in the egocentric videos. We demonstrate strong wearer recognition capabilities based on extracted gait features, an unprecedented and critical weakness completely absent in hand-held videos. We demonstrate the following attack scenarios: (1) In a closed-set scenario, we show that it is possible to recognize the wearer of an egocentric video with an accuracy of more than 92.5% on the benchmark video dataset. (2) In an open-set setting, when the system has not seen the camera wearer even once during the training, we show that it is still possible to identify that the two egocentric videos have been captured by the same wearer with an Equal Error Rate (EER) of less than 14.35%. (3) We show that it is possible to extract gait signature even if only sparse optical flow and no other scene information from egocentric video is available. We demonstrate the accuracy of more than 84% for wearer recognition with only global optical flow. (4) While the first person to first person matching does not give us access to the wearer’s face, we show that it is possible to match the extracted gait features against the one obtained from a third person view such as a surveillance camera looking at the wearer in a completely different background at a different time. In essence, our work indicates that sharing one’s egocentric video should be treated as giving away one’s biometric identity and recommend much more oversight before sharing of egocentric videos. The code, trained models, and the datasets and their annotations are available at https://egocentricbiometric.github.io/"



Paperid:759
Authors:Danna Gurari, Yinan Zhao, Meng Zhang, Nilavra Bhattacharya
Title: Captioning Images Taken by People Who Are Blind
Abstract:
While an important problem in the vision community is to design algorithms that can automatically caption images, few publicly-available datasets for algorithm development directly address the interests of real users. Observing that people who are blind have relied on (human-based) image captioning services to learn about images they take for nearly a decade, we introduce the first image captioning dataset to represent this real use case. This new dataset, which we call VizWiz-Captions, consists of over 39,000 images originating from people who are blind that are each paired with five captions. We analyze this dataset to (1) characterize the typical captions, (2) characterize the diversity of content found in the images, and (3) compare its content to that found in eight popular vision datasets. We also analyze modern image captioning algorithms to identify what makes this new dataset challenging for the vision community. We publicly share the dataset with captioning challenge instructions at https://vizwiz.org."



Paperid:760
Authors:Xiangtai Li, Xia Li, Li Zhang, Guangliang Cheng, Jianping Shi, Zhouchen Lin, Shaohua Tan, Yunhai Tong
Title: Improving Semantic Segmentation via Decoupled Body and Edge Supervision
Abstract:
Existing semantic segmentation approaches either aim to improve objects' inner consistency by modeling the global context, or refine object details along their boundaries by multi-scale feature fusion. In this paper, a new paradigm for semantic segmentation is proposed. Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the high and low frequency of the image. To do so, we first warp the image feature by learning a flow field to make the object part more consistent. The resulting body feature and the residual edge feature are further optimized under decoupled supervision by explicitly sampling different parts (body or edge) pixels. We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries. Extensive experiments on four major road scene semantic segmentation benchmarks including Cityscapes, CamVid, KITTI and BDD show that our proposed approach establishes a new state of the art while retaining high efficiency in inference. In particular, we achieve 83.7% mIoU on Cityscapes with only fine-annotated data. Code and models are made available to foster any further research (https://github.com/lxtGH/DecoupleSegNets)."



Paperid:761
Authors:Jerry Liu, Shenlong Wang, Wei-Chiu Ma, Meet Shah, Rui Hu, Pranaab Dhawan, Raquel Urtasun
Title: Conditional Entropy Coding for Efficient Video Compression
Abstract:
We propose a very simple and efficient video compression framework that only focuses on modeling the conditional entropy between frames. Unlike prior learning-based approaches, we reduce complexity by not performing any form of explicit transformations between frames and assume each frame is encoded with an independent state-of-the-art deep image compressor. We first show that a simple architecture modeling the entropy between the image latent codes is as competitive as other neural video compression works and video codecs while being much faster and easier to implement. We then propose a novel internal learning extension on top of this architecture that brings an additional 10% bitrate savings without trading off decoding speed. Importantly, we show that our approach outperforms H.265 and other deep learning baselines in MS-SSIM on higher bitrate UVG video, and against all video codecs on lower framerates, while being thousands of times faster in decoding than deep models utilizing an autoregressive entropy model. "
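The core idea above, pricing a frame's latent code by its conditional entropy given the previous frame, can be illustrated with a small sketch. The PyTorch code below is not the authors' model; the `ConditionalEntropyModel` class, the latent shapes, the 256-symbol alphabet, and the plain convolutional predictor are all illustrative assumptions. It only shows how a conditional distribution over discrete latent symbols yields an estimated bitrate.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalEntropyModel(nn.Module):
    """Predicts p(z_t | z_{t-1}) over discrete latent symbols (hypothetical sketch)."""

    def __init__(self, num_symbols=256, channels=64):
        super().__init__()
        self.num_symbols = num_symbols
        self.channels = channels
        self.net = nn.Sequential(
            nn.Conv2d(channels, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, channels * num_symbols, 1),
        )

    def forward(self, z_prev):
        # Crudely feed the previous frame's integer symbols as floats and output
        # per-position logits over the symbol alphabet for the current frame.
        logits = self.net(z_prev.float())
        b, _, h, w = logits.shape
        return logits.view(b, self.channels, self.num_symbols, h, w)

def estimated_bits(model, z_prev, z_curr):
    """Cross-entropy of the true current latent under the conditional model, in bits."""
    log_probs = F.log_softmax(model(z_prev), dim=2)    # (B, C, S, H, W)
    picked = log_probs.gather(2, z_curr.unsqueeze(2))  # log-prob of the true symbols
    return -picked.sum() / math.log(2.0)               # nats -> bits

# Toy usage with random quantized latents (one frame pair, 64 channels, 8x8 grid).
z_prev = torch.randint(0, 256, (1, 64, 8, 8))
z_curr = torch.randint(0, 256, (1, 64, 8, 8))
bits = estimated_bits(ConditionalEntropyModel(), z_prev, z_curr)
```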



Paperid:762
Authors:Yushuo Guan, Pengyu Zhao, Bingxuan Wang, Yuanxing Zhang, Cong Yao, Kaigui Bian, Jian Tang
Title: Differentiable Feature Aggregation Search for Knowledge Distillation
Abstract:
Knowledge distillation has become increasingly important in model compression. It boosts the performance of a miniaturized student network with the supervision of the output distribution and feature maps from a sophisticated teacher network. Some recent works introduce multi-teacher distillation to provide more supervision to the student network. However, the effectiveness of multi-teacher distillation methods comes with costly computational resources. To address both the efficiency and the effectiveness of knowledge distillation, we introduce feature aggregation to imitate multi-teacher distillation in the single-teacher distillation framework by extracting informative supervision from multiple teacher feature maps. Specifically, we introduce DFA, a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search, to efficiently find the aggregations. In the first stage, DFA formulates the searching problem as a bi-level optimization and leverages a novel bridge loss, which consists of a student-to-teacher path and a teacher-to-student path, to find appropriate feature aggregations. The two paths act as two players against each other, trying to optimize the unified architecture parameters in opposite directions while guaranteeing both expressivity and learnability of the feature aggregation simultaneously. In the second stage, DFA performs knowledge distillation with the derived feature aggregation. Experimental results show that DFA outperforms existing methods on the CIFAR-100 and CINIC-10 datasets under various teacher-student settings, verifying the effectiveness and robustness of the design."



Paperid:763
Authors:Shashanka Venkataramanan, Kuan-Chuan Peng, Rajat Vikram Singh, Abhijit Mahalanobis
Title: Attention Guided Anomaly Localization in Images
Abstract:
Anomaly localization is an important problem in computer vision which involves localizing anomalous regions within images, with applications in industrial inspection, surveillance, and medical imaging. This task is challenging due to the small sample size and pixel coverage of the anomaly in real-world scenarios. Most prior works need to use anomalous training images to compute a class-specific threshold to localize anomalies. Without the need for anomalous training images, we propose the Convolutional Adversarial Variational autoencoder with Guided Attention (CAVGA), which localizes the anomaly with a convolutional latent variable to preserve the spatial information. In the unsupervised setting, we propose an attention expansion loss where we encourage CAVGA to focus on all normal regions in the image. Furthermore, in the weakly-supervised setting, we propose a complementary guided attention loss, where we encourage the attention map to focus on all normal regions while minimizing the attention map corresponding to anomalous regions in the image. CAVGA outperforms the state-of-the-art (SOTA) anomaly localization methods on the MVTec Anomaly Detection (MVTAD), modified ShanghaiTech Campus (mSTC) and Large-scale Attention based Glaucoma (LAG) datasets in the unsupervised setting and when using only 2% anomalous images in the weakly-supervised setting. CAVGA also outperforms SOTA anomaly detection methods on the MNIST, CIFAR-10, Fashion-MNIST, MVTAD, mSTC and LAG datasets."
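As a rough illustration of the attention expansion idea (encouraging attention over normal images to cover the whole image), here is a minimal PyTorch sketch. The per-image min-max normalization and the exact penalty are assumptions made for this sketch, not the paper's precise loss.

```python
import torch

def attention_expansion_loss(attn, eps=1e-8):
    """Penalize low attention anywhere on a normal image (illustrative sketch).

    attn: (B, 1, H, W) attention maps; rescaled to [0, 1] per image here.
    """
    flat = attn.flatten(1)
    lo = flat.min(dim=1, keepdim=True).values
    hi = flat.max(dim=1, keepdim=True).values
    attn01 = (flat - lo) / (hi - lo + eps)
    return (1.0 - attn01).mean()  # zero only when every location is fully attended

# Toy usage on a batch of 4 attention maps.
loss = attention_expansion_loss(torch.rand(4, 1, 14, 14))
```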



Paperid:764
Authors:Jiangliu Wang, Jianbo Jiao, Yun-Hui Liu
Title: Self-supervised Video Representation Learning by Pace Prediction
Abstract:
This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction. It stems from the observation that the human visual system is sensitive to video pace, e.g., slow motion, a widely used technique in film making. Specifically, given a video played at natural pace, we randomly sample training clips at different paces and ask a neural network to identify the pace for each video clip. The assumption here is that the network can only succeed in such a pace reasoning task when it understands the underlying video content and learns representative spatio-temporal features. In addition, we further introduce contrastive learning to push the model towards discriminating different paces by maximizing the agreement on similar video content. To validate the effectiveness of the proposed method, we conduct extensive experiments on action recognition and video retrieval tasks with several alternative network architectures. Experimental evaluations show that our approach achieves state-of-the-art performance for self-supervised video representation learning across different network architectures and different benchmarks. The code and pre-trained models are available at https://github.com/laura-wang/video-pace."
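The pretext task lends itself to a compact sketch: sample fixed-length clips at different frame strides ("paces") and label each clip with its pace class, which a 3D CNN would then be trained to predict. The NumPy code below is an illustrative data-generation sketch; the clip length, pace set, and function name are assumptions, not the released implementation.

```python
import numpy as np

def sample_pace_clips(video, clip_len=16, paces=(1, 2, 4, 8), rng=None):
    """Return (clip, pace_label) pairs for the pace-prediction pretext task.

    video: array of shape (T, H, W, C); each clip keeps every `pace`-th frame.
    """
    rng = rng or np.random.default_rng()
    samples = []
    for label, pace in enumerate(paces):
        span = clip_len * pace
        if video.shape[0] < span:
            continue  # this video is too short for the current pace
        start = rng.integers(0, video.shape[0] - span + 1)
        clip = video[start:start + span:pace]  # (clip_len, H, W, C)
        samples.append((clip, label))
    return samples

# Toy usage: a random "video" of 128 frames; a network is then trained to
# classify the pace label of each sampled clip.
video = np.random.rand(128, 64, 64, 3).astype(np.float32)
clips_and_labels = sample_pace_clips(video)
```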



Paperid:765
Authors:Chris Rockwell, David F. Fouhey
Title: Full-Body Awareness from Partial Observations
Abstract:
There has been great progress in human 3D mesh recovery and great interest in learning about the world from consumer video data. Unfortunately current methods for 3D human mesh recovery work rather poorly on consumer video data, since on the Internet, unusual camera viewpoints and aggressive truncations are the norm rather than a rarity. We study this problem and make a number of contributions to address it: (i) we propose a simple but highly effective self-training framework that adapts human 3D mesh recovery systems to consumer videos and demonstrate its application to two recent systems; (ii) we introduce evaluation protocols and keypoint annotations for 13K frames across four consumer video datasets for studying this task, including evaluations on out-of-image keypoints; and (iii) we show that our method substantially improves PCK and human-subject judgments compared to baselines, both on test videos from the dataset it was trained on, as well as on three other datasets without further adaptation."



Paperid:766
Authors:Lijie Liu, Chufan Wu, Jiwen Lu, Lingxi Xie, Jie Zhou, Qi Tian
Title: Reinforced Axial Refinement Network for Monocular 3D Object Detection
Abstract:
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image. This is an ill-posed problem with a major difficulty lying in the information loss by depth-agnostic cameras. Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in the 3D space. To improve the efficiency of sampling, we propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step. This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it. The proposed framework, Reinforced Axial Refinement Network (RAR-Net), serves as a post-processing stage which can be freely integrated into existing monocular 3D detection methods, and improves the performance on the KITTI dataset with small extra computational costs."



Paperid:767
Authors:Ehsan Elhamifar, Dat Huynh
Title: Self-Supervised Multi-Task Procedure Learning from Instructional Videos
Abstract:
We address the problem of unsupervised procedure learning from instructional videos of multiple tasks using Deep Neural Networks (DNNs). Unlike existing works, we assume that training videos come from multiple tasks without key-step annotations or grammars, and the goals are to classify a test video to the underlying task and to localize its key-steps. Our DNN learns task-dependent attention features from informative regions of each frame without ground-truth bounding boxes and learns to discover and localize key-steps without key-step annotations by using an unsupervised subset selection module as a teacher. It also learns to classify an input video using the discovered key-steps using a learnable key-step feature pooling mechanism that extracts and learns to combine key-step based features for task recognition. By experiments on two instructional video datasets, we show the effectiveness of our method for unsupervised localization of procedure steps and video classification."



Paperid:768
Authors:Yann Labbé, Justin Carpentier, Mathieu Aubry, Josef Sivic
Title: CosyPose: Consistent multi-view multi-object 6D pose estimation
Abstract:
We introduce an approach for recovering the 6D pose of multiple known objects in a scene captured by a set of input images with unknown camera viewpoints. First, we present a single-view single-object 6D pose estimation method, which we use to generate 6D object pose hypotheses. Second, we develop a robust method for matching individual 6D object pose hypotheses across different input images in order to jointly estimate camera viewpoints and 6D poses of all objects in a single consistent scene. Our approach explicitly handles object symmetries, does not require depth measurements, is robust to missing or incorrect object hypotheses, and automatically recovers the number of objects in the scene. Third, we develop a method for global scene refinement given multiple object hypotheses and their correspondences across views. This is achieved by solving an object-level bundle adjustment problem that refines the poses of cameras and objects to minimize the reprojection error in all views. We demonstrate that the proposed method, dubbed CosyPose, outperforms current state-of-the-art results for single-view and multi-view 6D object pose estimation by a large margin on two challenging benchmarks: the YCB-Video and T-LESS datasets. Code and pre-trained models are available on the project webpage."



Paperid:769
Authors:Jiapeng Zhu, Yujun Shen, Deli Zhao, Bolei Zhou
Title: In-Domain GAN Inversion for Real Image Editing
Abstract:
Recent work has shown that a variety of semantics emerge in the latent space of Generative Adversarial Networks (GANs) when being trained to synthesize images. However, it is difficult to use these learned semantics for real image editing. A common practice of feeding a real image to a trained GAN generator is to invert it back to a latent code. However, existing inversion methods typically focus on reconstructing the target image by pixel values yet fail to land the inverted code in the semantic domain of the original latent space. As a result, the reconstructed image cannot well support semantic editing through varying the inverted code. To solve this problem, we propose an in-domain GAN inversion approach, which not only faithfully reconstructs the input image but also ensures that the inverted code is semantically meaningful for editing. We first learn a novel domain-guided encoder to project a given image to the native latent space of GANs. We then propose domain-regularized optimization by involving the encoder as a regularizer to fine-tune the code produced by the encoder and better recover the target image. Extensive experiments suggest that our inversion method achieves satisfying real image reconstruction and, more importantly, facilitates various image editing tasks, significantly outperforming the state of the art."



Paperid:770
Authors:Yuexi Zhang, Yin Wang, Octavia Camps, Mario Sznaier
Title: Key Frame Proposal Network for Efficient Pose Estimation in Videos
Abstract:
Human pose estimation in video relies on local information by either estimating each frame independently or tracking poses across frames. In this paper, we propose a novel method combining local approaches with global context. We introduce a lightweight, unsupervised key-frame proposal network (K-FPN) to select informative frames and a learned dictionary to recover the entire pose sequence from these frames. The K-FPN speeds up the pose estimation and provides robustness to bad frames with occlusion, motion blur, and illumination changes, while the learned dictionary provides global dynamic context. Experiments on the Penn Action and sub-JHMDB datasets show that the proposed method achieves state-of-the-art accuracy, with substantial speed-up."
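The recovery step (reconstructing the full pose sequence from a few key frames via a learned dictionary) can be illustrated in isolation. In the NumPy sketch below the temporal dictionary is a random placeholder and the key-frame indices are fixed by hand; in K-FPN both the frame selection and the dictionary are learned.

```python
import numpy as np

def recover_sequence(dictionary, key_idx, key_poses):
    """Recover a full pose sequence from key-frame poses via a temporal dictionary.

    dictionary: (T, K) temporal basis, key_idx: indices of the selected key frames,
    key_poses: (len(key_idx), D) pose vectors observed at those frames.
    """
    D_key = dictionary[key_idx]                       # basis rows at the key frames
    coeffs, *_ = np.linalg.lstsq(D_key, key_poses, rcond=None)
    return dictionary @ coeffs                        # (T, D) full-sequence estimate

# Toy usage: T=40 frames, K=8 basis atoms, D=26 pose dimensions (13 joints x 2).
rng = np.random.default_rng(0)
dictionary = rng.standard_normal((40, 8))
poses = dictionary @ rng.standard_normal((8, 26))     # a sequence the basis can represent
key_idx = np.array([0, 5, 11, 17, 23, 29, 34, 39])
recovered = recover_sequence(dictionary, key_idx, poses[key_idx])
# `recovered` matches `poses` up to numerical error for this in-span toy sequence.
```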



Paperid:771
Authors:Yuki Saito, Takuma Nakamura, Hirotaka Hachiya, Kenji Fukumizu
Title: Exchangeable Deep Neural Networks for Set-to-Set Matching and Learning
Abstract:
Matching two different sets of items, known as the heterogeneous set-to-set matching problem, has recently received attention as a promising problem. The difficulties lie in extracting features that match a correct pair of different sets while preserving the two types of exchangeability required for set-to-set matching: the pair of sets, as well as the items in each set, should be exchangeable. In this study, we propose a novel deep learning architecture to address the abovementioned difficulties, together with an efficient training framework for set-to-set matching. We evaluate the methods through experiments based on two industrial applications: fashion set recommendation and group re-identification. In these experiments, we show that the proposed method provides significant improvements over state-of-the-art methods, thereby validating our architecture for the heterogeneous set matching problem."



Paperid:772
Authors:Robin Rombach, Patrick Esser, Björn Ommer
Title: Making Sense of CNNs: Interpreting Deep Representations & Their Invariances with INNs
Abstract:
To tackle increasingly complex tasks, it has become an essential ability of neural networks to learn abstract representations. These task-specific representations and, particularly, the invariances they capture turn neural networks into black box models that lack interpretability. To open such a black box, it is, therefore, crucial to uncover the different semantic concepts a model has learned as well as those that it has learned to be invariant to. We present an approach based on INNs that (i) recovers the task-specific, learned invariances by disentangling the remaining factor of variation in the data and that (ii) invertibly transforms these recovered invariances combined with the model representation into an equally expressive one with accessible semantic concepts. As a consequence, neural network representations become understandable by providing the means to (i) expose their semantic meaning, (ii) semantically modify a representation, and (iii) visualize individual learned semantic concepts and invariances. Our invertible approach significantly extends the abilities to understand black box models by enabling post-hoc interpretations of state-of-the-art networks without compromising their performance."



Paperid:773
Authors:Gongyang Li, Zhi Liu, Linwei Ye, Yang Wang, Haibin Ling
Title: Cross-Modal Weighting Network for RGB-D Salient Object Detection
Abstract:
Depth maps contain geometric clues for assisting Salient Object Detection (SOD). In this paper, we propose a novel Cross-Modal Weighting (CMW) strategy to encourage comprehensive interactions between RGB and depth channels for RGB-D SOD. Specifically, three RGB-depth interaction modules, named CMW-L, CMW-M and CMW-H, are developed to deal with low-, middle- and high-level cross-modal information fusion, respectively. These modules use Depth-to-RGB Weighting (DW) and RGB-to-RGB Weighting (RW) to allow rich cross-modal and cross-scale interactions among feature layers generated by different network blocks. To effectively train the proposed Cross-Modal Weighting Network (CMWNet), we design a composite loss function that summarizes the errors between intermediate predictions and ground truth over different scales. With all these novel components working together, CMWNet effectively fuses information from RGB and depth channels, and meanwhile explores object localization and details across scales. Thorough evaluations demonstrate that CMWNet consistently outperforms 15 state-of-the-art RGB-D SOD methods on seven popular benchmarks."



Paperid:774
Authors:Rui Shao, Pramuditha Perera, Pong C. Yuen, Vishal M. Patel
Title: Open-set Adversarial Defense
Abstract:
Open-set recognition and adversarial defense study two key aspects of deep learning that are vital for real-world deployment. The objective of open-set recognition is to identify samples from open-set classes during testing, while adversarial defense aims to defend the network against images with imperceptible adversarial perturbations. In this paper, we show that open-set recognition systems are vulnerable to adversarial attacks. Furthermore, we show that adversarial defense mechanisms trained on known classes do not generalize well to open-set samples. Motivated by this observation, we emphasize the need for an Open-Set Adversarial Defense (OSAD) mechanism. This paper proposes an Open-Set Defense Network (OSDN) as a solution to the OSAD problem. The proposed network uses an encoder with feature-denoising layers coupled with a classifier to learn a noise-free latent feature representation. Two techniques are employed to obtain an informative latent feature space with the objective of improving open-set performance. First, a decoder is used to ensure that clean images can be reconstructed from the obtained latent features. Then, self-supervision is used to ensure that the latent features are informative enough to carry out an auxiliary task. We introduce a testing protocol to evaluate OSAD performance and show the effectiveness of the proposed method on multiple object classification datasets. The implementation code of the proposed method is available at https://github.com/rshaojimmy/ECCV2020-OSAD."



Paperid:775
Authors:Sharon Ayzik, Shai Avidan
Title: Deep Image Compression using Decoder Side Information
Abstract:
We present a Deep Image Compression neural network that relies on side information, which is only available to the decoder. We base our algorithm on the assumption that the image available to the encoder and the image available to the decoder are correlated, and we let the network learn these correlations in the training phase. Then, at run time, the encoder side encodes the input image without knowing anything about the decoder side image and sends it to the decoder. The decoder then uses the encoded input image and the side information image to reconstruct the original image. This problem is known as Distributed Source Coding (DSC) in Information Theory, and we discuss several use cases for this technology. We compare our algorithm to several image compression algorithms and show that adding decoder-only side information does indeed improve results."



Paperid:776
Authors:Jeevan Devaranjan, Amlan Kar, Sanja Fidler
Title: Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation
Abstract:
Generation of synthetic data has allowed Machine Learning practitioners to bypass the need for costly collection and labeling of large datasets. Unfortunately the generation of such data often requires experts to carefully design sampling procedures that guarantee creation of realistic scenes. These sampling procedures typically require experts to specify certain structural aspects such as scene layout information. This is often hard to design, as scenes are generally highly complex and diverse. In this paper, we propose a generative model of synthetic scenes that reduces the distribution gap between the scene structure of generated scenes and a real target image dataset. Importantly, since labeling scene structures for real images is incredibly cumbersome, our method operates without any ground truth structure information for real data. Experiments on two synthetic datasets and a real driving dataset show that the method successfully bridges the distribution gap of discrete structural features between real and generated images."



Paperid:777
Authors:Ahmed Taha, Xitong Yang, Abhinav Shrivastava, Larry Davis
Title: A Generic Visualization Approach for Convolutional Neural Networks
Abstract:
Retrieval networks are essential for searching and indexing. Compared to classification networks, attention visualization for retrieval networks is hardly studied. We formulate attention visualization as a constrained optimization problem. We leverage the unit L2-norm constraint as an attention filter (L2-CAF) to localize attention in both classification and retrieval networks. Unlike recent literature, our approach requires neither architectural changes nor fine-tuning. Thus, a pre-trained network's performance is never undermined. L2-CAF is quantitatively evaluated using weakly supervised object localization. State-of-the-art results are achieved on classification networks. For retrieval networks, significant improvement margins are achieved over a Grad-CAM baseline. Qualitative evaluation demonstrates how the L2-CAF visualizes attention per frame for a recurrent retrieval network. Further ablation studies highlight the computational cost of our approach and compare L2-CAF with other feasible alternatives. Code available at https://bit.ly/3iDBLFv"
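To make the constrained-optimization formulation concrete, here is a minimal PyTorch sketch: a spatial filter with unit L2 norm is optimized over a frozen feature map so that the masked features reproduce the network's original score, and the resulting filter serves as an attention map. The toy linear scoring head, the step count, and the projection-based handling of the norm constraint are assumptions of this sketch, not the paper's exact procedure.

```python
import torch

def l2_attention_filter(feat, score_fn, steps=100, lr=0.1):
    """Optimize a unit-L2-norm spatial filter so masked features keep the original score.

    feat: frozen feature map (C, H, W); score_fn: maps a feature map to a scalar score.
    """
    target = score_fn(feat).detach()
    h, w = feat.shape[1:]
    filt = torch.full((h, w), 1.0 / (h * w) ** 0.5, requires_grad=True)
    opt = torch.optim.SGD([filt], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        masked = feat * filt                       # broadcast the filter over channels
        loss = (score_fn(masked) - target) ** 2
        loss.backward()
        opt.step()
        with torch.no_grad():                      # project back onto the unit L2 sphere
            filt /= filt.norm() + 1e-8
    return filt.detach()

# Toy usage: random 256-channel 7x7 features and a fixed linear scoring head.
feat = torch.randn(256, 7, 7)
w_head = torch.randn(256)
score_fn = lambda f: (f.mean(dim=(1, 2)) * w_head).sum()
attention = l2_attention_filter(feat, score_fn)
```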



Paperid:778
Authors:Tianchang Shen, Jun Gao, Amlan Kar, Sanja Fidler
Title: Interactive Annotation of 3D Object Geometry using 2D Scribbles
Abstract:
Inferring detailed 3D geometry of the scene is crucial for robotics applications, simulation, and 3D content creation. However, such information is hard to obtain, and thus very few datasets support it. In this paper, we propose an interactive framework for annotating 3D object geometry from both point cloud data and RGB imagery. The key idea behind our approach is to exploit strong priors that humans have about the 3D world in order to interactively annotate complete 3D shapes. Our framework targets a wide pool of annotators, i.e. naive users without artistic or graphics expertise. In particular, we introduce two simple-to-use interaction modules. First, we make an automatic guess of the 3D shape and allow the user to provide feedback about large errors by drawing scribbles in desired 2D views. Next, we aim to correct minor errors, in which users drag and drop 3D mesh vertices, assisted by a neural interactive module implemented as a Graph Convolutional Network. Experimentally, we show that only a few user interactions are needed to produce good quality 3D shapes on popular benchmarks such as ShapeNet, Pix3D, and ScanNet. We implement our framework as a web service and conduct a user study, where we show that user annotated data using our method effectively facilitates real-world learning tasks. Our web service will be released. "



Paperid:779
Authors:Georgios Georgakis, Ren Li, Srikrishna Karanam, Terrence Chen, Jana Košecká, Ziyan Wu
Title: Hierarchical Kinematic Human Mesh Recovery
Abstract:
We consider the problem of estimating a parametric model of 3D human mesh from a single image. While there has been substantial recent progress in this area with direct regression of model parameters, these methods only implicitly exploit the human body kinematic structure, leading to sub-optimal use of the model prior. In this work, we address this gap by proposing a new technique for regression of human parametric model that is explicitly informed by the known hierarchical structure, including joint interdependencies of the model. This results in a strong prior-informed design of the regressor architecture and an associated hierarchical optimization that is flexible to be used in conjunction with the current standard frameworks for 3D human mesh recovery. We demonstrate these aspects by means of extensive experiments on standard benchmark datasets, showing how our proposed new design outperforms several existing and popular methods, establishing new state-of-the-art results. By considering joint interdependencies, our method is equipped to infer joints even under data corruptions, which we demonstrate by conducting experiments under varying degrees of occlusion."



Paperid:780
Authors:Jae-Han Lee, Chang-Su Kim
Title: Multi-Loss Rebalancing Algorithm for Monocular Depth Estimation
Abstract:
An algorithm to combine multiple loss terms adaptively for training a monocular depth estimator is proposed in this work. We construct a loss function space containing tens of losses. Using more losses can improve inference capability without any additional complexity in the test phase. However, when many losses are used, some of them may be neglected during training. Also, since each loss decreases at a different speed, adaptive weighting is required to balance the contributions of the losses. To address these issues, we propose the loss rebalancing algorithm that initializes and rebalances the weight for each loss function adaptively in the course of training. Experimental results show that the proposed algorithm provides state-of-the-art depth estimation results on various datasets. Codes are available at https://github.com/jaehanlee-mcl/multi-loss-rebalancing-depth."
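To illustrate why adaptive weighting is needed when loss terms decrease at different speeds, here is a small, purely illustrative rebalancing rule in Python: terms whose relative decrease since the last rebalancing step is smaller receive proportionally more weight. This specific rule is an assumption made for the sketch and is not the rebalancing algorithm proposed in the paper.

```python
def rebalance_weights(weights, prev_losses, curr_losses, eps=1e-8):
    """Give more weight to loss terms that are decreasing more slowly (illustrative rule).

    weights, prev_losses, curr_losses: dicts keyed by loss name.
    """
    # Relative progress of each term since the last rebalancing step.
    progress = {k: (prev_losses[k] - curr_losses[k]) / (prev_losses[k] + eps)
                for k in weights}
    # Slow-improving (or stalled) terms get proportionally larger raw weights.
    raw = {k: weights[k] / (max(progress[k], 0.0) + eps) for k in weights}
    total = sum(raw.values())
    # Renormalize so the weights sum to the number of terms, as if starting from all ones.
    return {k: len(weights) * v / total for k, v in raw.items()}

# Toy usage with three loss terms measured at two consecutive rebalancing steps.
weights = {"l1": 1.0, "grad": 1.0, "normal": 1.0}
prev = {"l1": 0.80, "grad": 0.50, "normal": 0.30}
curr = {"l1": 0.70, "grad": 0.48, "normal": 0.29}
weights = rebalance_weights(weights, prev, curr)
```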



Paperid:781
Authors:Marc Badger, Yufu Wang, Adarsh Modh, Ammon Perkes, Nikos Kolotouros , Bernd G. Pfrommer, Marc F. Schmidt, Kostas Daniilidis
Title: 3D Bird Reconstruction: a Dataset, Model, and Shape Recovery from a Single View
Abstract:
Automated capture of animal pose is transforming how we study neuroscience and social behavior. Movements carry important social cues, but current methods are not able to robustly estimate pose and shape of animals, particularly for social animals such as birds, which are often occluded by each other and objects in the environment. To address this problem, we first introduce a model and multi-view optimization approach, which we use to capture the unique shape and pose space displayed by live birds. We then introduce a pipeline and experiments for keypoint, mask, pose, and shape regression that recovers accurate avian postures from single views. Finally, we provide extensive multi-view keypoint and mask annotations collected from a group of 15 social birds housed together in an outdoor aviary. The project website with videos, results, code, mesh model, and the Penn Aviary Dataset can be found at https://marcbadger.github.io/avian-mesh."



Paperid:782
Authors:Alex Andonian, Camilo Fosco, Mathew Monfort, Allen Lee, Rogerio Feris, Carl Vondrick, Aude Oliva
Title: We Have So Much In Common: Modeling Semantic Relational Set Abstractions in Videos
Abstract:
Identifying common patterns among events is a key capability for human and machine perception, as it underlies intelligent decision making. Here, we propose an approach for learning semantic relational set abstractions on videos, inspired by human learning. Our model combines visual features as input with natural language supervision to generate high-level representations of similarities across a set of videos. This allows our model to perform cognitive tasks such as set abstraction (what is in common among a set of videos?), set completion (which new video goes well with the set?), and odd one out detection (which video does not belong in the set?). Experiments on two video benchmarks, Kinetics and Multi-Moments in Time, show that robust and versatile representations emerge when learning to recognize commonalities among sets. We compare our model to several baseline algorithms and show that significant improvements result from explicitly learning relational abstractions with semantic supervision."



Paperid:783
Authors:Samuel Zeitvogel, Johannes Dornheim, Astrid Laubenheimer
Title: Joint Optimization for Multi-Person Shape Models from Markerless 3D-Scans
Abstract:
We propose a markerless end-to-end training framework for parametric 3D human shape models. The training of statistical 3D human shape models with minimal supervision is an important problem in computer vision. Contrary to prior work, the whole training process (i) uses a differentiable shape model surface and (ii) is trained end-to-end by jointly optimizing all parameters of a single, self-contained objective that can be solved with slightly modified off-the-shelf non-linear least squares solvers. The training process only requires a compact model definition and an off-the-shelf 2D RGB pose estimator. No pre-trained shape models are required. For training (iii) a medium-sized dataset of approximately 1000 low-resolution human body scans is sufficient to achieve competitive performance on the challenging FAUST surface correspondence benchmark. The training and evaluation code will be made available for research purposes to facilitate end-to-end shape model training on novel datasets with minimal setup cost."



Paperid:784
Authors:Wei Ji, Jingjing Li, Miao Zhang, Yongri Piao, Huchuan Lu
Title: Accurate RGB-D Salient Object Detection via Collaborative Learning
Abstract:
Benefiting from the spatial cues embedded in depth images, recent progress on RGB-D saliency detection shows impressive ability on some challenging scenarios. However, there are still two limitations. On the one hand, the pooling and upsampling operations in FCNs might cause blurred object boundaries. On the other hand, using an additional depth network to extract depth features might lead to high computation and storage costs. The reliance on depth inputs during testing also limits the practical applications of current RGB-D models. In this paper, we propose a novel collaborative learning framework where edge, depth and saliency are leveraged in a more efficient way, which solves these problems tactfully. The explicitly extracted edge information goes together with saliency to give more emphasis to the salient regions and object boundaries. Depth and saliency learning is innovatively integrated into the high-level feature learning process in a mutual-benefit manner. This strategy enables the network to be free of using extra depth networks and depth inputs to make inference. To this end, it makes our model more lightweight, faster and more versatile. Experimental results on seven benchmark datasets show its superior performance."



Paperid:785
Authors:David Griffiths, Jan Boehm, Tobias Ritschel
Title: Finding Your (3D) Center: 3D Object Detection Using a Learned Loss
Abstract:
Massive semantically labeled datasets are readily available for 2D images; however, they are much harder to achieve for 3D scenes. Objects in 3D repositories like ShapeNet are labeled, but regrettably only in isolation, so without context. 3D scenes can be acquired by range scanners on a city-level scale, but much fewer come with semantic labels. Addressing this disparity, we introduce a new optimization procedure, which allows training for 3D detection with raw 3D scans while using as little as 5% of the object labels and still achieving comparable performance. Our optimization uses two networks. A scene network maps an entire 3D scene to a set of 3D object centers. As we assume the scene not to be labeled by centers, no classic loss, such as the Chamfer distance, can be used to train it. Instead, we use another network to emulate the loss. This loss network is trained on a small labeled subset and maps a non-centered 3D object in the presence of distractions to its own center. This function is very similar to -- and hence can be used instead of -- the gradient the supervised loss would provide. Our evaluation documents competitive fidelity at a much lower level of supervision, and, respectively, higher quality at comparable supervision. Supplementary material can be found at: https://dgriffiths3.github.io."



Paperid:786
Authors:Ganlong Zhao, Guanbin Li, Ruijia Xu, Liang Lin
Title: Collaborative Training between Region Proposal Localization and Classification for Domain Adaptive Object Detection
Abstract:
Object detectors are usually trained with a large amount of labeled data, which is expensive and labor-intensive. Pre-trained detectors applied to an unlabeled dataset always suffer from differences in dataset distribution, also called domain shift. Domain adaptation for object detection tries to adapt the detector from labeled datasets to unlabeled ones for better performance. In this paper, we are the first to reveal that the region proposal network (RPN) and region proposal classifier (RPC) in the endemic two-stage detectors (e.g., Faster RCNN) demonstrate significantly different transferability when facing a large domain gap. The region classifier shows preferable performance but is limited without the RPN's high-quality proposals, while simple alignment in the backbone network is not effective enough for RPN adaptation. We delve into the consistency and the difference of RPN and RPC, treat them individually and leverage the high-confidence output of one as mutual guidance to train the other. Moreover, the samples with low confidence are used for discrepancy calculation between RPN and RPC and minimax optimization. Extensive experimental results on various scenarios have demonstrated the effectiveness of our proposed method in both domain-adaptive region proposal generation and object detection. Code is available at https://github.com/GanlongZhao/CST_DA_detection."



Paperid:787
Authors:Zudi Lin, Donglai Wei, Won-Dong Jang, Siyan Zhou, Xupeng Chen, Xueying Wang, Richard Schalek, Daniel Berger, Brian Matejek, Lee Kamentsky, Adi Peleg, Daniel Haehn, Thouis Jones, Toufiq Parag, Jeff Lichtman, Hanspeter Pfister
Title: Two Stream Active Query Suggestion for Active Learning in Connectomics
Abstract:
For large-scale vision tasks in biomedical images, the labeled data is often limited to train effective deep models. Active learning is a common solution, where a query suggestion method selects representative unlabeled samples for annotation, and the new labels are used to improve the base model. However, most query suggestion models optimize their learnable parameters only on the limited labeled data and consequently become less effective for the more challenging unlabeled data. To tackle this, we propose a two-stream active query suggestion approach. In addition to the supervised feature extractor, we introduce an unsupervised one optimized on all raw images to capture diverse image features, which can later be improved by fine-tuning on new labels. As a use case, we build an end-to-end active learning framework with our query suggestion method for 3D synapse detection and mitochondria segmentation in connectomics. With the framework, we curate, to our best knowledge, the largest connectomics dataset with dense synapses and mitochondria annotation. On this new dataset, our method outperforms previous state-of-the-art methods by 3.1% for synapse and 3.8% for mitochondria in terms of region-of-interest proposal accuracy. We also apply our method to image classification, where it outperforms previous approaches on CIFAR-10 under the same limited annotation budget. The project page is https://zudi-lin.github.io/projects/#two_stream_active."



Paperid:788
Authors:Jiahui Lei, Srinath Sridhar, Paul Guerrero, Minhyuk Sung, Niloy Mitra, Leonidas J. Guibas
Title: Pix2Surf: Learning Parametric 3D Surface Models of Objects from Images
Abstract:
We investigate the problem of learning to generate 3D parametric surface representations for novel object instances, as seen from one or more views. Previous work on learning shape reconstruction from multiple views uses discrete representations such as point clouds or voxels, while continuous surface generation approaches lack multi-view consistency. We address these issues by designing neural networks capable of generating high-quality parametric 3D surfaces which are also consistent between views. Furthermore, the generated 3D surfaces preserve accurate image pixel to 3D surface point correspondences, allowing us to lift texture information to reconstruct shapes with rich geometry and appearance. Our method is supervised and trained on a public dataset of shapes from common object categories. Quantitative results indicate that our method significantly outperforms previous work, while qualitative results demonstrate the high quality of our reconstructions."



Paperid:789
Authors:Mai Bui, Tolga Birdal, Haowen Deng, Shadi Albarqouni, Leonidas Guibas, Slobodan Ilic, Nassir Navab
Title: 6D Camera Relocalization in Ambiguous Scenes via Continuous Multimodal Inference
Abstract:
We present a multimodal camera relocalization framework that captures ambiguities and uncertainties with continuous mixture models defined on the manifold of camera poses. In highly ambiguous environments, which can easily arise due to symmetries and repetitive structures in the scene, computing one plausible solution (what most state-of-the-art methods currently regress) may not be sufficient. Instead we predict multiple camera pose hypotheses as well as the respective uncertainty for each prediction. Towards this aim, we use Bingham distributions, to model the orientation of the camera pose, and a multivariate Gaussian to model the position, with an end-to-end deep neural network. By incorporating a Winner-Takes-All training scheme, we finally obtain a mixture model that is well suited for explaining ambiguities in the scene, yet does not suffer from mode collapse, a common problem with mixture density networks. We introduce a new dataset specifically designed to foster camera localization research in ambiguous environments and exhaustively evaluate our method on synthetic as well as real data on both ambiguous scenes and on non-ambiguous benchmark datasets. We plan to release our code and dataset under multimodal3dvision.github.io."
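The Winner-Takes-All training scheme mentioned above can be shown in a few lines: every hypothesis is scored against the ground truth, but only the best one receives gradient, which keeps individual mixture components specialized instead of collapsing onto one mode. The squared-error hypothesis loss in this PyTorch sketch is a placeholder for the paper's Bingham/Gaussian likelihoods.

```python
import torch

def winner_takes_all_loss(hypotheses, target):
    """Backpropagate only through the best of several pose hypotheses.

    hypotheses: (B, M, D) predicted pose vectors, target: (B, D) ground truth.
    """
    per_hyp = ((hypotheses - target.unsqueeze(1)) ** 2).sum(dim=-1)  # (B, M)
    best = per_hyp.min(dim=1).values                                 # (B,)
    return best.mean()

# Toy usage: 4 hypotheses per sample for a 7-D pose (3-D position + quaternion).
hyp = torch.randn(2, 4, 7, requires_grad=True)
gt = torch.randn(2, 7)
loss = winner_takes_all_loss(hyp, gt)
loss.backward()  # gradient is nonzero only for the winning hypothesis of each sample
```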



Paperid:790
Authors:Hung-Yu Tseng, Matthew Fisher, Jingwan Lu, Yijun Li, Vladimir Kim, Ming-Hsuan Yang
Title: Modeling Artistic Workflows for Image Generation and Editing
Abstract:
People often create art by following an artistic workflow involving multiple stages that inform the overall design. If an artist wishes to modify an earlier decision, significant work may be required to propagate this new decision forward to the final artwork. Motivated by the above observations, we propose a generative model that follows a given artistic workflow, enabling both multi-stage image generation as well as multi-stage image editing of an existing piece of art. Furthermore, for the editing scenario, we introduce an optimization process along with learning-based regularization to ensure the edited image produced by the model closely aligns with the originally provided image. Qualitative and quantitative results on three different artistic datasets demonstrate the effectiveness of the proposed framework on both image generation and editing tasks."



Paperid:791
Authors:Sangpil Kim, Hyung-gun Chi, Xiao Hu, Qixing Huang, Karthik Ramani
Title: A Large-scale Annotated Mechanical Components Benchmark for Classification and Retrieval Tasks with Deep Neural Networks
Abstract:
We introduce a large-scale annotated mechanical components benchmark for classification and retrieval tasks, named the Mechanical Components Benchmark (MCB): a large-scale dataset of 3D objects of mechanical components. The dataset enables data-driven feature learning for mechanical components. Exploring the shape descriptor for mechanical components is essential to computer vision and manufacturing applications. However, not much attention has been given to creating annotated mechanical component datasets at a large scale. This is because acquiring 3D models is challenging and annotating mechanical components requires engineering knowledge. Our main contributions are the creation of a large-scale annotated mechanical component benchmark, defining a hierarchical taxonomy of mechanical components, and benchmarking the effectiveness of deep learning shape classifiers on the mechanical components. We created an annotated dataset and benchmarked seven state-of-the-art deep learning classification methods in three categories, namely: (1) point clouds, (2) volumetric representation in voxel grids, and (3) view-based representation."



Paperid:792
Authors:Jin Sun, Hadar Averbuch-Elor, Qianqian Wang, Noah Snavely
Title: Hidden Footprints: Learning Contextual Walkability from 3D Human Trails
Abstract:
Predicting where people can walk in a scene is important for many tasks, including autonomous driving systems and human behavior analysis. Yet learning a computational model for this purpose is challenging due to semantic ambiguity and a lack of labeled data: current datasets only have labels on where people are, not where they could be. We tackle this problem by leveraging information from existing datasets, without any additional labeling. We first augment the set of valid walkable regions by propagating person observations between images, utilizing 3D information and temporal coherence, leading to Hidden Footprints. We then design a training strategy that combines a class-balanced classification loss with a contextual adversarial loss to learn from sparse observations, thus obtaining a model that predicts a walkability map of a given scene. We evaluate our model on the Waymo and Cityscapes datasets, demonstrating superior performance against baselines and state-of-the-art models."



Paperid:793
Authors:Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman
Title: Self-Supervised Learning of Audio-Visual Objects from Video
Abstract:
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets. Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection."



Paperid:794
Authors:Yu Shen, Junbang Liang, Ming C. Lin
Title: GAN-based Garment Generation Using Sewing Pattern Images
Abstract:
The generation of realistic apparel models has become increasingly popular as a result of the rapid pace of change in fashion trends and the growing need for garment models in various applications such as virtual try-on. For such application requirements, it is important to have a general cloth model that can represent a diverse set of garments. Previous studies often make certain assumptions about the garment, such as its topology or the suited body shape. We propose a unified method using a generative network. Our model is applicable to different garment topologies with different sewing patterns and fabric materials. We also develop a novel image representation of garment models, and a reliable mapping algorithm between the general garment model and the image representation that can regularize the data representation of the cloth. Using this special intermediate image representation, the generated garment model can be easily retargeted to another body, enabling garment customization. In addition, a large garment appearance dataset is provided for use in garment reconstruction, garment capturing, and other applications. We demonstrate that our generative model has high reconstruction accuracy and can provide rich variations of virtual garments."



Paperid:795
Authors:Chaitanya Ahuja, Dong Won Lee, Yukiko I. Nakano, Louis-Philippe Morency
Title: Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach
Abstract:
How can we teach robots or virtual assistants to gesture naturally? Can we go further and adapt the gesturing style to follow a specific speaker? Gestures that are naturally timed with corresponding speech during human communication are called co-speech gestures. A key challenge, called gesture style transfer, is to learn a model that generates these gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'. A secondary goal is to simultaneously learn to generate co-speech gestures for multiple speakers while remembering what is unique about each speaker. We call this challenge style preservation. In this paper, we propose a new model, named Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures in an end-to-end manner. A novelty of Mix-StAGE is to learn a mixture of generative models which allows for conditioning on the unique gesture style of each speaker. As Mix-StAGE disentangles style and content of gestures, gesturing styles for the same input speech can be altered by simply switching the style embeddings. Mix-StAGE also allows for style preservation when learning simultaneously from multiple speakers. We also introduce a new dataset, Pose-Audio-Transcript-Style (PATS), designed to study gesture generation and style transfer. Our proposed Mix-StAGE model significantly outperforms the previous state-of-the-art approach for gesture generation and provides a path towards performing gesture style transfer across multiple speakers. Link to code, data, and videos: http://chahuja.com/mix-stage"



Paperid:796
Authors:Rui Huang, Wanyue Zhang, Abhijit Kundu, Caroline Pantofaru, David A Ross, Thomas Funkhouser, Alireza Fathi
Title: An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds
Abstract:
Detecting objects in 3D LiDAR data is a core technology for autonomous driving and other robotics applications. Although LiDAR data is acquired over time, most 3D object detection algorithms propose object bounding boxes independently for each frame and neglect the useful information available in the temporal domain. To address this problem, in this paper we propose a sparse LSTM-based multi-frame 3D object detection algorithm. We use a U-Net style 3D sparse convolution network to extract features for each frame's LiDAR point cloud. These features are fed to the LSTM module together with the hidden and memory features from the last frame to predict the 3D objects in the current frame, as well as hidden and memory features that are passed to the next frame. Experiments on the Waymo Open Dataset show that our algorithm outperforms the traditional frame-by-frame approach by 7.5% mAP@0.7 and other multi-frame approaches by 1.2%, while using less memory and computation per frame. To the best of our knowledge, this is the first work to use an LSTM for 3D object detection in sparse point clouds."
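The temporal wiring described above (per-frame backbone features fed to an LSTM together with the hidden and memory state carried over from the previous frame) is sketched below in PyTorch. The feature dimension, the use of a plain LSTMCell, and the linear stand-in for the 3D box head are illustrative assumptions; the actual detector operates on sparse voxel features.

```python
import torch
import torch.nn as nn

class TemporalDetector(nn.Module):
    """Carries hidden/memory state across LiDAR frames (illustrative wiring only)."""

    def __init__(self, feat_dim=128, hidden_dim=128, num_outputs=7):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, num_outputs)  # stand-in for a 3D box head

    def forward(self, frame_feats):
        """frame_feats: (T, B, feat_dim) per-frame features from a point-cloud backbone."""
        h = c = None
        outputs = []
        for feat in frame_feats:  # one step per LiDAR frame
            h, c = self.cell(feat) if h is None else self.cell(feat, (h, c))
            outputs.append(self.head(h))
        return torch.stack(outputs)  # (T, B, num_outputs)

# Toy usage: 5 frames, batch of 2.
model = TemporalDetector()
preds = model(torch.randn(5, 2, 128))
```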



Paperid:797
Authors:Tamar Loeub, Aviad Levis, Vadim Holodovsky, Yoav Y. Schechner
Title: Monotonicity Prior for Cloud Tomography
Abstract:
We introduce a differentiable monotonicity prior, useful to express signals of monotonic tendency. An important natural signal of this tendency is the optical extinction coefficient, as a function of altitude in a cloud. Cloud droplets become larger as vapor condenses on them in an updraft. Reconstruction of the volumetric structure of clouds is important for climate research. Data for such reconstruction is multi-view images taken simultaneously of each cloud. This acquisition mode is expected by upcoming future spaceborne imagers. We achieve three-dimensional volumetric reconstruction through stochastic scattering tomography, which is based on optimization of a cost function. Part of the cost is the monotonicity prior, which helps to improve the reconstruction quality. The stochastic tomography is based on Monte-Carlo radiative transfer. It is formulated and implemented in a coarse-to-fine form, making it scalable to large fields."



Paperid:798
Authors:Lezi Wang, Dong Liu, Rohit Puri, Dimitris N. Metaxas
Title: Learning Trailer Moments in Full-Length Movies with Co-Contrastive Attention
Abstract:
A movie's key moments stand out of the screenplay to grab an audience's attention and make movie browsing efficient. But a lack of annotations makes the existing approaches not applicable to movie key moment detection. To get rid of human annotations, we leverage the officially-released trailers as the weak supervision to learn a model that can detect the key moments from full-length movies. We introduce a novel ranking network that utilizes the Co-Attention between movies and trailers as guidance to generate the training pairs, where the moments highly correlated with trailers are expected to be scored higher than the uncorrelated moments. Additionally, we propose a Contrastive Attention module to enhance the feature representations such that the comparative contrast between features of the key and non-key moments is maximized. We construct the first movie-trailer dataset, and the proposed Co-Attention assisted ranking network shows superior performance even over the supervised approach (the term "supervised" refers to the approach with access to the manual ground-truth annotations for training). The effectiveness of our Contrastive Attention module is also demonstrated by the performance improvement over the state-of-the-art on the public benchmarks."
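The ranking objective (trailer-correlated moments should outscore uncorrelated ones) can be sketched with a standard margin ranking loss in PyTorch. The feature dimension, the two-layer scorer, and the margin value are assumptions of this sketch; the co-attention pair generation and the contrastive attention module are not shown.

```python
import torch
import torch.nn as nn

# A small scorer over moment features (dimensions are illustrative).
scorer = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))
rank_loss = nn.MarginRankingLoss(margin=0.2)

def trailer_ranking_loss(pos_feats, neg_feats):
    """Trailer-correlated moments (pos) should outscore uncorrelated ones (neg)."""
    s_pos = scorer(pos_feats).squeeze(-1)
    s_neg = scorer(neg_feats).squeeze(-1)
    target = torch.ones_like(s_pos)  # +1 means the first argument should rank higher
    return rank_loss(s_pos, s_neg, target)

# Toy usage: a batch of 8 (positive, negative) moment-feature pairs.
loss = trailer_ranking_loss(torch.randn(8, 512), torch.randn(8, 512))
loss.backward()
```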



Paperid:799
Authors:Christopher Thomas, Adriana Kovashka
Title: Preserving Semantic Neighborhoods for Robust Cross-modal Retrieval
Abstract:
The abundance of multimodal data (e.g. social media posts) has inspired interest in cross-modal retrieval methods. Popular approaches rely on a variety of metric learning losses, which prescribe what the proximity of image and text should be, in the learned space. However, most prior methods have focused on the case where image and text convey redundant information; in contrast, real-world image-text pairs convey complementary information with little overlap. Further, images in news articles and media portray topics in a visually diverse fashion; thus, we need to take special care to ensure a meaningful image representation. We propose novel within-modality losses which encourage semantic coherency in both the text and image subspaces, which does not necessarily align with visual coherency. Our method ensures that not only are paired images and texts close, but the expected image-image and text-text relationships are also observed. Our approach improves the results of cross-modal retrieval on four datasets compared to five baselines."
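A minimal PyTorch sketch of the idea of combining a cross-modal alignment term with a within-modality term that keeps semantically related texts (and likewise images) close: the triplet formulation, margin, and weighting below are illustrative stand-ins for the paper's actual losses.

```python
import torch
import torch.nn.functional as F

def triplet(anchor, positive, negative, margin=0.2):
    # Cosine-distance triplet loss over batches of embeddings.
    d_pos = 1 - F.cosine_similarity(anchor, positive)
    d_neg = 1 - F.cosine_similarity(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

def retrieval_loss(img, txt, txt_neighbor, txt_negative, alpha=0.5):
    """Cross-modal alignment plus a within-modality (text) neighborhood term."""
    cross = triplet(img, txt, txt_negative)            # paired image/text stay close
    within = triplet(txt, txt_neighbor, txt_negative)  # semantically related texts stay close
    return cross + alpha * within

# Toy usage with 16 L2-normalized 256-D embeddings per role.
e = lambda: F.normalize(torch.randn(16, 256), dim=1)
loss = retrieval_loss(e(), e(), e(), e())
```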



Paperid:800
Authors:Vishvak Murahari, Dhruv Batra, Devi Parikh, Abhishek Das
Title: Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
Abstract:
Prior work in visual dialog has focused on training deep neural models on VisDial in isolation. Instead, we present an approach to leverage pretraining on related vision-language datasets before transferring to visual dialog. We adapt the recently proposed ViLBERT model (Lu et al. 2019) for multi-turn visually-grounded conversations. Our model is pretrained on the Conceptual Captions and Visual Question Answering datasets, and finetuned on VisDial. Our best single model outperforms prior published work (including model ensembles) by more than 1% absolute on NDCG and MRR. Next, we find that additional finetuning using "dense" annotations in VisDial leads to even higher NDCG -- more than 10% over our base model -- but hurts MRR -- more than 17% below our base model! This highlights a trade-off between the two primary metrics -- NDCG and MRR -- which we find is due to dense annotations not correlating well with the original ground-truth answers to questions."



Paperid:801
Authors:Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira
Title: Learning to Generate Grounded Visual Captions without Localization Supervision
Abstract:
When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is whether the model uses the correct image regions to output particular words, or if the model is hallucinating based on priors in the dataset and/or the language model. The most common way of relating image regions with words in caption models is through an attention mechanism over the regions that are used as input to predict the next word. The model must therefore learn to predict the attentional weights without knowing the word it should localize. This is difficult to train without grounding supervision since recurrent models can propagate past information and there is no explicit signal to force the captioning model to properly ground the individual decoded words. In this work, we help the model to achieve this via a novel cyclical training regimen that forces the model to localize each word in the image after the sentence decoder generates it, and then reconstruct the sentence from the localized image region(s) to match the ground-truth. Our proposed framework only requires learning one extra fully-connected layer (the localizer), a layer that can be removed at test time. We show that our model significantly improves grounding accuracy without relying on grounding supervision or introducing extra computation during inference, for both image and video captioning tasks. Code is available at https://github.com/chihyaoma/cyclical-visual-captioning ."



Paperid:802
Authors:Menglei Chai, Jian Ren, Sergey Tulyakov
Title: Neural Hair Rendering
Abstract:
In this paper, we propose a generic neural-based hair rendering pipeline that can synthesize photo-realistic images from virtual 3D hair models. Unlike existing supervised translation methods that require model-level similarity to preserve consistent structure representation for both real images and fake renderings, our method adopts an unsupervised solution to work on arbitrary hair models. The key component of our method is a shared latent space to encode appearance-invariant structure information of both domains, which generates realistic renderings conditioned by extra appearance inputs. This is achieved by domain-specific pre-disentangled structure representation, partially shared domain encoder layers and a structure discriminator. We also propose a simple yet effective temporal conditioning method to enforce consistency for video sequence generation. We demonstrate the superiority of our method by testing it on a large number of portraits and comparing it with alternative baselines and state-of-the-art unsupervised image translation methods."



Paperid:803
Authors:Noranart Vesdapunt, Mitch Rundle, HsiangTao Wu, Baoyuan Wang
Title: JNR: Joint-based Neural Rig Representation for Compact 3D Face Modeling
Abstract:
In this paper, we introduce a novel approach to learn a 3D face model using a joint-based face rig and a neural skinning network. Thanks to the joint-based representation, our model enjoys some significant advantages over prior blendshape-based models. First, it is very compact: it is orders of magnitude smaller while still keeping strong modeling capacity. Second, because each joint has its semantic meaning, interactive facial geometry editing is made easier and more intuitive. Third, through skinning, our model supports adding mouth interior and eyes, as well as accessories (hair, eyeglasses, etc.) in a simpler, more accurate and principled way. We argue that because the human face is highly structured and topologically consistent, it does not need to be learned entirely from data. Instead we can leverage prior knowledge in the form of a human-designed 3D face rig to reduce the data dependency, and learn a compact yet strong face model from only a small dataset (less than one hundred 3D scans). To further improve the modeling capacity, we train a skinning weight generator through adversarial learning. Experiments on fitting high-quality 3D scans (both neutral and expressive), noisy depth images, and RGB images demonstrate that its modeling capacity is on par with state-of-the-art face models, such as FLAME and FaceWarehouse, even though the model is 10 to 20 times smaller. This suggests broad value in both graphics and vision applications on mobile and edge devices."



Paperid:804
Authors:Yaojie Liu, Joel Stehouwer, Xiaoming Liu
Title: On Disentangling Spoof Trace for Generic Face Anti-Spoofing
Abstract:
Prior studies show that the key to face anti-spoofing lies in the subtle image pattern, termed “spoof trace”, e.g., color distortion, 3D mask edge, Moiré pattern, and many others. Designing a generic anti-spoofing model to estimate those spoof traces can improve not only the generalization of the spoof detection, but also the interpretability of the model’s decision. Yet, this is a challenging task due to the diversity of spoof types and the lack of ground truth in spoof traces. This work designs a novel adversarial learning framework to disentangle the spoof traces from input faces as a hierarchical combination of patterns at multiple scales. With the disentangled spoof traces, we unveil the live counterpart of the original spoof face, and further synthesize realistic new spoof faces after a proper geometric correction. Our method demonstrates superior spoof detection performance on both seen and unseen spoof scenarios while providing visually-convincing estimation of spoof traces. Code is available at https://github.com/yaojieliu/ECCV20-STDN."



Paperid:805
Authors:Wei Han, Zhengdong Zhang, Benjamin Caine, Brandon Yang, Christoph Sprunk, Ouais Alsharif, Jiquan Ngiam, Vijay Vasudevan, Jonathon Shlens, Zhifeng Chen
Title: Streaming Object Detection for 3-D Point Clouds
Abstract:
Autonomous vehicles operate in a dynamic environment, where the speed with which a vehicle can perceive and react impacts the safety and efficacy of the system. LiDAR provides a central and prominent sensory modality that informs many existing perceptual systems including object detection, segmentation, motion estimation and action recognition. The latency of a perceptual system based on point cloud data can be dominated by the amount of time for a complete rotational scan (e.g. 100 ms). This built-in data capture latency is artificial, and stems from treating the point cloud as a camera image in order to leverage camera-inspired architectures. However, unlike CCD camera sensors, most LiDAR point cloud data is natively a {\it streaming} data source in which laser reflections are sequentially recorded based on the rotational angle of the laser beam. In this work, we explore how to build an object detector that removes this artificial latency constraint, and instead operates on native streaming data in order to significantly reduce latency. This approach has the added benefit of reducing the peak computational burden on inference hardware by spreading the computation over the complete scan sequence. We demonstrate a family of streaming detection systems based on sequential modeling through a series of modifications to the traditional object detection meta-architecture. We highlight how this model may achieve competitive, if not superior, predictive performance compared with state-of-the-art traditional non-streaming detection systems while achieving significant latency gains (e.g. 1/15th to 1/3rd of peak latency). Our results show that operating on LiDAR data in its native streaming formulation offers several advantages for self-driving object detection -- advantages that we hope will be useful for any LiDAR perception system where minimizing latency is critical for safe and efficient operation."



Paperid:806
Authors:Yun-Chun Chen, Chen Gao, Esther Robb, Jia-Bin Huang
Title: NAS-DIP: Learning Deep Image Prior with Neural Architecture Search
Abstract:
Recent work has shown that the structure of deep convolutional neural networks can be used as a structured image prior for solving various inverse image restoration tasks. Instead of using hand-designed architectures, we propose to search for neural architectures that capture stronger image priors. Building upon a generic U-Net architecture, our core contribution lies in designing new search spaces for (1) an upsampling cell and (2) a pattern of cross-scale residual connections. We search for an improved network by leveraging an existing neural architecture search algorithm (using reinforcement learning with a recurrent neural network controller). We validate the effectiveness of our method via a wide variety of applications, including image restoration, dehazing, image-to-image translation, and matrix factorization. Extensive experimental results show that our algorithm performs favorably against state-of-the-art learning-free approaches and reaches competitive performance with existing learning-based methods in some cases."



Paperid:807
Authors:Yun-Chun Chen, Chao-Te Chou, Yu-Chiang Frank Wang
Title: Learning to Learn in a Semi-Supervised Fashion
Abstract:
To address semi-supervised learning from both labeled and unlabeled data, we present a novel meta-learning scheme. We particularly consider that labeled and unlabeled data share disjoint ground truth label sets, as can be seen in tasks like person re-identification or image retrieval. Our learning scheme exploits the idea of leveraging information from labeled to unlabeled data. Instead of fitting the associated class-wise similarity scores as most meta-learning algorithms do, we propose to derive semantics-oriented similarity representations from labeled data, and transfer such representations to unlabeled data. Thus, our strategy can be viewed as a self-supervised learning scheme, which can be applied to fully supervised learning tasks for improved performance. Our experiments on various tasks and settings confirm the effectiveness of our proposed approach and its superiority over the state-of-the-art methods."



Paperid:808
Authors:Chia-Wen Kuo, Chih-Yao Ma, Jia-Bin Huang, Zsolt Kira
Title: FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning
Abstract:
Recent state-of-the-art semi-supervised learning (SSL) methods use a combination of image-based transformations and consistency regularization as core components. Such methods, however, are limited to simple transformations such as traditional data augmentation or convex combinations of two images. In this paper, we propose a novel learned feature-based refinement and augmentation method that produces a varied set of complex transformations. Importantly, these transformations also use information from both within-class and across-class prototypical representations that we extract through clustering. We use features already computed across iterations by storing them in a memory bank, obviating the need for significant extra computation. These transformations, combined with traditional image-based augmentation, are then used as part of the consistency-based regularization loss. We demonstrate that our method is comparable to the current state of the art for smaller datasets (CIFAR-10 and SVHN) while being able to scale up to larger datasets such as CIFAR-100 and mini-ImageNet where we achieve significant gains over the state of the art (e.g., an absolute 17.44% gain on mini-ImageNet). We further test our method on DomainNet, demonstrating better robustness to out-of-domain unlabeled data, and perform rigorous ablations and analysis to validate the method."
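
The sketch below illustrates the feature-level refinement idea from the abstract: features attend over class prototypes kept in a memory bank, and a consistency term ties the predictions on refined features to those on the originals. Shapes, the attention form, and the detaching of one branch are assumptions for illustration, not the published FeatMatch formulation.

import torch
import torch.nn.functional as F

def refine_features(feats, prototypes, temperature=0.1):
    """Attend over prototypes (e.g. cluster centers from a memory bank)
    and mix the attended prototype back into each feature."""
    # feats: (B, D), prototypes: (K, D)
    sim = feats @ prototypes.t() / temperature   # (B, K) similarities
    attn = sim.softmax(dim=1)                    # attention over prototypes
    attended = attn @ prototypes                 # (B, D) prototype mixture
    return F.normalize(feats + attended, dim=1)  # refined (augmented) features

def consistency_loss(logits_original, logits_refined):
    """Predictions on refined features should match the (detached)
    predictions on the original features."""
    target = logits_original.detach().softmax(dim=1)
    return F.kl_div(logits_refined.log_softmax(dim=1), target,
                    reduction="batchmean")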



Paperid:809
Authors:Bin Yang, Runsheng Guo, Ming Liang, Sergio Casas, Raquel Urtasun
Title: RadarNet: Exploiting Radar for Robust Perception of Dynamic Objects
Abstract:
We tackle the problem of exploiting Radar for perception in the context of self-driving as Radar provides complementary information to other sensors such as LiDAR or cameras in the form of Doppler velocity. The main challenges of using Radar are the noise and measurement ambiguities, which existing simple input or output fusion methods struggle to handle. To better address this, we propose a new solution that exploits both LiDAR and Radar sensors for perception. Our approach, dubbed RadarNet, features a voxel-based early fusion and an attention-based late fusion, which learn from data to exploit both geometric and dynamic information of Radar data. RadarNet achieves state-of-the-art results on two large-scale real-world datasets in the tasks of object detection and velocity estimation. We further show that exploiting Radar improves the perception capabilities of detecting faraway objects and understanding the motion of dynamic objects."



Paperid:810
Authors:Medhini Narasimhan, Erik Wijmans, Xinlei Chen, Trevor Darrell, Dhruv Batra, Devi Parikh, Amanpreet Singh
Title: Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation
Abstract:
We introduce a learning-based approach for room navigation using semantic maps. Our proposed architecture learns to predict top-down belief maps of regions that lie beyond the agent’s field of view while modeling architectural and stylistic regularities in houses. First, we train a model to generate amodal semantic top-down maps indicating beliefs of location, size, and shape of rooms by learning the underlying architectural patterns in houses. Next, we use these maps to predict a point that lies in the target room and train a policy to navigate to the point. We empirically demonstrate that by predicting semantic maps, the model learns common correlations found in houses and generalizes to novel environments. We also demonstrate that reducing the task of room navigation to point navigation improves the performance further. We will make our code publicly available and hope our work paves the way for further research in this space. "



Paperid:811
Authors:Chenhongyi Yang, Vitaly Ablavsky, Kaihong Wang, Qi Feng, Margrit Betke
Title: Learning to Separate: Detecting Heavily-Occluded Objects in Urban Scenes
Abstract:
While visual object detection with deep learning has received much attention in the past decade, cases when heavy intra-class occlusions occur have not been studied thoroughly. In this work, we propose a novel Non-Maximum-Suppression (NMS) algorithm that dramatically improves the detection recall while maintaining high precision in scenes with heavy occlusions. Our NMS algorithm is derived from a novel embedding mechanism, in which the semantic and geometric features of the detected boxes are jointly exploited. The embedding makes it possible to determine whether two heavily-overlapping boxes belong to the same object in the physical world. Our approach is particularly useful for car detection and pedestrian detection in urban scenes where occlusions often happen. We show the effectiveness of our approach by creating a model called SG-Det (short for Semantics and Geometry Detection) and testing SG-Det on two widely-adopted datasets, KITTI and CityPersons for which it achieves state-of-the-art performance."
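
A toy sketch of the embedding-aware suppression idea described above: a box is suppressed only when it both overlaps a kept box heavily and lies close to it in the learned embedding space (i.e. likely the same physical object). The thresholds, the embedding source, and the exact decision rule are illustrative assumptions, not the published SG-Det algorithm.

import torch

def iou(box, boxes):
    # box: (4,), boxes: (N, 4) in (x1, y1, x2, y2)
    x1 = torch.maximum(box[0], boxes[:, 0])
    y1 = torch.maximum(box[1], boxes[:, 1])
    x2 = torch.minimum(box[2], boxes[:, 2])
    y2 = torch.minimum(box[3], boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def embedding_nms(boxes, scores, embeds, iou_thr=0.5, emb_thr=1.0):
    """Keep the highest-scoring boxes; suppress a box only if it overlaps a
    kept box AND is close to it in embedding space."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        overlaps = iou(boxes[i], boxes[rest])
        emb_dist = (embeds[rest] - embeds[i]).norm(dim=1)
        suppress = (overlaps > iou_thr) & (emb_dist < emb_thr)
        order = rest[~suppress]
    return keep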



Paperid:812
Authors:Guha Balakrishnan, Yuanjun Xiong, Wei Xia, Pietro Perona
Title: Towards causal benchmarking of bias in face analysis algorithms
Abstract:
Measuring algorithmic bias is crucial both to assess algorithmic fairness, and to guide the improvement of algorithms. Current bias measurement methods in computer vision are based on observational datasets, and conflate algorithmic bias with dataset bias. To address this problem we develop an experimental method for measuring algorithmic bias of face analysis algorithms, which directly manipulates the attributes of interest, e.g., gender and skin tone, in order to reveal causal links between attribute variation and performance change. Our method is based on generating synthetic image grids that differ along specific attributes while leaving other attributes constant. A crucial aspect of our approach is relying on the perception of human observers, both to guide manipulations, and to measure algorithmic bias. We validate our method by comparing it to a traditional observational bias analysis study in gender classification algorithms. The two methods reach different conclusions. While the observational method reports gender and skin color biases, the experimental method reveals biases due to gender, hair length, age, and facial hair. We also show that our synthetic transects allow for more straightforward bias analysis on minority and intersectional groups."



Paperid:813
Authors:Tong He, Dong Gong, Zhi Tian, Chunhua Shen
Title: Learning and Memorizing Representative Prototypes for 3D Point Cloud Semantic and Instance Segmentation
Abstract:
3D point cloud semantic and instance segmentation are crucial and fundamental for 3D scene understanding. Due to the complex scene structure, point sets are distributed unevenly and diversely, exhibiting both category and pattern imbalance. It has been shown that deep networks can easily forget the non-dominant cases during training, which harms model generalization and leads to unsatisfactory performance. Although re-weighting instances may reduce this influence, it is hard to find a balance between the dominant and the non-dominant cases. To tackle the above issue, we propose a memory-augmented network that learns and memorizes representative prototypes that encode both geometry and semantic information. The prototypes are shared by diverse 3D points and recorded in a universal memory module. During training, the memory slots are dynamically associated with both dominant and non-dominant cases, alleviating the forgetting issue. In testing, distorted observations and rare cases can thus be augmented by retrieving the stored prototypes, leading to better generalization. Experiments on the benchmarks, i.e., S3DIS and ScanNetV2, show the superiority of our method in both effectiveness and efficiency, which substantially improves the accuracy not only on the entire dataset but also on non-dominant classes and samples."



Paperid:814
Authors:Noa Garcia, Yuta Nakashima
Title: Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions
Abstract:
To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each cognition-inspired task is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state-of-the-art on two challenging video question answering datasets: KnowIT VQA and TVQA+."



Paperid:815
Authors:Aamir Mustafa, Rafal K. Mantiuk
Title: Transformation Consistency Regularization – A Semi-Supervised Paradigm for Image-to-Image Translation
Abstract:
Scarcity of labeled data has motivated the development of semi-supervised learning methods, which learn from large portions of unlabeled data alongside a few labeled samples. Consistency regularization between a model's predictions under different input perturbations, in particular, has been shown to provide state-of-the-art results in a semi-supervised framework. However, most of these methods have been limited to classification and segmentation applications. We propose Transformation Consistency Regularization, which delves into the more challenging setting of image-to-image translation, a setting that remains unexplored by semi-supervised algorithms. The method introduces a diverse set of geometric transformations and enforces the model's predictions for unlabeled data to be invariant to those transformations. We evaluate the efficacy of our algorithm on three different applications: image colorization, denoising and super-resolution. Our method is significantly data-efficient, requiring only around 10-20% of labeled samples to achieve similar image reconstructions to its fully-supervised counterpart. Furthermore, we show the effectiveness of our method in video processing applications, where knowledge from a few frames can be leveraged to enhance the quality of the rest of the movie. "
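
A minimal sketch of such a transformation-consistency term for unlabeled images: the network's output for a transformed input should match the same transform applied to its output for the original input. The choice of rotation as the transform, the L2 penalty, and the detaching of one branch are illustrative assumptions; the paper's exact loss and set of geometric transformations may differ.

import torch
import torch.nn.functional as F

def tcr_loss(model, unlabeled, k=1):
    """unlabeled: (B, C, H, W) images without ground truth."""
    with torch.no_grad():
        base = model(unlabeled)                        # prediction on originals
    t_input = torch.rot90(unlabeled, k, dims=(2, 3))   # geometric transform T(x)
    t_target = torch.rot90(base, k, dims=(2, 3))       # T applied to the prediction
    # encourage f(T(x)) to agree with T(f(x))
    return F.mse_loss(model(t_input), t_target)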



Paperid:816
Authors:Jianzhao Liu, Jianxin Lin, Xin Li, Wei Zhou, Sen Liu, Zhibo Chen
Title: LIRA: Lifelong Image Restoration from Unknown Blended Distortions
Abstract:
Most existing image restoration networks are designed in a disposable way and catastrophically forget previously learned distortions when trained on a new distortion removal task. To alleviate this problem, we introduce the novel lifelong image restoration problem for blended distortions. We first design a base fork-join model in which multiple pre-trained expert models specializing in individual distortion removal tasks work cooperatively and adaptively to handle blended distortions. When the input is degraded by a new distortion, inspired by adult neurogenesis in the human memory system, we develop a neural growing strategy where the previously trained model can incorporate a new expert branch and continually accumulate new knowledge without interfering with learned knowledge. Experimental results show that the proposed approach can not only achieve state-of-the-art performance on blended distortion removal tasks in both PSNR/SSIM metrics, but also maintain old expertise while learning new restoration tasks."



Paperid:817
Authors:Jiahao Lin, Gim Hee Lee
Title: HDNet: Human Depth Estimation for Multi-Person Camera-Space Localization
Abstract:
Current works on multi-person 3D pose estimation mainly focus on the estimation of the 3D joint locations relative to the root joint and ignore the absolute locations of each pose. In this paper, we propose the Human Depth Estimation Network (HDNet), an end-to-end framework for absolute root joint localization in the camera coordinate space. Our HDNet first estimates the 2D human pose with heatmaps of the joints. These estimated heatmaps serve as attention masks for pooling features from image regions corresponding to the target person. A skeleton-based Graph Neural Network (GNN) is utilized to propagate features among joints. We formulate the target depth regression as a bin index estimation problem, which can be transformed with a soft-argmax operation from the classification output of our HDNet. We evaluate our HDNet on the root joint localization and root-relative 3D pose estimation tasks with two benchmark datasets, i.e., Human3.6M and MuPoTS-3D. The experimental results show that we outperform the previous state-of-the-art consistently under multiple evaluation metrics. Our source code is available at: https://github.com/jiahaoLjh/HumanDepth ."
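
A small sketch of recovering a continuous depth value from bin-classification logits with a soft-argmax, as the abstract describes. The bin layout (uniform centers between an assumed d_min and d_max) is an illustrative assumption; HDNet's actual bin discretization may differ.

import torch

def soft_argmax_depth(bin_logits, d_min=1.0, d_max=10.0):
    """bin_logits: (B, K) classification output over K depth bins."""
    k = bin_logits.shape[1]
    centers = torch.linspace(d_min, d_max, k, device=bin_logits.device)
    probs = bin_logits.softmax(dim=1)           # soft bin assignment
    return (probs * centers).sum(dim=1)         # (B,) expected continuous depth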



Paperid:818
Authors:Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, Lei Li
Title: SOLO: Segmenting Objects by Locations
Abstract:
We present a new, embarrassingly simple approach to instance segmentation in images. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that has made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the 'detect-then-segment' strategy as used by Mask R-CNN, or predict category masks first then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance mask segmentation into a single-shot classification-solvable problem. We demonstrate a much simpler and more flexible instance segmentation framework with strong performance, achieving on-par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this very simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation."



Paperid:819
Authors:Song Zhang, Yu Zhang, Zhe Jiang, Dongqing Zou, Jimmy Ren, Bin Zhou
Title: Learning to See in the Dark with Events
Abstract:
Imaging in dark environments is important for many real-world applications like video surveillance. Recently, the development of Event Cameras raises promising directions in solving this task thanks to their High Dynamic Range (HDR) and low requirement for computational resources. However, such cameras record sparse, asynchronous intensity changes of the scene (called events), instead of canonical images. In this paper, we propose learning to see in the dark by translating the HDR events in low light to canonical sharp images as if captured in daylight. Since it is extremely challenging to collect paired event-image training data, a novel unsupervised domain adaptation network is proposed which explicitly separates domain-invariant features (e.g. scene structures) from the domain-specific ones (e.g. detailed textures) to ease representation learning. A detail enhancing branch is proposed to reconstruct daylight-specific features from the domain-invariant representations in a residual manner, regularized by a ranking loss. To evaluate the proposed approach, a novel large-scale dataset is captured with a DAVIS240C camera with both daylight/low-light events and intensity images. Experiments on this dataset show that the proposed domain adaptation approach achieves superior performance to various state-of-the-art architectures."



Paperid:820
Authors:Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, Marco Pavone
Title: Trajectron++: Dynamically-Feasible Trajectory Forecasting With Heterogeneous Data
Abstract:
Reasoning about human motion is an important prerequisite to safe and socially-aware robotic navigation. As a result, multi-agent behavior prediction has become a core component of modern human-robot interactive systems, such as self-driving cars. While there exist many methods for trajectory forecasting, most do not enforce dynamic constraints and do not account for environmental information (e.g., maps). Towards this end, we present Trajectron++, a modular, graph-structured recurrent model that forecasts the trajectories of a general number of diverse agents while incorporating agent dynamics and heterogeneous data (e.g., semantic maps). Trajectron++ is designed to be tightly integrated with robotic planning and control frameworks; for example, it can produce predictions that are optionally conditioned on ego-agent motion plans. We demonstrate its performance on several challenging real-world trajectory forecasting datasets, outperforming a wide array of state-of-the-art deterministic and generative methods."



Paperid:821
Authors:Xudong Lin, Lin Ma, Wei Liu, Shih-Fu Chang
Title: Context-Gated Convolution
Abstract:
As the basic building block of Convolutional Neural Networks (CNNs), the convolutional layer is designed to extract local patterns and by nature lacks the ability to model global context. Many efforts have been recently devoted to complementing CNNs with the global modeling ability, especially by a family of works on global feature interaction. In these works, the global context information is incorporated into local features before they are fed into convolutional layers. However, neuroscience research reveals that neurons' ability to modify their functions dynamically according to context is essential for perceptual tasks, an ability that has been overlooked in most CNNs. Motivated by this, we propose a novel Context-Gated Convolution (CGC) to explicitly modify the weights of convolutional layers adaptively under the guidance of global context. As such, being aware of the global context, the modulated convolution kernel of our proposed CGC can better extract representative local patterns and compose discriminative features. Moreover, our proposed CGC is lightweight and applicable to modern CNN architectures, and consistently improves the performance of CNNs according to extensive experiments on image classification, action recognition, and machine translation. Our code of this paper is available at https://github.com/XudongLinthu/context-gated-convolution."
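
The sketch below is a heavily simplified illustration of gating convolution weights with global context: a pooled context vector produces per-output-channel gates that modulate the kernel before the convolution is applied. The per-channel sigmoid gate and per-sample loop are reductions for clarity, not the full CGC formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextGatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, k=3, reduction=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        self.gate = nn.Sequential(
            nn.Linear(in_ch, in_ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(in_ch // reduction, out_ch), nn.Sigmoid())

    def forward(self, x):
        ctx = x.mean(dim=(2, 3))        # (B, C_in) global context vector
        g = self.gate(ctx)              # (B, C_out) context-dependent gates
        outs = []
        for b in range(x.size(0)):      # per-sample modulated kernel
            w = self.weight * g[b].view(-1, 1, 1, 1)
            outs.append(F.conv2d(x[b:b + 1], w,
                                 padding=self.weight.shape[-1] // 2))
        return torch.cat(outs, dim=0)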



Paperid:822
Authors:Bingke Wang, Zilei Wang, Yixin Zhang
Title: Polynomial Regression Network for Variable-Number Lane Detection
Abstract:
Lane detection is a fundamental yet challenging task in autonomous driving and intelligent traffic systems due to perspective projection and occlusion. Most previous methods utilize semantic segmentation to identify the regions of traffic lanes in an image, and then adopt some curve-fitting method to reconstruct the lanes. In this work, we propose to use polynomial curves to represent traffic lanes and then propose a novel polynomial regression network (PRNet) to directly predict them, where semantic segmentation is not involved. Specifically, PRNet consists of one major branch and two auxiliary branches: (1) polynomial regression to estimate the polynomial coefficients of lanes, (2) initialization classification to detect the initial retrieval point of each lane, and (3) height regression to determine the ending point of each lane. Through the cooperation of the three branches, PRNet can detect a variable number of lanes and is highly effective and efficient. We experimentally evaluate the proposed PRNet on two popular benchmark datasets: TuSimple and CULane. The results show that our method significantly outperforms the previous state-of-the-art methods in terms of both accuracy and speed. "
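
As a small illustration of the lane representation the three branches imply, the sketch below decodes one lane as x = poly(y) sampled between its initial retrieval point and its ending height. The sampling density and coefficient ordering are assumptions for illustration.

import numpy as np

def decode_lane(coeffs, y_start, y_end, num_points=20):
    """coeffs: polynomial coefficients (highest order first).
    Returns (num_points, 2) lane points as (x, y) pairs."""
    ys = np.linspace(y_start, y_end, num_points)   # from initial point to ending height
    xs = np.polyval(coeffs, ys)                    # evaluate the regressed polynomial
    return np.stack([xs, ys], axis=1)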



Paperid:823
Authors:Wenzhao Zheng, Jiwen Lu, Jie Zhou
Title: Structural Deep Metric Learning for Room Layout Estimation
Abstract:
In this paper, we propose a structural deep metric learning (SDML) method for room layout estimation, which aims to recover the 3D spatial layout of a cluttered indoor scene from a monocular RGB image. Different from existing room layout estimation methods that solve a regression or per-pixel classification problem, we formulate the room layout estimation problem from a metric learning perspective where we explicitly model the structural relations across different images. We propose to learn a latent embedding space where the Euclidean distance can characterize the actual structural difference between the layouts of two rooms. We then minimize the discrepancy between an image and its ground-truth layout in the learned embedding space. We employ a metric model and a layout encoder to map the RGB images and the ground-truth layouts to the embedding space, respectively, and a layout decoder to map the embeddings to the corresponding layouts, where the whole framework is trained in an end-to-end manner. We perform experiments on the widely used Hedau and LSUN datasets and achieve state-of-the-art performance. "



Paperid:824
Authors:Chenghao Liu, Zhihao Wang, Doyen Sahoo, Yuan Fang Kun Zhang, Steven C.H. Hoi
Title: Adaptive Task Sampling for Meta-Learning
Abstract:
Meta-learning methods have been extensively studied and applied in computer vision, especially for few-shot classification tasks. The key idea of meta-learning for few-shot classification is to mimic the few-shot situations faced at test time by randomly sampling classes in meta-training data to construct few-shot tasks for episodic training. While a rich line of work focuses solely on how to extract meta-knowledge across tasks, we exploit the complementary problem of how to generate informative tasks. We argue that the randomly sampled tasks could be sub-optimal and uninformative (e.g., the task of distinguishing "dog" from "laptop" is often trivial) for the meta-learner. In this paper, we propose an adaptive task sampling method to improve the generalization performance. Unlike instance-based sampling, task-based sampling is much more challenging due to the implicit definition of the task in each episode. We therefore propose a greedy class-pair based sampling method, which selects difficult tasks according to class-pair potentials. We evaluate our adaptive task sampling method on two few-shot classification benchmarks, and it achieves consistent improvements across different feature backbones, meta-learning algorithms and datasets."
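
An illustrative sketch of greedy class-pair sampling for an N-way episode: start from the hardest class pair, then repeatedly add the class with the largest total potential with respect to the already-selected set. How the potential matrix is built (e.g. from class confusion) and the exact greedy rule are assumptions, not the paper's precise procedure.

import numpy as np

def greedy_class_pair_sampling(potentials, n_way):
    """potentials: (C, C) symmetric matrix of class-pair difficulty."""
    p = potentials.astype(float)
    np.fill_diagonal(p, -np.inf)
    i, j = np.unravel_index(np.argmax(p), p.shape)     # hardest pair first
    selected = [int(i), int(j)]
    while len(selected) < n_way:
        remaining = [k for k in range(p.shape[0]) if k not in selected]
        scores = [p[k, selected].sum() for k in remaining]   # total potential to selected set
        selected.append(remaining[int(np.argmax(scores))])
    return selected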



Paperid:825
Authors:Yuting He, Tiantian Li, Guanyu Yang, Youyong Kong, Yang Chen, Huazhong Shu, Jean-Louis Coatrieux, Jean-Louis Dillenseger, Shuo Li
Title: Deep Complementary Joint Model for Complex Scene Registration and Few-shot Segmentation on Medical Images
Abstract:
Deep learning-based medical image registration and segmentation joint models utilize the complementarity (augmentation data or weakly supervised data from registration, region constraints from segmentation) to bring mutual improvement in complex scenes and few-shot situations. However, further adoption of the joint models is hindered: 1) the diversity of augmentation data is reduced, limiting the further enhancement of segmentation, 2) misaligned regions in weakly supervised data disturb the training process, 3) the lack of label-based region constraints in the few-shot situation limits the registration performance. We propose a novel Deep Complementary Joint Model (DeepRS) for complex scene registration and few-shot segmentation. We embed a perturbation factor in the registration to increase the activity of deformation, thus maintaining the diversity of augmentation data. We take a pixel-wise discriminator to extract alignment confidence maps which highlight aligned regions in weakly supervised data, so that the disturbance from misaligned regions is suppressed via weighting. The outputs from the segmentation model are utilized to implement deep-based region constraints, thus relieving the label requirements and bringing fine registration. Extensive experiments on the CT dataset of the MM-WHS 2017 Challenge show the clear advantages of our DeepRS, which outperforms the existing state-of-the-art models."



Paperid:826
Authors:Kailai Zhou, Linsen Chen, Xun Cao
Title: Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems
Abstract:
Multispectral pedestrian detection is capable of adapting to insufficient illumination conditions by leveraging color-thermal modalities. On the other hand, in-depth insights on how to fuse the two modalities effectively are still lacking. Compared with traditional pedestrian detection, we find that multispectral pedestrian detection suffers from modality imbalance problems, which hinder the optimization of the dual-modality network and degrade the detector's performance. Inspired by this observation, we propose Modality Balance Network (MBNet) which facilitates the optimization process in a much more flexible and balanced manner. Firstly, we design a novel Differential Modality Aware Fusion (DMAF) module to make the two modalities complement each other. Secondly, an illumination aware feature alignment module selects complementary features according to the illumination conditions and aligns the two modality features adaptively. Extensive experimental results demonstrate MBNet outperforms the state of the art on both the challenging KAIST and CVC-14 multispectral pedestrian datasets in terms of accuracy and computational efficiency."



Paperid:827
Authors:Yu Zeng, Zhe Lin, Jimei Yang, Jianming Zhang, Eli Shechtman, Huchuan Lu
Title: High-Resolution Image Inpainting with Iterative Confidence Feedback and Guided Upsampling
Abstract:
Existing image inpainting methods often produce artifacts when dealing with large holes in real applications. To address this challenge, we propose an iterative inpainting method with a feedback mechanism. Specifically, we introduce a deep generative model which outputs not only an inpainting result but also a corresponding confidence map. Using this map as feedback, it progressively fills the hole by trusting only high-confidence pixels inside the hole at each iteration and focuses on the remaining pixels in the next iteration. As it reuses partial predictions from the previous iterations as known pixels, this process gradually improves the result. In addition, we propose a guided upsampling network to enable generation of high-resolution inpainting results. We achieve this by extending the Contextual Attention module to borrow high-resolution feature patches in the input image. Furthermore, to mimic real object removal scenarios, we collect a large object mask dataset and synthesize more realistic training data that better simulates user inputs. Experiments show that our method significantly outperforms existing methods in both quantitative and qualitative evaluations. "
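
A schematic sketch of the iterative fill loop the abstract suggests: at each step the generator predicts an image and a confidence map, pixels above a confidence threshold are committed as known, and the hole mask shrinks for the next iteration. The `generator` interface, the threshold, and the compositing rule are assumptions for illustration.

import torch

def iterative_inpaint(generator, image, mask, steps=4, conf_thr=0.7):
    """image: (B, 3, H, W); mask: (B, 1, H, W) with 1 = hole, 0 = known."""
    for _ in range(steps):
        pred, conf = generator(image, mask)            # inpainting result + confidence map
        trusted = (conf > conf_thr).float() * mask     # confident pixels inside the hole
        image = image * (1 - trusted) + pred * trusted # commit trusted predictions
        mask = mask * (1 - trusted)                    # those pixels become known
        if mask.sum() == 0:
            break
    return image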



Paperid:828
Authors:Devesh Walawalkar, Zhiqiang Shen, Marios Savvides
Title: Online Ensemble Model Compression using Knowledge Distillation
Abstract:
This paper presents a novel knowledge distillation based model compression framework consisting of a student ensemble. It enables distillation of simultaneously learnt ensemble knowledge onto each of the compressed student models. Each model learns unique representations from the data distribution due to its distinct architecture. This helps the ensemble generalize better by combining every model’s knowledge. The distilled students and ensemble teacher are trained simultaneously without requiring any pretrained weights. Moreover, our proposed method can deliver multi-compressed students with a single training run, which is efficient and flexible for different scenarios. We provide comprehensive experiments using state-of-the-art classification models to validate our framework’s effectiveness. Notably, using our framework, a 97% compressed ResNet110 student model produced a 10.64% relative accuracy gain over its individual baseline training on the CIFAR100 dataset. Similarly, a 95% compressed DenseNet-BC (k=12) model achieved an 8.17% relative accuracy gain."
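
An illustrative sketch of an online ensemble distillation objective in this spirit: the (detached) average of the student logits acts as the ensemble teacher, and each student is trained with cross-entropy plus a KL term toward that ensemble. The temperature, weighting, and use of a simple average are assumptions, not necessarily the paper's exact loss.

import torch
import torch.nn.functional as F

def ensemble_kd_loss(student_logits, labels, T=3.0, alpha=0.5):
    """student_logits: list of (B, C) tensors, one per student."""
    teacher = torch.stack(student_logits).mean(dim=0).detach()   # ensemble teacher
    soft_target = (teacher / T).softmax(dim=1)
    loss = 0.0
    for logits in student_logits:
        ce = F.cross_entropy(logits, labels)                     # supervised term
        kd = F.kl_div((logits / T).log_softmax(dim=1), soft_target,
                      reduction="batchmean") * (T * T)           # distillation term
        loss = loss + (1 - alpha) * ce + alpha * kd
    return loss / len(student_logits)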



Paperid:829
Authors:Kang Il Lee, Jung Ho Jeon, Byung Cheol Song
Title: Deep Learning-based Pupil Center Detection for Fast and Accurate Eye Tracking System
Abstract:
In augmented reality (AR) or virtual reality (VR) systems, eye tracking is a key technology and requires significant accuracy as well as real-time operation. Many techniques detect pupil centers with an error range of the iris radius, but few achieve precise performance within an error range of the pupil radius. In addition, conventional methods rarely guarantee real-time pupil center detection in a general-purpose computer environment due to high complexity. Thus, we propose more accurate pupil center detection by improving the representation quality of the network in charge of pupil center detection. This is realized by representation learning based on mutual information. Also, the latency of the entire system is greatly reduced by using a non-local block and a self-attention block with a large receptive field, which enables real-time operation. The proposed system not only shows real-time performance of 52 FPS in a general-purpose computer environment but also provides state-of-the-art accuracy in terms of the fine-level index, with 96.71%, 99.84% and 96.38% on the BioID, GI4E and Talking Face Video datasets, respectively."



Paperid:830
Authors:Zhi-Gang Liu, Matthew Mattina
Title: Efficient Residue Number System Based Winograd Convolution
Abstract:
Prior research has shown that the Winograd algorithm can reduce the computational complexity of convolutional neural networks (CNN) with weights and activations represented in floating point. However, it is difficult to apply the scheme to the inference of low-precision quantized (e.g. INT8) networks. Our work extends the Winograd algorithm to the Residue Number System (RNS). The minimal complexity convolution is computed precisely over large transformation tiles (e.g. 10x10 to 16x16) of filters and activation patches using the Winograd transformation and low cost (e.g. 8-bit) arithmetic without degrading the prediction accuracy of the networks during inference. The arithmetic complexity reduction is up to 7.03x, while the performance improvement is up to 2.30x to 4.69x for 3x3 and 5x5 filters, respectively."



Paperid:831
Authors:Shuai Jia, Chao Ma, Yibing Song, Xiaokang Yang
Title: Robust Tracking against Adversarial Attacks
Abstract:
While deep convolutional neural networks (CNNs) are vulnerable to adversarial attacks, considerably little effort has been devoted to constructing deep tracking algorithms that are robust against adversarial attacks. Current studies on adversarial attack and defense mainly focus on single images. In this work, we first attempt to generate adversarial examples on top of video sequences to improve the tracking robustness against adversarial attacks. To this end, we take temporal motion into consideration when generating lightweight perturbations over the estimated tracking results frame-by-frame. On one hand, we add the temporal perturbations into the original video sequences as adversarial examples to greatly degrade the tracking performance. On the other hand, we sequentially estimate the perturbations from input sequences and learn to eliminate their effect for performance restoration. We apply the proposed adversarial attack and defense approaches to state-of-the-art deep tracking algorithms. Extensive evaluations on the benchmark datasets demonstrate that the proposed defense method not only eliminates the large performance drops caused by adversarial attacks, but also achieves additional performance gains when deep trackers are not under adversarial attacks."



Paperid:832
Authors:Shen Sang, Manmohan Chandraker
Title: Single-Shot Neural Relighting and SVBRDF Estimation
Abstract:
We present a novel physically-motivated deep network for joint shape and material estimation, as well as relighting under novel illumination conditions, using a single image captured by a mobile phone camera. Our physically-based modeling leverages a deep cascaded architecture trained on a large-scale synthetic dataset that consists of complex shapes with microfacet SVBRDF. In contrast to prior works that train rendering layers subsequent to inverse rendering, we propose deep feature sharing and joint training that transfer insights across both tasks, to achieve significant improvements in both reconstruction and relighting. We demonstrate in extensive qualitative and quantitative experiments that our network generalizes very well to real images, achieving high-quality shape and material estimation, as well as image-based relighting. Code, models and data will be publicly released."



Paperid:833
Authors:Qiang Nie , Ziwei Liu , Yunhui Liu
Title: Unsupervised 3D Human Pose Representation with Viewpoint and Pose Disentanglement
Abstract:
Learning a good 3D human pose representation is important for human pose related tasks, e.g. human 3D pose estimation and action recognition. Within all these problems, preserving the intrinsic pose information and adapting to view variations are two critical issues. In this work, we propose a novel Siamese denoising autoencoder to learn a 3D pose representation by disentangling the pose-dependent and view-dependent feature from the human skeleton data, in a fully unsupervised manner. These two disentangled features are utilized together as the representation of the 3D pose. To consider both the kinematic and geometric dependencies, a sequential bidirectional recursive network (SeBiReNet) is further proposed to model the human skeleton data. Extensive experiments demonstrate that the learned representation 1) preserves the intrinsic information of human pose, 2) shows good transferability across datasets and tasks. Notably, our approach achieves state-of-the-art performance on two inherently different tasks: pose denoising and unsupervised action recognition. Code and models are available at: https://github.com/NIEQiang001/unsupervised-human-pose.git."



Paperid:834
Authors:Yiming Hu, Yuding Liang, Zichao Guo, Ruosi Wan, Xiangyu Zhang, Yichen Wei, Qingyi Gu, Jian Sun
Title: Angle-based Search Space Shrinking for Neural Architecture Search
Abstract:
In this work, we present a simple and general search space shrinking method, called Angle-Based search space Shrinking (ABS), for Neural Architecture Search (NAS). Our approach progressively simplifies the original search space by dropping unpromising candidates, thus reducing the difficulty for existing NAS methods of finding superior architectures. In particular, we propose an angle-based metric to guide the shrinking process. We provide comprehensive evidence showing that, in a weight-sharing supernet, the proposed metric is more stable and accurate than accuracy-based and magnitude-based metrics in predicting the capability of child models. We also show that the angle-based metric converges fast while training the supernet, enabling us to obtain promising shrunk search spaces efficiently. ABS can be easily applied to most NAS approaches (e.g. SPOS, FairNAS, ProxylessNAS, DARTS and PDARTS). Comprehensive experiments show that ABS can dramatically enhance existing NAS approaches by providing a promising shrunk search space."
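
A minimal sketch of an angle-based score for a child architecture: the angle between its inherited weight vector before and after supernet training. Which weights are gathered per child model and how they are concatenated are assumptions here; the abstract only states that an angle-based metric scores child models in the weight-sharing supernet.

import torch

def angle_metric(init_weights, trained_weights):
    """Both arguments: lists of tensors for the child's operators,
    taken at supernet initialization and after supernet training."""
    v0 = torch.cat([w.flatten() for w in init_weights])
    v1 = torch.cat([w.flatten() for w in trained_weights])
    cos = torch.dot(v0, v1) / (v0.norm() * v1.norm() + 1e-12)
    # larger angle => the child's weights moved more during training
    return torch.acos(cos.clamp(-1.0, 1.0))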



Paperid:835
Authors:Xiaoyu Yue, Zhanghui Kuang, Chenhao Lin, Hongbin Sun, Wayne Zhang
Title: RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition
Abstract:
The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improvements in recognition quality. However, it performs poorly on contextless texts (e.g., random character sequences), which is unacceptable in most real application scenarios. In this paper, we first deeply investigate the decoding process of the decoder. We empirically find that a representative character-level sequence decoder utilizes not only context information but also positional information. Contextual information, which the existing approaches heavily rely on, causes the problem of attention drift. To suppress this side effect, we propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. Specifically, it contains a position aware module to enable the encoder to output feature vectors encoding their own spatial positions, and an attention module to estimate glimpses using the positional clue (i.e., the current decoding time step) only. The dynamic fusion is conducted via an element-wise gate mechanism to obtain a more robust feature. Theoretically, our proposed method, dubbed RobustScanner, decodes individual characters with a dynamic ratio between contextual and positional clues, relying more on positional clues when decoding sequences with scarce context, and is thus robust and practical. Empirically, it has achieved new state-of-the-art results on popular regular and irregular text recognition benchmarks without much performance drop on contextless benchmarks, validating its robustness in both contextual and contextless application scenarios."



Paperid:836
Authors:Jianzhu Guo, Xiangyu Zhu, Yang Yang, Fan Yang, Zhen Lei, Stan Z. Li
Title: Towards Fast, Accurate and Stable 3D Dense Face Alignment
Abstract:
Existing methods of 3D dense face alignment mainly concentrate on accuracy, thus limiting the scope of their practical applications. In this paper, we propose a novel regression framework which makes a balance among speed, accuracy and stability. Firstly, on the basis of a lightweight backbone, we propose a meta-joint optimization strategy to dynamically regress a small set of 3DMM parameters, which greatly enhances speed and accuracy simultaneously. To further improve the stability on videos, we present a virtual synthesis method to transform one still image into a short video that incorporates in-plane and out-of-plane face movement. On the premise of high accuracy and stability, our model runs at over 50fps on a single CPU core and outperforms other state-of-the-art heavy models simultaneously. Experiments on several challenging datasets validate the efficiency of our method. The code and models will be available at \url{https://github.com/cleardusk/3DDFA_V2}."



Paperid:837
Authors:Tai-Yin Chiu, Danna Gurari
Title: Iterative Feature Transformation for Fast and Versatile Universal Style Transfer
Abstract:
The general framework for fast universal style transfer consists of an autoencoder and a feature transformation at the bottleneck. We propose a new transformation that iteratively stylizes features with analytical gradient descent. Experiments show this transformation is advantageous in part because it is fast. With control knobs to balance content preservation and style effect transferal, we also show this method can switch between artistic and photo-realistic style transfers and reduce distortion and artifacts. Finally, we show it can be used for applications requiring spatial control and multiple-style transfer. "
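
The abstract describes iteratively stylizing features with analytical gradient descent. The sketch below is a heavily simplified stand-in that takes a few explicit gradient steps matching the channel-wise mean and standard deviation of the style features; the paper's actual objective and its analytical gradient differ, so treat this only as an illustration of iterative feature stylization at the bottleneck.

import torch

def iterative_stylize(content_feat, style_feat, steps=5, lr=0.5):
    """content_feat, style_feat: (C, N) feature matrices from the encoder."""
    f = content_feat.detach().clone().requires_grad_(True)
    s_mean, s_std = style_feat.mean(dim=1), style_feat.std(dim=1)
    for _ in range(steps):
        # simplified style objective: match channel-wise statistics
        loss = ((f.mean(dim=1) - s_mean) ** 2).sum() + \
               ((f.std(dim=1) - s_std) ** 2).sum()
        grad, = torch.autograd.grad(loss, f)
        f = (f - lr * grad).detach().requires_grad_(True)   # one gradient step
    return f.detach()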



Paperid:838
Authors:Xin Chen, Yawen Duan, Zewei Chen, Hang Xu, Zihao Chen, Xiaodan Liang, Tong Zhang, Zhenguo Li
Title: CATCH: Context-based Meta Reinforcement Learning for Transferrable Architecture Search
Abstract:
Neural Architecture Search (NAS) achieved many breakthroughs in recent years. In spite of its remarkable progress, many algorithms are restricted to particular search spaces. They also lack efficient mechanisms to reuse knowledge when confronting multiple tasks. These challenges preclude their applicability, and motivate our proposal of CATCH, a novel Context-bAsed meTa reinforcement learning (RL) algorithm for transferrable arChitecture searcH. The combination of meta-learning and RL allows CATCH to efficiently adapt to new tasks while being agnostic to search spaces. CATCH utilizes a probabilistic encoder to encode task properties into latent context variables, which then guide CATCH's controller to quickly "catch" top-performing networks. The contexts also assist a network evaluator in filtering inferior candidates and speed up learning. Extensive experiments demonstrate CATCH's universality and search efficiency over many other widely-recognized algorithms. It is also capable of handling cross-domain architecture search as competitive networks on ImageNet, COCO, and Cityscapes are identified. This is the first work to our knowledge that proposes an efficient transferrable NAS solution while maintaining robustness across various settings."



Paperid:839
Authors:Tan Yu, Yunfeng Cai, Ping Li
Title: Toward Faster and Simpler Matrix Normalization via Rank-1 Update
Abstract:
Bilinear pooling has achieved an impressive improvement in many computer vision tasks. Recent studies discover that matrix normalization is vital for improving the performance of bilinear pooling. Nevertheless, traditional matrix square-root or logarithm normalization methods need singular value decomposition (SVD), which is not well suited to the GPU platform, limiting its efficiency in training and inference. To boost the efficiency on the GPU platform, recent methods rely on Newton-Schulz (NS) iteration to approximate the matrix square-root. Although NS iteration is GPU-friendly, it requires several matrix-matrix multiplications, which is still expensive in computation. Moreover, NS iteration cannot well support the compact bilinear features obtained from tensor sketch or random projection. To overcome these limitations, we propose a rank-1 update normalization (RUN), which only needs matrix-vector multiplications and thus is significantly more efficient than NS iteration using matrix-matrix multiplications. The proposed RUN is also much simpler than NS iteration and thus much easier to implement in practice. Moreover, our RUN readily supports the normalization of compact bilinear features, making it more flexible to deploy compared with NS iteration. The proposed RUN is differentiable and can be plugged into a neural network for end-to-end training. Comprehensive experiments on four public benchmarks show that, for full bilinear pooling, the proposed RUN achieves comparable accuracies with a $330\times$ speedup over NS iteration. For compact bilinear pooling, our RUN achieves comparable accuracies with a $5400\times$ speedup over SVD-based normalization."



Paperid:840
Authors:Yuhi Kondo, Taishi Ono, Legong Sun, Yasutaka Hirasawa, Jun Murayama
Title: Accurate Polarimetric BRDF for Real Polarization Scene Rendering
Abstract:
Polarization has been used to solve many computer vision tasks such as Shape from Polarization (SfP). However, existing methods suffer from the ambiguity problems of polarization. To overcome such problems, some research works have suggested using Convolutional Neural Networks (CNNs). However, acquiring a large-scale dataset with polarization information is very difficult. If there is an accurate model that can describe the complicated phenomena of polarization, we can easily produce synthetic polarized images of various situations to train a CNN. In this paper, we propose a new polarimetric BRDF (pBRDF) model. We prove its accuracy by fitting our model to measured data under a variety of light and camera conditions. We render polarized images using this model and use them to estimate surface normals. Experiments show that the CNN trained on our polarized images is more accurate than one trained on RGB images only."



Paperid:841
Authors:Ilya Reshetouski, Hideki Oyaizu, Kenichiro Nakamura, Ryuta Satoh, Suguru Ushiki, Ryuichi Tadano, Atsushi Ito, Jun Murayama
Title: Lensless Imaging with Focusing Sparse URA Masks in Long-Wave Infrared and its Application for Human Detection
Abstract:
We introduce a lensless imaging framework for contemporary computer vision applications in long-wavelength infrared (LWIR). The framework consists of two parts: a novel lensless imaging method that utilizes the idea of local directional focusing for optimal binary sparse coding, and lensless imaging simulator based on Fresnel-Kirchhoff diffraction approximation. Our lensless imaging approach, besides being computationally efficient, is calibration-free and allows for wide FOV imaging. We employ our lensless imaging simulation software for optimizing reconstruction parameters and for synthetic images generation for CNN training. We demonstrate the advantages of our framework on a dual-camera system (RGB-LWIR lensless), where we perform CNN-based human detection using the fused RGB-LWIR data."



Paperid:842
Authors:Xiaoyu Tao, Xinyuan Chang, Xiaopeng Hong, Xing Wei, Yihong Gong
Title: Topology-Preserving Class-Incremental Learning
Abstract:
A well-known issue for class-incremental learning is the catastrophic forgetting phenomenon, where the network's recognition performance on old classes degrades severely when incrementally learning new classes. To alleviate forgetting, we propose to preserve old-class knowledge by maintaining the topology of the network's feature space. On this basis, we propose a novel topology-preserving class-incremental learning (TPCIL) framework. TPCIL uses an elastic Hebbian graph (EHG) to model the feature space topology, which is constructed with the competitive Hebbian learning rule. To maintain the topology, we develop the topology-preserving loss (TPL) that penalizes the changes of EHG's neighborhood relationships during incremental learning phases. Comprehensive experiments on CIFAR100, ImageNet, and subImageNet datasets demonstrate the power of TPCIL for continuously learning new classes with less forgetting. The code will be released."
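
A hedged sketch of a topology-preserving penalty in this spirit: for each graph vertex, the distribution of similarities to its neighbours under the new feature extractor should match the one recorded with the old extractor. The exact form of the published TPL may differ; the cosine similarity, temperature, and KL formulation here are assumptions that only illustrate the idea.

import torch
import torch.nn.functional as F

def topology_preserving_loss(old_vertices, new_vertices, edges, tau=0.5):
    """old/new_vertices: (V, D) vertex features under the old and new models;
    edges: list of neighbour index lists, one per vertex."""
    loss = 0.0
    for v, nbrs in enumerate(edges):
        if not nbrs:
            continue
        idx = torch.tensor(nbrs)
        old_sim = F.cosine_similarity(old_vertices[v:v + 1], old_vertices[idx]) / tau
        new_sim = F.cosine_similarity(new_vertices[v:v + 1], new_vertices[idx]) / tau
        # penalize changes in each vertex's neighbourhood relationships
        loss = loss + F.kl_div(new_sim.log_softmax(dim=0),
                               old_sim.softmax(dim=0), reduction="sum")
    return loss / len(edges)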



Paperid:843
Authors:Xiaolin Zhang, Yunchao Wei, Yi Yang
Title: Inter-Image Communication for Weakly Supervised Localization
Abstract:
Weakly supervised localization aims at finding target object regions using only image-level supervision. However, localization maps extracted from classification networks are often not accurate due to the lack of fine pixel-level supervision. In this paper, we propose to leverage pixel-level similarities across different objects for learning more accurate object locations in a complementary way. Particularly, two kinds of constraints are proposed to prompt the consistency of object features within the same categories. The first constraint is to learn the stochastic feature consistency among discriminative pixels that are randomly sampled from different images within a batch. The discriminative information embedded in one image can be leveraged to benefit its counterpart with inter-image communication. The second constraint is to learn the global consistency of object features throughout the entire dataset. We learn a feature center for each category and realize the global feature consistency by forcing the object features to approach class-specific centers. The global centers are actively updated with the training process. The two constraints can benefit each other to learn consistent pixel-level features within the same categories, and finally improve the quality of localization maps. We conduct extensive experiments on two popular benchmarks, i.e., ILSVRC and CUB-200-2011. Our method achieves the Top-1 localization error rate of 45.17% on the ILSVRC validation set, surpassing the current state-of-the-art method by a large margin. The code is available at https://github.com/xiaomengyc/I2C."
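
A small sketch of the global-consistency idea described above: keep one running feature center per class and pull each image's object feature toward its class center, while the centers are actively updated with a momentum rule. The momentum value and the MSE form of the loss are illustrative assumptions.

import torch
import torch.nn.functional as F

class ClassCenters:
    """Running per-class feature centers shared across the whole dataset."""
    def __init__(self, num_classes, dim, momentum=0.9):
        self.centers = torch.zeros(num_classes, dim)
        self.m = momentum

    @torch.no_grad()
    def update(self, feats, labels):
        self.centers = self.centers.to(feats.device)
        for c in labels.unique():
            mean_c = feats[labels == c].mean(dim=0)
            self.centers[c] = self.m * self.centers[c] + (1 - self.m) * mean_c

    def consistency_loss(self, feats, labels):
        # pull each object feature toward its (fixed) class-specific center
        return F.mse_loss(feats, self.centers.to(feats.device)[labels])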



Paperid:844
Authors:Zhongzheng Ren, Zhiding Yu, Xiaodong Yang, Ming-Yu Liu, Alexander G. Schwing, Jan Kautz
Title: UFO²: A Unified Framework towards Omni-supervised Object Detection
Abstract:
Existing work on object detection often relies on a single form of annotation: the model is trained using either accurate yet costly bounding boxes or cheaper but less expressive image-level tags. However, real-world annotations are often diverse in form, which challenges these existing works. In this paper, we present UFO$^2$, a unified object detection framework that can handle different forms of supervision simultaneously. Specifically, UFO$^2$ incorporates strong supervision (e.g., boxes), various forms of partial supervision (e.g., class tags, points, and scribbles), and unlabeled data. Through rigorous evaluations, we demonstrate that each form of label can be utilized to either train a model from scratch or to further improve a pre-trained model. We also use UFO$^2$ to investigate budget-aware omni-supervised learning, i.e., various annotation policies are studied under a fixed annotation budget: we show that competitive performance needs no strong labels for all data. Finally, we demonstrate the generalization of UFO$^2$, detecting more than 1,000 different objects without bounding box annotations. Code, models, and more details are available on the project page: https://github.com/NVlabs/wetectron."



Paperid:845
Authors:Dahuin Jung, Jonghyun Lee, Jihun Yi, Sungroh Yoon
Title: iCaps: An Interpretable Classifier via Disentangled Capsule Networks
Abstract:
We propose an interpretable Capsule Network, iCaps, for image classification. A capsule is a group of neurons nested inside each layer, and the one in the last layer is called a class capsule, which is a vector whose norm indicates a predicted probability for the class. Using the class capsule, existing Capsule Networks already provide some level of interpretability. However, there are two limitations which degrade its interpretability: 1) the class capsule also includes classification-irrelevant information, and 2) entities represented by the class capsule overlap. In this work, we address these two limitations using a novel class-supervised disentanglement algorithm and an additional regularizer, respectively. Through quantitative and qualitative evaluations on three datasets, we demonstrate that the resulting classifier, iCaps, provides a prediction along with clear rationales behind it with no performance degradation."



Paperid:846
Authors:Ethan Weber, Nuria Marzo, Dim P. Papadopoulos, Aritro Biswas, Agata Lapedriza, Ferda Ofli, Muhammad Imran, Antonio Torralba
Title: Detecting Natural Disasters, Damage, and Incidents in the Wild
Abstract:
Responding to natural disasters, such as earthquakes, floods, and wildfires, is a laborious task performed by on-the-ground emergency responders and analysts. Social media has emerged as a low-latency data source to quickly understand disaster situations. While most studies on social media are limited to text, images offer more information for understanding disaster and incident scenes. However, no large-scale image dataset for incident detection exists. In this work, we present the Incidents Dataset, which contains 446,684 images annotated by humans that cover 43 incidents across a variety of scenes. We employ a baseline classification model that mitigates false-positive errors and we perform image filtering experiments on millions of social media images from Flickr and Twitter. Through these experiments, we show how the Incidents Dataset can be used to detect images with incidents in the wild. Code, data, and models are available online at http://incidentsdataset.csail.mit.edu."



Paperid:847
Authors:Yinpeng Chen, Xiyang Dai, Mengchen Liu, Dongdong Chen, Lu Yuan, Zicheng Liu
Title: Dynamic ReLU
Abstract:
Rectified linear units (ReLU) are commonly used in deep neural networks. So far ReLU and its generalizations (non-parametric or parametric) are static, performing identically for all input samples. In this paper, we propose dynamic ReLU (DY-ReLU), a dynamic rectifier whose parameters are generated by a hyper function over all input elements. The key insight is that DY-ReLU encodes the global context into the hyper function and adapts the piecewise linear activation function accordingly. Compared to its static counterpart, DY-ReLU has negligible extra computational cost, but significantly more representation capability, especially for light-weight neural networks. By simply using DY-ReLU for MobileNetV2, the top-1 accuracy on ImageNet classification is boosted from 72.0% to 76.2% with only 5% additional FLOPs."
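A minimal PyTorch sketch of the mechanism described above: a hyper network over the globally pooled input generates per-channel slopes and intercepts of a piecewise-linear activation, and the output takes the maximum over the K linear branches. The module name, the squeeze-and-excitation-style hyper network, and the initialization are illustrative assumptions, not the authors' exact design.

    import torch
    import torch.nn as nn

    class DyReLUSketch(nn.Module):
        """Sketch of a DY-ReLU-style activation: a hyper network over the globally
        pooled input generates per-channel slopes/intercepts of a piecewise-linear
        function; the output is the max over the K linear branches."""
        def __init__(self, channels, reduction=8, k=2):
            super().__init__()
            self.k = k
            self.hyper = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, 2 * k * channels),
                nn.Sigmoid(),                       # squashed to [0, 1], re-centred below
            )
            # initial coefficients: first branch ~ identity, the others ~ zero
            self.register_buffer("init_a", torch.tensor([1.0] + [0.0] * (k - 1)))
            self.register_buffer("init_b", torch.zeros(k))

        def forward(self, x):                       # x: (B, C, H, W)
            b, c, _, _ = x.shape
            ctx = x.mean(dim=(2, 3))                # global context, (B, C)
            theta = self.hyper(ctx).view(b, c, 2, self.k)
            a = self.init_a + (theta[:, :, 0] - 0.5)           # slopes, (B, C, K)
            bias = self.init_b + (theta[:, :, 1] - 0.5) * 0.5  # intercepts, (B, C, K)
            y = x.unsqueeze(-1) * a.view(b, c, 1, 1, self.k) + bias.view(b, c, 1, 1, self.k)
            return y.max(dim=-1).values             # max over the K linear pieces

For instance, `DyReLUSketch(64)` could stand in for a plain ReLU after a 64-channel convolution.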



Paperid:848
Authors:Kohei Sakai, Keita Takahashi, Toshiaki Fujii, Hajime Nagahara
Title: Acquiring Dynamic Light Fields through Coded Aperture Camera
Abstract:
We investigate the problem of compressive acquisition of a dynamic light field. A promising solution for compressive light field acquisition is to use a coded aperture camera, with which an entire light field can be computationally reconstructed from several images captured through differently-coded aperture patterns. With this method, it was assumed that the scene should not move throughout the complete acquisition process, which restricted real applications. In this study, however, we assume that the target scene may change over time, and propose a method for acquiring a dynamic light field (a moving scene) using a coded aperture camera and a convolutional neural network (CNN). To successfully handle scene motions, we develop a new configuration of image observation, called V-shape observation, and train the CNN using a dynamic-light-field dataset with pseudo motions. Our method is validated through experiments using both a computer-generated scene and a real camera."



Paperid:849
Authors:Chi Xu, Yasushi Makihara, Xiang Li, Yasushi Yagi, Jianfeng Lu
Title: Gait Recognition from a Single Image using a Phase-Aware Gait Cycle Reconstruction Network
Abstract:
We propose, for the first time, a method for gait recognition from a single image, which enables latency-free gait recognition. To mitigate large intra-subject variations caused by a phase (gait pose) difference between a matching pair of input single images, we first reconstruct full gait cycles of image sequences from the single images using an auto-encoder framework, and then feed them into a state-of-the-art gait recognition network for matching. Specifically, a phase estimation network is introduced for the input single image, and the gait cycle reconstruction network exploits the estimated phase to mitigate the dependence of an encoded feature on the phase of that single image. This is called the phase-aware gait cycle reconstructor (PA-GCR). In the training phase, the PA-GCR and recognition network are simultaneously optimized to achieve a good trade-off between reconstruction and recognition accuracies. Experiments on three gait datasets demonstrate the significant performance improvement of this method."



Paperid:850
Authors:Jie Cao, Huaibo Huang, Yi Li, Ran He, Zhenan Sun
Title: Informative Sample Mining Network for Multi-Domain Image-to-Image Translation
Abstract:
The performance of multi-domain image-to-image translation has been significantly improved by recent progress in deep generative models. Existing approaches can use a unified model to achieve translations between all the visual domains. However, their outcomes are far from satisfying when there are large domain variations. In this paper, we reveal that improving the sample selection strategy is an effective solution. To select informative samples, we dynamically estimate sample importance during the training of Generative Adversarial Networks, presenting Informative Sample Mining Network. We theoretically analyze the relationship between the sample importance and the prediction of the global optimal discriminator. Then a practical importance estimation function for general conditions is derived. In addition, we propose a novel multi-stage sample training scheme to reduce sample hardness while preserving sample informativeness. Extensive experiments on a wide range of specific image-to-image translation tasks are conducted, and the results demonstrate our superiority over current state-of-the-art methods."



Paperid:851
Authors:Yuke Zhu, Yan Bai, Yichen Wei
Title: Spherical Feature Transform for Deep Metric Learning
Abstract:
Data augmentation in feature space is effective to increase data diversity. Previous methods assume that different classes have the same covariance in their feature distributions. Thus, feature transform between different classes is performed via translation. However, this approach is no longer valid for recent deep metric learning scenarios, where feature normalization is widely adopted and all features lie on a hypersphere. This work proposes a novel spherical feature transform approach. It relaxes the assumption of identical covariance between classes to an assumption of similar covariances of different classes on a hypersphere. Consequently, the feature transform is performed by a rotation that respects the spherical data distributions. We provide a simple and effective training method, and an in-depth analysis of the relation between the two different transforms. Comprehensive experiments on various deep metric learning benchmarks and different baselines verify that our method achieves consistent performance improvement and state-of-the-art results."
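To make the rotation-based transform concrete, here is a small sketch under the assumption that class centres and features are unit-normalized: the rotation maps one class centre onto another and acts as the identity on the orthogonal complement of their plane. Function names and interfaces are hypothetical, not the paper's API.

    import torch
    import torch.nn.functional as F

    def rotation_between(u, v, eps=1e-7):
        """Rotation matrix mapping unit vector u onto unit vector v; identity on
        the subspace orthogonal to span{u, v}."""
        u, v = F.normalize(u, dim=0), F.normalize(v, dim=0)
        cos = torch.clamp(u @ v, -1.0 + eps, 1.0 - eps)
        w = F.normalize(v - cos * u, dim=0)     # in-plane direction orthogonal to u
        sin = torch.sqrt(1.0 - cos ** 2)
        eye = torch.eye(u.numel(), device=u.device)
        return (eye
                + (cos - 1.0) * (torch.outer(u, u) + torch.outer(w, w))
                + sin * (torch.outer(w, u) - torch.outer(u, w)))

    def spherical_transfer(features, center_src, center_dst):
        """Move hypersphere features near one class centre towards another class
        centre by a rotation rather than a translation (illustration only)."""
        R = rotation_between(center_src, center_dst)
        return F.normalize(features @ R.T, dim=1)   # features: (N, D), rows rotated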



Paperid:852
Authors:Ruixue Tang, Chao Ma, Wei Emma Zhang, Qi Wu, Xiaokang Yang
Title: Semantic Equivalent Adversarial Data Augmentation for Visual Question Answering
Abstract:
Visual Question Answering (VQA) has achieved great success thanks to the fast development of deep neural networks (DNN). On the other hand, data augmentation, as one of the major tricks for DNN, has been widely used in many computer vision tasks. However, there are few works studying the data augmentation problem for VQA and none of the existing image-based augmentation schemes (such as rotation and flipping) can be directly applied to VQA due to its semantic structure -- an $\langle image, question, answer \rangle$ triplet needs to be maintained correctly. For example, a direction related Question-Answer (QA) pair may not be true if the associated image is rotated or flipped. In this paper, instead of directly manipulating images and questions, we use generated adversarial examples for both images and questions as the augmented data. The augmented examples do not change the visual properties presented in the image or the \textbf{semantic} meaning of the question, so the correctness of the $\langle image, question, answer \rangle$ triplet is still maintained. We then use adversarial learning to train a classic VQA model (BUTD) with our augmented data. We find that we not only improve the overall performance on VQAv2, but also can withstand adversarial attack effectively, compared to the baseline model. The source code is available at https://github.com/zaynmi/seada-vqa."
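A hedged sketch of the image-side augmentation only: a single FGSM-style step produces an adversarial image whose perturbation is assumed small enough to preserve the $\langle image, question, answer \rangle$ semantics. The `vqa_model(image, question) -> logits` interface, the single-step attack, and `epsilon` are assumptions; the paper also perturbs the question, which is not shown here.

    import torch
    import torch.nn.functional as F

    def adversarial_image_augment(vqa_model, image, question, answer, epsilon=1e-2):
        """One FGSM-style step on the image (hypothetical interface:
        vqa_model(image, question) -> logits, image intensities in [0, 1]).
        The perturbed image is used as an extra training sample."""
        image = image.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(vqa_model(image, question), answer)
        grad, = torch.autograd.grad(loss, image)
        return (image + epsilon * grad.sign()).clamp(0, 1).detach()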



Paperid:853
Authors:Ran Song, Wei Zhang, Yitian Zhao, Yonghuai Liu
Title: Unsupervised Multi-View CNN for Salient View Selection of 3D Objects and Scenes
Abstract:
We present an unsupervised 3D deep learning framework based on a ubiquitously true proposition, which we name view-object consistency, stating that a 3D object and its projected 2D views always belong to the same object class. To validate its effectiveness, we design a multi-view CNN instantiating it for the salient view selection of 3D objects, which, by its nature, cannot be handled by supervised learning due to the difficulty of collecting sufficient and consistent training data. Our unsupervised multi-view CNN branches off into two channels which encode the knowledge within each 2D view and the 3D object respectively, and also exploits both intra-view and inter-view knowledge of the object. It ends with a new loss layer which formulates the view-object consistency by impelling the two channels to generate consistent classification outcomes. We evaluate our method both qualitatively and quantitatively, demonstrating its superiority over several state-of-the-art methods. In addition, we showcase that our method can be used to select salient views of 3D scenes containing multiple objects."



Paperid:854
Authors:Yujie Zhong, Zelu Deng, Sheng Guo, Matthew R. Scott, Weilin Huang
Title: Representation Sharing for Fast Object Detector Search and Beyond
Abstract:
Region Proposal Network (RPN) provides strong support for handling the scale variation of objects in two-stage object detection. For one-stage detectors which do not have RPN, it is more demanding to have powerful sub-networks capable of directly capturing objects of unknown sizes. To enhance such capability, we propose an extremely efficient neural architecture search method, named Fast And Diverse (FAD), to better explore the optimal configuration of receptive fields and convolution types in the sub-networks for one-stage detectors. FAD consists of a designed search space and an efficient architecture search algorithm. The search space contains a rich set of diverse transformations designed specifically for object detection. To cope with the designed search space, a novel search algorithm termed Representation Sharing (RepShare) is proposed to effectively identify the best combinations of the defined transformations. In our experiments, FAD obtains prominent improvements on two types of one-stage detectors with various backbones. In particular, our FAD detector achieves 46.4 AP on MS-COCO (under single-scale testing), outperforming the state-of-the-art detectors, including the most recent NAS-based detectors, Auto-FPN [42] (searched for 16 GPU-days) and NAS-FCOS [39] (28 GPU-days), while significantly reducing the search cost to 0.6 GPU-days. Beyond object detection, we further demonstrate the generality of FAD on the more challenging instance segmentation task, and expect it to benefit more tasks."



Paperid:855
Authors:Lingteng Qiu, Xuanye Zhang, Yanran Li, Guanbin Li, Xiaojun Wu, Zixiang Xiong, Xiaoguang Han, Shuguang Cui
Title: Peeking into occluded joints: A novel framework for crowd pose estimation
Abstract:
Although occlusion widely exists in nature and remains a fundamental challenge for pose estimation, existing heatmap-based approaches suffer serious degradation under occlusion. Their intrinsic problem is that they directly localize the joints based on visual information; however, invisible joints lack such information. In contrast to localization, our framework estimates the invisible joints from an inference perspective by proposing an Image-Guided Progressive GCN module which provides a comprehensive understanding of both image context and pose structure. Moreover, existing benchmarks contain limited occlusions for evaluation. Therefore, we thoroughly pursue this problem and propose a novel OPEC-Net framework together with a new Occluded Pose (OCPose) dataset with 9k annotated images. Extensive quantitative and qualitative evaluations on benchmarks demonstrate that OPEC-Net achieves significant improvements over recent leading works. Notably, our OCPose is the most complex occlusion dataset with respect to average IoU between adjacent instances. Source code and OCPose will be publicly available."



Paperid:856
Authors:Linxi Fan, Shyamal Buch, Guanzhi Wang, Ryan Cao, Yuke Zhu, Juan Carlos Niebles, Li Fei-Fei
Title: RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition
Abstract:
Video action recognition is a complex task dependent on modeling spatial and temporal context. Standard approaches rely on 2D or 3D convolutions to process such context, resulting in expensive operations with millions of parameters. Recent efficient architectures leverage a channel-wise shift-based primitive as a replacement for temporal convolutions, but remain bottlenecked by spatial convolution operations to maintain strong accuracy and a fixed-shift scheme. Naively extending such developments to a 3D setting is a difficult, intractable goal. To this end, we introduce RubiksNet, a new efficient architecture for video action recognition which is based on a proposed learnable 3D spatiotemporal shift operation instead. We analyze the suitability of our new primitive for video action recognition and explore several novel variations of our approach to enable stronger representational flexibility while maintaining an efficient design. We benchmark our approach on several standard video recognition datasets, and observe that our method achieves comparable or better accuracy than prior work on efficient video action recognition at a fraction of the performance cost, with 2.9-5.9x fewer parameters and 2.1-3.7x fewer FLOPs. We also perform a series of controlled ablation studies to verify that our significant boost in the efficiency-accuracy tradeoff curve is rooted in the core contributions of our RubiksNet architecture."



Paperid:857
Authors:Ziwei Wang, Quan Zheng, Jiwen Lu, Jie Zhou
Title: Deep Hashing with Active Pairwise Supervision
Abstract:
In this paper, we propose a Deep Hashing method with Active Pairwise Supervision (DH-APS). Conventional methods with passive pairwise supervision obtain labeled data for training and require a large amount of annotations to reach their full potential, which is not feasible in realistic retrieval tasks. On the contrary, we actively select a small quantity of informative samples for annotation to provide effective pairwise supervision so that discriminative hash codes can be obtained with a limited annotation budget. Specifically, we generalize the structural risk minimization principle and obtain three criteria for the pairwise supervision acquisition: uncertainty, representativeness and diversity. Accordingly, samples involved in the following training pairs should be labeled: pairs with the most uncertain similarity, pairs that minimize the discrepancy between labeled and unlabeled data, and pairs which are most different from the annotated data, so that the discriminability and generalization ability of the learned hash codes are significantly strengthened. Moreover, our DH-APS can also be employed as a plug-and-play module for semi-supervised hashing methods to further enhance the performance. Experiments demonstrate that the presented DH-APS achieves the accuracy of supervised hashing methods with only $30\%$ labeled training samples and improves the semi-supervised binary codes by a sizable margin."



Paperid:858
Authors:Lichang Chen, Guosheng Lin, Shijie Wang, Qingyao Wu
Title: Graph Edit Distance Reward: Learning to Edit Scene Graph
Abstract:
Scene graphs, as a vital tool to bridge the gap between the language domain and the image domain, have been widely adopted in cross-modality tasks like VQA. In this paper, we propose a new method to edit a scene graph according to user instructions, which has never been explored. Specifically, based on the Policy Gradient and Graph Matching algorithms, we propose a Graph Edit Distance Reward to optimize a neural symbolic model so that it learns to edit scene graphs according to the semantics given by the text. In the context of text-editing image retrieval, we validate the effectiveness of our method on the CSS and CRIR datasets. Besides, CRIR is a new synthetic dataset generated by us, and we will publish it soon for future use."



Paperid:859
Authors:Yajie Xing, Jingbo Wang, Gang Zeng
Title: Malleable 2.5D Convolution: Learning Receptive Fields along the Depth-axis for RGB-D Scene Parsing
Abstract:
Depth data provide geometric information that can bring progress in RGB-D scene parsing tasks. Several recent works propose RGB-D convolution operators that construct receptive fields along the depth-axis to handle 3D neighborhood relations between pixels. However, these methods pre-define depth receptive fields by hyperparameters, making them rely on parameter selection. In this paper, we propose a novel operator called malleable 2.5D convolution to learn the receptive field along the depth-axis. A malleable 2.5D convolution has one or more 2D convolution kernels. Our method assigns each pixel to one of the kernels or to none of them according to their relative depth differences, and the assignment process is formulated in a differentiable form so that it can be learnt by gradient descent. The proposed operator runs on standard 2D feature maps and can be seamlessly incorporated into pre-trained CNNs. We conduct extensive experiments on two challenging RGB-D semantic segmentation datasets, NYUDv2 and Cityscapes, to validate the effectiveness and the generalization ability of our method."
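The following is an illustrative sketch of such a depth-conditioned assignment, written with `unfold`: each neighbour is softly assigned to one of K 2D kernels (or to a 'none' bucket) based on its relative depth to the centre pixel. The squared-distance scoring against learnable depth anchors and the softmax temperature are assumptions, not the paper's exact differentiable formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Malleable25DConvSketch(nn.Module):
        """Sketch of a malleable-2.5D-style convolution: neighbours are softly
        assigned to one of K 2D kernels (or to none) from their relative depth
        difference to the centre pixel."""
        def __init__(self, in_ch, out_ch, k=3, num_kernels=3, temperature=0.1):
            super().__init__()
            self.k, self.K, self.t = k, num_kernels, temperature
            self.weight = nn.Parameter(
                torch.randn(num_kernels, out_ch, in_ch, k * k) * 0.02)
            # learnable depth anchors for the K kernels, plus a "none" bucket cost
            self.anchors = nn.Parameter(torch.linspace(-0.5, 0.5, num_kernels))
            self.none_cost = nn.Parameter(torch.tensor(1.0))

        def forward(self, feat, depth):              # feat: (B,C,H,W), depth: (B,1,H,W)
            B, C, H, W = feat.shape
            pad = self.k // 2
            cols = F.unfold(feat, self.k, padding=pad)       # (B, C*k*k, H*W)
            dcols = F.unfold(depth, self.k, padding=pad)     # (B, k*k, H*W)
            rel = dcols - depth.view(B, 1, H * W)            # relative neighbour depth
            # soft assignment over K kernels plus a "none" option
            scores = -(rel.unsqueeze(1) - self.anchors.view(1, self.K, 1, 1)) ** 2
            none = -self.none_cost.expand(B, 1, self.k * self.k, H * W)
            assign = F.softmax(torch.cat([scores, none], dim=1) / self.t, dim=1)[:, :self.K]
            cols = cols.view(B, C, self.k * self.k, H * W)
            out = 0
            for i in range(self.K):                          # one matmul per kernel
                weighted = cols * assign[:, i].unsqueeze(1)  # (B, C, k*k, H*W)
                out = out + torch.einsum('ocn,bcnl->bol', self.weight[i], weighted)
            return out.view(B, -1, H, W)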



Paperid:860
Authors:Chang Shu, Kun Yu, Zhixiang Duan, Kuiyuan Yang
Title: Feature-metric Loss for Self-supervised Learning of Depth and Egomotion
Abstract:
Photometric loss is widely used for self-supervised depth and egomotion estimation. However, the loss landscapes induced by photometric differences are often problematic for optimization, caused by plateau landscapes for pixels in texture-less regions or multiple local minima for less discriminative pixels. In this work, a feature-metric loss is proposed and defined on feature representations, where the feature representation is also learned in a self-supervised manner and regularized by both first-order and second-order derivatives to constrain the loss landscapes to form proper convergence basins. Comprehensive experiments and detailed analysis via visualization demonstrate the effectiveness of the proposed feature-metric loss. In particular, our method improves state-of-the-art methods on KITTI from 0.885 to 0.925 measured by $\delta_1$ for depth estimation, and significantly outperforms previous methods for visual odometry."



Paperid:861
Authors:Sibei Yang, Guanbin Li, Yizhou Yu
Title: Propagating Over Phrase Relations for One-Stage Visual Grounding
Abstract:
Phrase level visual grounding aims to locate in an image the corresponding visual regions referred to by multiple noun phrases in a given sentence. Its challenge comes not only from large variations in visual contents and unrestricted phrase descriptions but also from unambiguous referrals derived from phrase relational reasoning. In this paper, we propose a linguistic structure guided propagation network for one-stage phrase grounding. It explicitly explores the linguistic structure of the sentence and performs relational propagation among noun phrases under the guidance of the linguistic relations between them. Specifically, we first construct a linguistic graph parsed from the sentence and then capture multimodal feature maps for all the phrasal nodes independently. The node features are then propagated over the edges with a tailor-designed relational propagation module and ultimately integrated for final prediction. Experiments on the Flickr30K Entities dataset show that our model outperforms state-of-the-art methods and demonstrate the effectiveness of propagating among phrases with linguistic relations."



Paperid:862
Authors:Yanrui Bin, Xuan Cao, Xinya Chen, Yanhao Ge, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Changxin Gao, Nong Sang
Title: Adversarial Semantic Data Augmentation for Human Pose Estimation
Abstract:
Human pose estimation is the task of localizing body keypoints from still images. The state-of-the-art methods suffer from insufficient examples of challenging cases such as symmetric appearance, heavy occlusion and nearby persons. To enlarge the number of challenging cases, previous methods augmented images by cropping and pasting image patches with weak semantics, which leads to unrealistic appearance and limited diversity. We instead propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularity. Furthermore, we propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict tailored pasting configurations. Given an off-the-shelf pose estimation network as the discriminator, the generator seeks the most confusing transformation to increase the loss of the discriminator, while the discriminator takes the generated sample as input and learns from it. The whole pipeline is optimized in an adversarial manner. State-of-the-art results are achieved on challenging benchmarks."



Paperid:863
Authors:Gernot Riegler, Vladlen Koltun
Title: Free View Synthesis
Abstract:
We present a method for novel view synthesis from input images that are freely distributed around a scene. Our method does not rely on a regular arrangement of input views, can synthesize images for free camera movement through the scene, and works for general scenes with unconstrained geometric layouts. We calibrate the input images via SfM and erect a coarse geometric scaffold via MVS. This scaffold is used to create a proxy depth map for a novel view of the scene. Based on this depth map, a recurrent encoder-decoder network processes reprojected features from nearby views and synthesizes the new view. Our network does not need to be optimized for a given scene. After training on a dataset, it works in previously unseen environments with no fine-tuning or per-scene optimization. We evaluate the presented approach on challenging real-world datasets, including Tanks and Temples, where we demonstrate successful view synthesis for the first time and substantially outperform prior and concurrent work."



Paperid:864
Authors:Ke-Yue Zhang, Taiping Yao, Jian Zhang, Ying Tai, Shouhong Ding, Jilin Li, Feiyue Huang, Haichuan Song, Lizhuang Ma
Title: Face Anti-Spoofing via Disentangled Representation Learning
Abstract:
Face anti-spoofing is crucial to the security of face recognition systems. Previous approaches focus on developing discriminative models based on the features extracted from images, which may still be entangled between spoof patterns and real persons. In this paper, motivated by disentangled representation learning, we propose a novel perspective on face anti-spoofing that disentangles the liveness features and content features from images, and the liveness features are further used for classification. We also put forward a CNN architecture with the process of disentanglement and a combination of low-level and high-level supervision to improve the generalization capabilities. We evaluate our method on public benchmark datasets and extensive experimental results demonstrate the effectiveness of our method against state-of-the-art competitors. Finally, we further visualize some results to help understand the effect and advantage of disentanglement."



Paperid:865
Authors:Youcai Zhang, Zhonghao Lan, Yuchen Dai, Fangao Zeng, Yan Bai, Jie Chang, Yichen Wei
Title: Prime-Aware Adaptive Distillation
Abstract:
Knowledge distillation (KD) aims to improve the performance of a student network by mimicking the knowledge from a powerful teacher network. Existing methods focus on studying what knowledge should be transferred and treat all samples equally during training. This paper introduces adaptive sample weighting to KD. We discover that previous effective hard mining methods are not appropriate for distillation. Furthermore, we propose Prime-Aware Adaptive Distillation (PAD) by incorporating uncertainty learning. PAD perceives the prime samples in distillation and then emphasizes their effect adaptively. PAD is fundamentally different from existing methods and can refine them with this innovative view of unequal training. For this reason, PAD is versatile and has been applied in various tasks including classification, metric learning, and object detection. With ten teacher-student combinations on six datasets, PAD promotes the performance of existing distillation methods and outperforms recent state-of-the-art methods."
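As a rough illustration of uncertainty-based sample weighting in distillation, the sketch below uses the common heteroscedastic formulation: a small head predicts a per-sample log-variance and the feature-matching residual is down-weighted for uncertain samples. The head architecture and the loss form are assumptions and may differ from the exact PAD formulation.

    import torch
    import torch.nn as nn

    class UncertaintyWeightedDistillation(nn.Module):
        """Heteroscedastic sketch of adaptive sample weighting for feature
        distillation: a head predicts a per-sample log-variance and the residual
        is weighted by the inverse variance."""
        def __init__(self, feat_dim):
            super().__init__()
            self.log_var_head = nn.Linear(feat_dim, 1)

        def forward(self, student_feat, teacher_feat):      # both (B, D)
            log_var = self.log_var_head(student_feat)       # (B, 1)
            sq_err = (student_feat - teacher_feat.detach()).pow(2).mean(dim=1, keepdim=True)
            return (0.5 * (torch.exp(-log_var) * sq_err + log_var)).mean()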



Paperid:866
Authors:Hongduan Tian, Bo Liu, Xiao-Tong Yuan, Qingshan Liu
Title: Meta-Learning with Network Pruning
Abstract:
Meta-learning is a powerful paradigm for few-shot learning. Despite the remarkable success witnessed in many applications, existing optimization-based meta-learning models with over-parameterized neural networks have been shown to overfit on training tasks. To remedy this deficiency, we propose a network pruning based meta-learning approach for overfitting reduction via explicitly controlling the capacity of the network. A uniform concentration analysis reveals the benefit of the capacity constraint for reducing the generalization gap of the proposed meta-learner. We have implemented our approach on top of Reptile assembled with two network pruning routines: Dense-Sparse-Dense (DSD) and Iterative Hard Thresholding (IHT). Extensive experimental results on benchmark datasets with different over-parameterized deep networks demonstrate that our method not only effectively alleviates meta-overfitting but also in many cases improves the overall generalization performance when applied to few-shot classification tasks."



Paperid:867
Authors:Dongsheng Guo, Hongzhi Liu, Haoru Zhao, Yunhao Cheng, Qingwei Song, Zhaorui Gu, Haiyong Zheng, Bing Zheng
Title: Spiral Generative Network for Image Extrapolation
Abstract:
In this paper, motivated by the natural human ability to imaginatively perceive unseen surroundings, we propose a novel Spiral Generative Network, SpiralNet, to perform image extrapolation in a spiral manner, which regards extrapolation as an evolution process growing from an input sub-image along a spiral curve to an expanded full image. Our SpiralNet, consisting of ImagineGAN and SliceGAN, disentangles the image extrapolation problem into two independent sub-tasks, semantic structure prediction (via ImagineGAN) and contextual detail generation (via SliceGAN), making the whole task more tractable. The design of SliceGAN implicitly harnesses the correlation between generated contents and the extrapolating direction, enabling divide-and-conquer, generation-by-parts synthesis. Extensive experiments on datasets covering both objects and scenes under different cases show that our method achieves state-of-the-art performance on image extrapolation. We also conduct an ablation study to validate the efficacy of our design. Our code is available at https://github.com/zhenglab/spiralnet."



Paperid:868
Authors:Fang Liu, Changqing Zou, Xiaoming Deng, Ran Zuo, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, Hongan Wang
Title: SceneSketcher: Fine-Grained Image Retrieval with Scene Sketches
Abstract:
Sketch-based image retrieval (SBIR) has been a popular research topic in recent years. Existing works concentrate on mapping the visual information of sketches and images to a semantic space at the object level. In this paper, for the first time, we study the fine-grained scene-level SBIR problem which aims at retrieving scene images satisfying the user's specific requirements via a freehand scene sketch. We propose a graph embedding based method to learn the similarity measurement between images and scene sketches, which models the multi-modal information, including the size and appearance of objects as well as their layout information, in an effective manner. To evaluate our approach, we collect a dataset based on SketchyCOCO and extend the dataset using COCO-Stuff. Comprehensive experiments demonstrate the significant potential of the proposed approach on the application of fine-grained scene-level image retrieval."



Paperid:869
Authors:Junbum Cha, Sanghyuk Chun, Gayoung Lee, Bado Lee, Seonghyeon Kim, Hwalsuk Lee
Title: Few-shot Compositional Font Generation with Dual Memory
Abstract:
Generating a new font library is a very labor-intensive and time-consuming job for glyph-rich scripts. Despite the remarkable success of existing font generation methods, they have significant drawbacks; they require a large number of reference images to generate a new font set, or they fail to capture detailed styles with only a few samples. In this paper, we focus on compositional scripts, a widely used letter system in the world, where each glyph can be decomposed by several components. By utilizing the compositionality of compositional scripts, we propose a novel font generation framework, named Dual Memory-augmented Font Generation Network (DM-Font), which enables us to generate a high-quality font library with only a few samples. We employ memory components and global-context awareness in the generator to take advantage of the compositionality. In the experiments on Korean-handwriting fonts and Thai-printing fonts, we observe that our method generates a significantly better quality of samples with faithful stylization compared to the state-of-the-art generation methods quantitatively and qualitatively. Source code is available at https://github.com/clovaai/dmfont."



Paperid:870
Authors:Yue Qian, Junhui Hou, Sam Kwong, Ying He
Title: PUGeo-Net: A Geometry-centric Network for 3D Point Cloud Upsampling
Abstract:
In this paper, we propose a novel deep neural network based method, called PUGeo-Net, for upsampling 3D point clouds. PUGeo-Net incorporates discrete differential geometry into deep learning elegantly by learning the first and second fundamental forms that are able to fully represent the local geometry unique up to rigid motion. Specifically, we encode the first fundamental form in a $3 \times 3$ linear transformation matrix $\mathbf{T}$ for each input point. Such a matrix approximates the augmented Jacobian matrix of a local parameterization that encodes the intrinsic information and builds a one-to-one correspondence between the 2D parametric domain and the 3D tangent plane, so that we can lift the adaptively distributed 2D samples learned from the input to 3D space. After that, we use the learned second fundamental form to compute a normal displacement for each generated sample and project it to the curved surface. As a by-product, PUGeo-Net can compute normals for the original and generated points, which is highly desired for surface reconstruction algorithms. We evaluate PUGeo-Net on a wide range of 3D models with sharp features and rich geometric details and observe that PUGeo-Net consistently outperforms state-of-the-art methods in terms of both accuracy and efficiency for upsampling factor 4 - 16. We also verify the geometry-centric nature of PUGeo-Net quantitatively. In addition, PUGeo-Net can handle noisy and non-uniformly distributed inputs well, validating its robustness. The code is publicly available at https://github.com/ninaqy/PUGeo."
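A small sketch of the lifting step described above, with assumed tensor layouts: learned 2D parametric samples are mapped through the per-point $3 \times 3$ transform into the tangent plane and then displaced along a coarse normal. The names and shapes below are illustrative, not the released code's interface.

    import torch

    def lift_samples(points, T, uv, delta):
        """Lift learned 2D parametric samples to 3D around each input point.
        points: (B, N, 3)    input points
        T:      (B, N, 3, 3) learned per-point linear transforms
        uv:     (B, N, R, 2) learned 2D samples per point (R = upsampling ratio)
        delta:  (B, N, R, 1) learned normal displacements per generated sample"""
        B, N, R, _ = uv.shape
        # embed each 2D sample as (u, v, 0) and map it by T into the tangent plane
        uv0 = torch.cat([uv, torch.zeros(B, N, R, 1, device=uv.device)], dim=-1)
        tangent_offsets = torch.einsum('bnij,bnrj->bnri', T, uv0)        # (B, N, R, 3)
        # coarse normal: image of the parametric-domain normal (0, 0, 1) under T
        normal = torch.nn.functional.normalize(T[..., :, 2], dim=-1)     # (B, N, 3)
        coarse = points.unsqueeze(2) + tangent_offsets                   # (B, N, R, 3)
        refined = coarse + delta * normal.unsqueeze(2)                   # push onto surface
        return refined.reshape(B, N * R, 3)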



Paperid:871
Authors:Luca Cavalli, Viktor Larsson, Martin Ralf Oswald, Torsten Sattler, Marc Pollefeys
Title: Handcrafted Outlier Detection Revisited
Abstract:
Local feature matching is a critical part of many computer vision pipelines, including among others Structure-from-Motion, SLAM, and Visual Localization. However, due to limitations in the descriptors, raw matches are often contaminated by a majority of outliers. As a result, outlier detection is a fundamental problem in computer vision and a wide range of approaches, from simple checks based on descriptor similarity to geometric verification, have been proposed over the last decades. In recent years, deep learning-based approaches to outlier detection have become popular. Unfortunately, the corresponding works rarely compare with strong classical baselines. In this paper we revisit handcrafted approaches to outlier filtering. Based on best practices, we propose a hierarchical pipeline for effective outlier detection as well as integrate novel ideas which in sum lead to an efficient and competitive approach to outlier rejection. We show that our approach, although not relying on learning, is more than competitive to both recent learned works as well as handcrafted approaches, both in terms of efficiency and effectiveness. The code is available at https://github.com/cavalli1234/AdaLAM."



Paperid:872
Authors:Luca Cosmo, Giorgia Minello, Michael Bronstein, Luca Rossi, Andrea Torsello
Title: The Average Mixing Kernel Signature
Abstract:
We introduce the Average Mixing Kernel Signature (AMKS), a novel signature for points on non-rigid three-dimensional shapes based on the average mixing kernel and continuous-time quantum walks. The average mixing kernel holds information on the average transition probabilities of a quantum walk between each pair of vertices of the mesh until a time $T$. We define the AMKS by decomposing the spectral contributions of the kernel into several bands, allowing us to limit the influence of noise-dominated high-frequency components and obtain a more descriptive signature. We also show through a perturbation theory analysis of the kernel that choosing a finite stopping time $T$ leads to noise and deformation robustness for the AMKS. We perform an extensive experimental evaluation on two widely used shape matching datasets under varying level of noise, showing that the AMKS outperforms two state-of-the-art descriptors, namely the Heat Kernel Signature (HKS) and the similarly quantum-walk based Wave Kernel Signature (WKS)."



Paperid:873
Authors:Boyi Jiang, Juyong Zhang, Yang Hong, Jinhao Luo, Ligang Liu, Hujun Bao
Title: BCNet: Learning Body and Cloth Shape from A Single Image
Abstract:
In this paper, we consider the problem of automatically reconstructing garment and body shapes from a single near-front view RGB image. To this end, we propose a layered garment representation on top of SMPL and, as a novelty, make the skinning weights of the garment independent of the body mesh, which significantly improves the expressiveness of our garment model. Compared with existing methods, our method can support more garment categories and recover more accurate geometry. To train our model, we construct two large-scale datasets with ground-truth body and garment geometries as well as paired color images. Compared with single-mesh or non-parametric representations, our method achieves more flexible control with separate meshes, making applications like re-posing, garment transfer, and garment texture mapping possible."



Paperid:874
Authors:Umer Rafi, Andreas Doering, Bastian Leibe, Juergen Gall
Title: Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos
Abstract:
Video annotation is expensive and time consuming. Consequently, datasets for multi-person pose estimation and tracking are less diverse and have sparser annotations compared to large-scale image datasets for human pose estimation. This makes it challenging to learn deep learning based models for associating keypoints across frames that are robust to nuisance factors such as motion blur and occlusions for the task of multi-person pose tracking. To address this issue, we propose an approach that relies on keypoint correspondences for associating persons in videos. Instead of training the network for estimating keypoint correspondences on video data, it is trained on large-scale image datasets for human pose estimation using self-supervision. Combined with a top-down framework for human pose estimation, we use keypoint correspondences to (i) recover missed pose detections and (ii) associate pose detections across video frames. Our approach achieves state-of-the-art results for temporal pose estimation and multi-person pose tracking on the PoseTrack 2017 and PoseTrack 2018 datasets."



Paperid:875
Authors:Jingwen He, Chao Dong, Yu Qiao
Title: Interactive Multi-Dimension Modulation with Dynamic Controllable Residual Learning for Image Restoration
Abstract:
Interactive image restoration aims to generate restored images by adjusting a controlling coefficient which determines the restoration level. Previous works are restricted to modulating images with a single coefficient. However, real images always contain multiple types of degradation, which cannot be well determined by one coefficient. To make a step forward, this paper presents a new problem setup, called multi-dimension (MD) modulation, which aims at modulating output effects across multiple degradation types and levels. Compared with the previous single-dimension (SD) modulation, the MD setup handles multiple degradations adaptively and relieves the unbalanced learning problem across different degradations. We also propose a deep architecture - CResMD with newly introduced controllable residual connections for multi-dimension modulation. Specifically, we add a controlling variable to the conventional residual connection to allow a weighted summation of the input and the residual. The values of these weights are generated by another condition network. We further propose a new data sampling strategy based on the beta distribution to balance different degradation types and levels. With a corrupted image and degradation information as inputs, the network outputs the corresponding restored image. By tweaking the condition vector, users can control the output effects in the MD space at test time. Extensive experiments demonstrate that the proposed CResMD achieves excellent performance on both SD and MD modulation tasks. Code is available at \url{https://github.com/hejingwenhejingwen/CResMD}."
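A minimal sketch of a controllable residual connection: the residual branch is scaled by channel-wise weights produced by a condition network from the degradation vector. Layer sizes and the sigmoid gating are illustrative choices, not necessarily those of CResMD.

    import torch
    import torch.nn as nn

    class ControllableResidualBlock(nn.Module):
        """Residual block whose residual branch is weighted by channel-wise
        coefficients generated from a condition/degradation vector."""
        def __init__(self, channels, cond_dim):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
            )
            self.condition = nn.Sequential(nn.Linear(cond_dim, channels), nn.Sigmoid())

        def forward(self, x, cond):                  # x: (B, C, H, W), cond: (B, cond_dim)
            alpha = self.condition(cond).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
            return x + alpha * self.body(x)          # weighted sum of input and residual

At test time, sweeping the entries of `cond` would modulate the restoration strength per degradation type.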



Paperid:876
Authors:Xubin Zhong, Changxing Ding, Xian Qu, Dacheng Tao
Title: Polysemy Deciphering Network for Human-Object Interaction Detection
Abstract:
Human-Object Interaction (HOI) detection is important in human-centric scene understanding. Existing works typically assume that the same verb in different HOI categories has similar visual characteristics, while ignoring the diverse semantic meanings of the verb. To address this issue, in this paper, we propose a novel Polysemy Deciphering Network (PD-Net), which decodes the visual polysemy of verbs for HOI detection in three ways. First, PD-Net augments human pose and spatial features for HOI detection using language priors, enabling the verb classifiers to receive language hints that reduce the intra-class variation of the same verb. Second, we introduce a novel Polysemy Attention Module (PAM) that guides PD-Net to make decisions based on more important feature types according to the language priors. Finally, the above two strategies are applied to two types of classifiers for verb recognition, i.e., object-shared and object-specific verb classifiers, whose combination further relieves the verb polysemy problem. By deciphering the visual polysemy of verbs, we achieve the best performance on both HICO-DET and V-COCO datasets. In particular, PD-Net outperforms state-of-the-art approaches by 3.81% mAP in the Known-Object evaluation mode of HICO-DET. Code of PD-Net will be released at https://github.com/MuchHair/PD-Net."



Paperid:877
Authors:Arthur Douillard, Matthieu Cord, Charles Ollion, Thomas Robert, Eduardo Valle
Title: PODNet: Pooled Outputs Distillation for Small-Tasks Incremental Learning
Abstract:
Lifelong learning has attracted much attention, but existing works still struggle to fight catastrophic forgetting and accumulate knowledge over long stretches of incremental learning. In this work, we propose PODNet, a model inspired by representation learning. By carefully balancing the compromise between remembering the old classes and learning new ones, PODNet fights catastrophic forgetting, even over very long runs of small incremental tasks, a setting so far unexplored by current works. PODNet innovates on existing art with an efficient spatial-based distillation-loss applied throughout the model and a representation comprising multiple proxy vectors for each class. We validate those innovations thoroughly, comparing PODNet with three state-of-the-art models on three datasets: CIFAR100, ImageNet100, and ImageNet1000. Our results showcase a significant advantage of PODNet over existing art, with accuracy gains of 12.10, 6.51, and 2.85 percentage points, respectively."
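As a rough sketch of a spatial pooled-outputs distillation term, the function below pools intermediate activation maps of the previous and the current model along height and width separately and matches the pooled profiles with an L2 loss; the squaring and normalization details are assumptions of this simplified version.

    import torch
    import torch.nn.functional as F

    def pooled_outputs_distillation(feats_old, feats_new):
        """Simplified spatial pooled-outputs distillation: squared activation maps
        of the old (frozen) and current model are pooled along width and height
        separately and matched after normalisation."""
        loss = 0.0
        for a, b in zip(feats_old, feats_new):              # lists of (B, C, H, W) maps
            a, b = a.detach().pow(2), b.pow(2)
            loss = loss + F.mse_loss(F.normalize(a.sum(dim=3), dim=-1),
                                     F.normalize(b.sum(dim=3), dim=-1))   # width pooled
            loss = loss + F.mse_loss(F.normalize(a.sum(dim=2), dim=-1),
                                     F.normalize(b.sum(dim=2), dim=-1))   # height pooled
        return loss / len(feats_old)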



Paperid:878
Authors:Francesca Pistilli, Giulia Fracastoro, Diego Valsesia, Enrico Magli
Title: Learning Graph-Convolutional Representations for Point Cloud Denoising
Abstract:
Point clouds are an increasingly relevant data type but they are often corrupted by noise. We propose a deep neural network based on graph-convolutional layers that can elegantly deal with the permutation-invariance problem encountered by learning-based point cloud processing methods. The network is fully-convolutional and can build complex hierarchies of features by dynamically constructing neighborhood graphs from similarity among the high-dimensional feature representations of the points. When coupled with a loss promoting proximity to the ideal surface, the proposed approach significantly outperforms state-of-the-art methods on a variety of metrics. In particular, it is able to improve in terms of Chamfer measure and of quality of the surface normals that can be estimated from the denoised data. We also show that it is especially robust both at high noise levels and in presence of structured noise such as the one encountered in real LiDAR scans."



Paperid:879
Authors:Dongkwon Jin, Jun-Tae Lee,  Chang-Su Kim
Title: Semantic Line Detection Using Mirror Attention and Comparative Ranking and Matching
Abstract:
A novel algorithm to detect semantic lines is proposed in this paper. We develop three networks: detection network with mirror attention (D-Net) and comparative ranking and matching networks (R-Net and M-Net). D-Net extracts semantic lines by exploiting rich contextual information. To this end, we design the mirror attention module. Then, through pairwise comparisons of extracted semantic lines, we iteratively select the most semantic line and remove redundant ones overlapping with the selected one. For the pairwise comparisons, we develop R-Net and M-Net in the Siamese architecture. Experiments demonstrate that the proposed algorithm outperforms the conventional semantic line detector significantly. Moreover, we apply the proposed algorithm to detect two important kinds of semantic lines successfully: dominant parallel lines and reflection symmetry axes."



Paperid:880
Authors:Marco Cannici, Marco Ciccone, Andrea Romanoni , Matteo Matteucci
Title: A Differentiable Recurrent Surface for Asynchronous Event-Based Data
Abstract:
Dynamic Vision Sensors (DVSs) asynchronously stream events in correspondence of pixels subject to brightness changes. Differently from classic vision devices, they produce a sparse representation of the scene. Therefore, to apply standard computer vision algorithms, events need to be integrated into a frame or event-surface. This is usually attained through hand-crafted grids that reconstruct the frame using ad-hoc heuristics. In this paper, we propose Matrix-LSTM, a grid of Long Short-Term Memory (LSTM) cells that efficiently process events and learn end-to-end task-dependent event-surfaces. Compared to existing reconstruction approaches, our learned event-surface shows good flexibility and expressiveness on optical flow estimation on the MVSEC benchmark and it improves the state-of-the-art of event-based object classification on the N-Cars dataset."



Paperid:881
Authors:Ruoyi Du, Dongliang Chang, Ayan Kumar Bhunia, Jiyang Xie, Zhanyu Ma , Yi-Zhe Song, Jun Guo
Title: Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches
Abstract:
Fine-grained visual classification (FGVC) is much more challenging than traditional classification tasks due to the inherently subtle intra-class object variations. Recent works mainly tackle this problem by focusing on how to locate the most discriminative parts, more complementary parts, and parts of various granularities. However, less effort has been devoted to which granularities are the most discriminative and how to fuse information across multiple granularities. In this work, we propose a novel framework for fine-grained visual classification to tackle these problems. In particular, we propose: (i) a novel progressive training strategy that adds new layers in each training step to exploit information based on the smaller granularity information found at the last step and the previous stage; (ii) a simple jigsaw puzzle generator to form images that contain information of different granularity levels. We obtain state-of-the-art performances on several standard FGVC benchmark datasets, where the proposed method consistently outperforms existing methods or delivers competitive results. Code is provided as part of the supplementary material, and will be publicly released upon acceptance."



Paperid:882
Authors:Tak-Wai Hui, Chen Change Loy
Title: LiteFlowNet3: Resolving Correspondence Ambiguity for More Accurate Optical Flow Estimation
Abstract:
Deep learning approaches have achieved great success in addressing the problem of optical flow estimation. The keys to success lie in the use of cost volume and coarse-to-fine flow inference. However, the matching problem becomes ill-posed when partially occluded or homogeneous regions exist in images. This causes a cost volume to contain outliers and affects the flow decoding from it. Besides, the coarse-to-fine flow inference demands an accurate flow initialization. Ambiguous correspondence yields erroneous flow fields and affects the flow inferences in subsequent levels. In this paper, we introduce LiteFlowNet3, a deep network consisting of two specialized modules, to address the above challenges. (1) We ameliorate the issue of outliers in the cost volume by amending each cost vector through an adaptive modulation prior to the flow decoding. (2) We further improve the flow accuracy by exploring local flow consistency. To this end, each inaccurate optical flow is replaced with an accurate one from a nearby position through a novel warping of the flow field. LiteFlowNet3 not only achieves promising results on public benchmarks but also has a small model size and a fast runtime."



Paperid:883
Authors:Valeriya Pronina, Filippos Kokkinos, Dmitry V. Dylov, Stamatios Lefkimmiatis
Title: Microscopy Image Restoration with Deep Wiener-Kolmogorov Filters
Abstract:
Microscopy is a powerful visualization tool in biology, enabling the study of cells, tissues, and fundamental biological processes; yet, the observed images typically suffer from blur and background noise. In this work, we propose a unifying framework of algorithms for Gaussian image deblurring and denoising. These algorithms are based on deep learning techniques for the design of learnable regularizers integrated into the Wiener-Kolmogorov filter. Our extensive experiments showcase that the proposed approach achieves superior image reconstruction quality and surpasses the solutions that rely either on deep learning or on optimization schemes alone. Augmented with the variance stabilizing transformation, the proposed reconstruction pipeline can also be successfully applied to the problem of Poisson image deblurring, surpassing the state-of-the-art methods. Moreover, several variants of the proposed framework demonstrate competitive performance at low computational complexity, which is of high importance for real-time imaging applications."
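For reference, a sketch of the classical Wiener-Kolmogorov filtering step that such learnable regularizers plug into, written in the Fourier domain with circular boundary conditions assumed; the training loop that learns the regularization filters is omitted.

    import torch

    def wiener_deconvolve(y, kernel, reg_filters, eps=1e-3):
        """One Wiener-Kolmogorov filtering step in the Fourier domain,
            x_hat = conj(K) * Y / (|K|^2 + sum_i |G_i|^2 + eps),
        where G_i are (possibly learned) regularisation filters.
        y: (H, W) blurred image, kernel: (h, w) blur kernel, reg_filters: (R, h, w)."""
        H, W = y.shape
        Y = torch.fft.rfft2(y)
        K = torch.fft.rfft2(kernel, s=(H, W))
        G = torch.fft.rfft2(reg_filters, s=(H, W))
        denom = K.abs() ** 2 + (G.abs() ** 2).sum(dim=0) + eps
        X = torch.conj(K) * Y / denom
        return torch.fft.irfft2(X, s=(H, W))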



Paperid:884
Authors:Dave Zhenyu Chen, Angel X. Chang, Matthias Nießner
Title: ScanRefer: 3D Object Localization in RGB-D Scans using Natural Language
Abstract:
We introduce the new task of 3D object localization in RGB-D scans using natural language descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free-form description of a specified target object. To address this task, we propose ScanRefer, where the core idea is to learn a fused descriptor from 3D object proposals and encoded sentence embeddings. This learned descriptor then correlates the language expressions with the underlying geometric features of the 3D scan and facilitates the regression of the 3D bounding box of the target object. In order to train and benchmark our method, we introduce a new ScanRefer dataset, containing 46,173 descriptions of 9,943 objects from 703 ScanNet scenes. ScanRefer is the first large-scale effort to perform object localization via natural language expression directly in 3D."



Paperid:885
Authors:Zeyu Hu, Mingmin Zhen, Xuyang Bai, Hongbo Fu, Chiew-lan Tai
Title: JSENet: Joint Semantic Segmentation and Edge Detection Network for 3D Point Clouds
Abstract:
Semantic segmentation and semantic edge detection can be seen as two dual problems with close relationships in computer vision. Despite the fast evolution of learning-based 3D semantic segmentation methods, little attention has been drawn to the learning of 3D semantic edge detectors, even less to a joint learning method for the two tasks. In this paper, we tackle the 3D semantic edge detection task for the first time and present a new two-stream fully-convolutional network that jointly performs the two tasks. In particular, we design a joint refinement module that explicitly wires region information and edge information to improve the performances of both tasks. Further, we propose a novel loss function that encourages the network to produce semantic segmentation results with better boundaries. Extensive evaluations on S3DIS and ScanNet datasets show that our method achieves on par or better performance than the state-of-the-art methods for semantic segmentation and outperforms the baseline methods for semantic edge detection. "



Paperid:886
Authors:Hu Zhang, Linchao Zhu, Yi Zhu, Yi Yang
Title: Motion-Excited Sampler: Video Adversarial Attack with Sparked Prior
Abstract:
Deep neural networks are known to be susceptible to adversarial noise, which is a tiny and imperceptible perturbation. Most previous works on adversarial attack mainly focus on image models, while the vulnerability of video models is less explored. In this paper, we aim to attack video models by utilizing the intrinsic movement pattern and regional relative motion among video frames. We propose an effective motion-excited sampler to obtain a motion-aware noise prior, which we term the sparked prior. Our sparked prior underlines frame correlations and utilizes video dynamics via relative motion. By using the sparked prior in gradient estimation, we can successfully attack a variety of video classification models with fewer queries. Extensive experimental results on four benchmark datasets validate the efficacy of our proposed method."
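A hedged sketch of black-box gradient estimation with a motion-derived prior: antithetic random directions are modulated by a crude frame-difference map before querying the model. The exact motion-excited sampler of the paper is more elaborate; the model interface and hyperparameters here are assumptions.

    import torch

    def estimate_gradient_with_motion_prior(model, video, label, sigma=1e-3, n_samples=8):
        """Query-based gradient estimate modulated by a frame-difference prior.
        video: (T, C, H, W); label: scalar LongTensor; model(video[None]) -> logits."""
        motion = torch.zeros_like(video)
        motion[1:] = (video[1:] - video[:-1]).abs()         # crude motion prior
        motion = motion / (motion.amax() + 1e-8)
        loss_fn = torch.nn.CrossEntropyLoss()
        grad = torch.zeros_like(video)
        for _ in range(n_samples):                          # antithetic finite differences
            u = torch.randn_like(video) * (motion + 0.1)    # motion-sparked direction
            with torch.no_grad():
                l_pos = loss_fn(model((video + sigma * u)[None]), label.view(1))
                l_neg = loss_fn(model((video - sigma * u)[None]), label.view(1))
            grad += (l_pos - l_neg) / (2.0 * sigma) * u
        return grad / n_samples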



Paperid:887
Authors:Ishant Shanu, Siddhant Bharti, Chetan Arora, S. N. Maheshwari
Title: An Inference Algorithm for Multi-Label MRF-MAP Problems with Clique Size 100
Abstract:
In this paper, we propose an algorithm for optimal solutions to submodular higher-order multi-label MRF-MAP energy functions which can handle practical computer vision problems with up to 16 labels and cliques of size 100. The algorithm uses a transformation which transforms a multi-label problem to a 2-label problem on a much larger clique. Earlier algorithms based on this transformation could not handle problems larger than 16 labels on cliques of size 4. The proposed algorithm optimizes the resultant 2-label problem using the submodular polyhedron based Min Norm Point algorithm. The task is challenging because the state space of the transformed problem has a very large number of invalid states. For polyhedral based algorithms the presence of invalid states poses a challenge as apart from numerical instability, the transformation also increases the dimension of the polyhedral space making the straightforward use of known algorithms impractical. The approach reported in this paper allows us to bypass the large costs associated with invalid configurations, resulting in a stable, practical, optimal and efficient inference algorithm that, in our experiments, gives high-quality outputs on problems like pixel-wise object segmentation and stereo matching."



Paperid:888
Authors:Baojie Fan, Wei Chen, Yang Cong, Jiandong Tian
Title: Dual Refinement Underwater Object Detection Network
Abstract:
Due to the complex underwater environment, underwater imaging often encounters problems such as blur, scale variation, color shift, and texture distortion. Generic detection algorithms cannot work well when applied directly to underwater scenes. To address these problems, we propose an underwater detection framework with feature enhancement and anchor refinement. It has a composite connection backbone to boost the feature representation and introduces a receptive field augmentation module to exploit multi-scale contextual features. The developed underwater object detection framework also provides a prediction refinement scheme over six prediction layers; it refines multi-scale features to better align with anchors by learning from offsets, which solves the problem of sample imbalance to a certain extent. We also construct a new underwater detection dataset, denoted as UWD, which has more than 10,000 train-val and test underwater images. The extensive experiments on PASCAL VOC and UWD demonstrate the favorable performance of the proposed underwater detection framework against state-of-the-art methods in terms of accuracy and robustness. Source code and models are available at: https://github.com/Peterchen111/FERNet."



Paperid:889
Authors:Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin
Title: Multiple Sound Sources Localization from Coarse to Fine
Abstract:
How to visually localize multiple sound sources in unconstrained videos is a formidable problem, especially in the absence of pairwise sound-object annotations. To solve this problem, we develop a two-stage audiovisual learning framework that disentangles audio and visual representations of different categories from complex scenes, then performs cross-modal feature alignment in a coarse-to-fine manner. Our model achieves state-of-the-art results on a public localization dataset, as well as considerable performance on multi-source sound localization in complex scenes. We then employ the localization results for sound separation and obtain comparable performance to existing methods. These outcomes demonstrate our model's ability to effectively align sounds with specific visual sources."



Paperid:890
Authors:Jinyoung Choi, Bohyung Han
Title: Task-Aware Quantization Network for JPEG Image Compression
Abstract:
We propose to learn a deep neural network for JPEG image compression, which predicts image-specific optimized quantization tables fully compatible with the standard JPEG encoder and decoder. Moreover, our approach provides the capability to learn task-specific quantization tables in a principled way by adjusting the objective function of the network. The main challenge to realize this idea is that there exist non-differentiable components in the encoder such as run-length encoding and Huffman coding and it is not straightforward to predict the probability distribution of the quantized image representations. We address these issues by learning a differentiable loss function that approximates bitrates using simple network blocks---two MLPs and an LSTM. We evaluate the proposed algorithm using multiple task-specific losses---two for semantic image understanding and another two for conventional image compression---and demonstrate the effectiveness of our approach to the individual tasks. "
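A minimal sketch of the differentiable core of such a pipeline: quantizing 8x8 DCT blocks with a predicted table while using a straight-through estimator for the non-differentiable rounding. The entropy-coding and bitrate-approximation components described in the abstract are not modelled here.

    import torch

    def quantize_blocks(dct_blocks, q_table):
        """Differentiable quantisation of 8x8 DCT blocks with a predicted table.
        dct_blocks: (N, 8, 8) DCT coefficients, q_table: (8, 8) positive table."""
        scaled = dct_blocks / q_table
        rounded = torch.round(scaled)
        # straight-through estimator: forward uses the rounded values, the
        # backward pass treats rounding as the identity
        quantized = scaled + (rounded - scaled).detach()
        return quantized * q_table          # de-quantised coefficients for the decoder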



Paperid:891
Authors:Fredrik K. Gustafsson, Martin Danelljan, Goutam Bhat, Thomas B. Schön
Title: Energy-Based Models for Deep Probabilistic Regression
Abstract:
While deep learning-based classification is generally tackled using standardized approaches, a wide variety of techniques are employed for regression. In computer vision, one particularly popular such technique is that of confidence-based regression, which entails predicting a confidence value for each input-target pair $(x, y)$. While this approach has demonstrated impressive results, it requires important task-dependent design choices, and the predicted confidences lack a natural probabilistic meaning. We address these issues by proposing a general and conceptually simple regression method with a clear probabilistic interpretation. In our proposed approach, we create an energy-based model of the conditional target density $p(y | x)$, using a deep neural network to predict the un-normalized density from $(x, y)$. This model of $p(y | x)$ is trained by directly minimizing the associated negative log-likelihood, approximated using Monte Carlo sampling. We perform comprehensive experiments on four computer vision regression tasks. Our approach outperforms direct regression, as well as other probabilistic and confidence-based methods. Notably, our model achieves a $2.2\%$ AP improvement over Faster-RCNN for object detection on the COCO dataset, and sets a new state-of-the-art on visual tracking when applied for bounding box estimation. In contrast to confidence-based methods, our approach is also shown to be directly applicable to more general tasks such as age and head-pose estimation. Code is available at https://github.com/fregu856/ebms_regression."
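A sketch of the Monte Carlo approximated negative log-likelihood for such an energy-based model $p(y|x) \propto \exp f_\theta(x, y)$, using importance samples from Gaussians centred at the target; the proposal distribution and the `energy_net` interface are assumptions.

    import torch

    def ebm_regression_nll(energy_net, x, y, num_samples=128, sigma=0.1):
        """Importance-sampled NLL for an energy-based model of p(y|x).
        energy_net(x, y) -> (B,) unnormalised log-density; x: (B, Dx), y: (B, Dy)."""
        B, Dy = y.shape
        # proposal q(y) = N(y_true, sigma^2 I)
        y_samples = y.unsqueeze(1) + sigma * torch.randn(B, num_samples, Dy, device=y.device)
        log_q = torch.distributions.Normal(y.unsqueeze(1), sigma).log_prob(y_samples).sum(-1)
        x_rep = x.unsqueeze(1).expand(-1, num_samples, -1)
        f_samples = energy_net(x_rep.reshape(-1, x.shape[1]),
                               y_samples.reshape(-1, Dy)).view(B, num_samples)
        # log Z(x) ~= log mean_i exp(f(x, y_i) - log q(y_i))
        log_z = torch.logsumexp(f_samples - log_q, dim=1) - torch.log(
            torch.tensor(float(num_samples), device=y.device))
        return (log_z - energy_net(x, y)).mean()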



Paperid:892
Authors:Hugo Bertiche, Meysam Madadi, Sergio Escalera
Title: CLOTH3D: Clothed 3D Humans
Abstract:
We present CLOTH3D, the first large-scale synthetic dataset of 3D clothed human sequences. CLOTH3D contains large variability in garment type, topology, shape, size, tightness and fabric. Clothes are simulated on top of thousands of different pose sequences and body shapes, generating realistic cloth dynamics. We provide the dataset with a generative model for cloth generation. We propose a Conditional Variational Auto-Encoder (CVAE) based on graph convolutions (GCVAE) to learn garment latent spaces. This allows for realistic generation of 3D garments on top of the SMPL model for any pose and shape."



Paperid:893
Authors:Kang Zhou, Yuting Xiao, Jianlong Yang, Jun Cheng, Wen Liu, Weixin Luo, Zaiwang Gu, Jiang Liu, Shenghua Gao
Title: Encoding Structure-Texture Relation with P-Net for Anomaly Detection in Retinal Images
Abstract:
Anomaly detection in retinal images refers to the identification of abnormality caused by various retinal diseases/lesions, by leveraging only normal images in the training phase. Normal images from healthy subjects often have regular structures (e.g., the structured blood vessels in the fundus image, or structured anatomy in optical coherence tomography images). On the contrary, diseases and lesions often destroy these structures. Motivated by this, we propose to leverage the relation between image texture and structure to design a deep neural network for anomaly detection. Specifically, we first extract the structure of the retinal images, then we combine both the structure features and the last-layer features extracted from the original healthy image to reconstruct the original input healthy image. The image feature provides the texture information and guarantees the uniqueness of the image recovered from the structure. In the end, we further utilize the reconstructed image to extract the structure and measure the difference between the structures extracted from the original and the reconstructed images. On the one hand, minimizing the reconstruction difference behaves like a regularizer to guarantee that the image is correctly reconstructed. On the other hand, such structure difference can also be used as a metric for normality measurement. The whole network is termed P-Net because it has a "P" shape. Extensive experiments on the RESC and iSee datasets validate the effectiveness of our approach for anomaly detection in retinal images. Further, our method also generalizes well to novel class discovery for retinal images and anomaly detection for real-world images."



Paperid:894
Authors:Xingping Dong, Jianbing Shen, Ling Shao, Fatih Porikli
Title: CLNet: A Compact Latent Network for Fast Adjusting Siamese Trackers
Abstract:
In this paper, we provide a deep analysis of Siamese-based trackers and find that one core reason for their failure on challenging cases is the problem of decisive samples missing during offline training. Furthermore, we notice that the samples given in the first frame can be viewed as the decisive samples for the sequence since they contain rich sequence-specific information. To make full use of these sequence-specific samples, we propose a compact latent network to quickly adjust the tracking model to adapt to new scenes. A statistic-based compact latent feature is proposed to efficiently capture the sequence-specific information for the fast adjustment. In addition, we design a new training approach based on a diverse sample mining strategy to further improve the discrimination ability of our compact latent network. To evaluate the effectiveness of our method, we apply it to adjust a recent state-of-the-art tracker, SiamRPN++. Extensive experimental results on five recent benchmarks demonstrate that the adjusted tracker achieves promising improvement in tracking accuracy, with almost the same speed. The code and models are available at \url{https://github.com/xingpingdong/CLNet-tracking}."



Paperid:895
Authors:Lu Zhou, Yingying Chen, Yunze Gao, Jinqiao Wang, Hanqing Lu
Title: Occlusion-Aware Siamese Network for Human Pose Estimation
Abstract:
Pose estimation usually suffers from varying degrees of performance degradation owing to occlusion. To conquer this dilemma, we propose an occlusion-aware siamese network to improve performance. Specifically, we introduce a scheme of feature erasing and reconstruction. First, we utilize an attention mechanism to predict an occlusion-aware attention map, which is explicitly supervised, and clean the feature map contaminated by different types of occlusions. Nevertheless, the cleaning procedure not only removes the useless information but also erases some valuable details. To overcome the defects caused by the erasing operation, we perform feature reconstruction to recover the information destroyed by occlusion and the details lost in the cleaning procedure. To make the reconstructed features more precise and informative, we adopt a siamese network equipped with an OT divergence to guide the features of occluded images towards those of the un-occluded images. The algorithm is validated on the MPII, LSP and COCO benchmarks, and we achieve promising results."



Paperid:896
Authors:Yufan Liu, Minglang Qiao, Mai Xu, Bing Li, Weiming Hu, Ali Borji
Title: Learning to Predict Salient Faces: A Novel Visual-Audio Saliency Model
Abstract:
Recently, video streams have occupied a large proportion of Internet traffic, most of which contain human faces. Hence, it is necessary to predict saliency on multiple-face videos, which can provide attention cues for many content-based applications. However, most multiple-face prediction works consider only visual information and ignore audio, which is not consistent with naturalistic scenarios. Several behavioral studies have established that sound influences human attention, especially during speech turn-taking in multiple-face videos. In this paper, we thoroughly investigate such influences by establishing a large-scale eye-tracking database of Multiple-face Video in Visual-Audio condition (MVVA). Inspired by the findings of our investigation, we propose a novel multi-modal video saliency model consisting of three branches: visual, audio and face. The visual branch takes the RGB frames as input and encodes them into visual feature maps. The audio and face branches encode the audio signal and multiple cropped faces, respectively. A fusion module is introduced to integrate the information from the three modalities and to generate the final saliency map. Experimental results show that the proposed method outperforms 11 state-of-the-art saliency prediction works. It performs closer to human multi-modal attention."



Paperid:897
Authors:Lizhen Wang, Xiaochen Zhao, Tao Yu, Songtao Wang, Yebin Liu
Title: NormalGAN: Learning Detailed 3D Human from a Single RGB-D Image
Abstract:
We propose NormalGAN, a fast adversarial learning-based method to reconstruct a complete and detailed 3D human from a single RGB-D image. Given a single front-view RGB-D image, NormalGAN performs two steps: front-view RGB-D rectification and back-view RGB-D inference. The final model is then generated by simply combining the front-view and back-view RGB-D information. However, inferring the back-view RGB-D image with high-quality geometric details and plausible texture is not trivial. Our key observation is: normal maps generally encode much more information about 3D surface details than RGB and depth images. Therefore, learning geometric details from normal maps is superior to learning from other representations. In NormalGAN, an adversarial learning framework conditioned on normal maps is introduced, which is used not only to improve the front-view depth denoising performance, but also to infer the back-view depth image with surprisingly fine geometric details. Moreover, for texture recovery, we remove shading information from the front-view RGB image based on the refined normal map, which further improves the quality of the back-view color inference. Results and experiments on both a test dataset and real captured data demonstrate the superior performance of our approach. Given a consumer RGB-D sensor, NormalGAN can generate complete and detailed 3D human reconstruction results at 20 fps, which further enables convenient interactive experiences in telepresence, AR/VR and gaming scenarios."



Paperid:898
Authors:Fabio Pizzati, Pietro Cerri, Raoul de Charette
Title: Model-based occlusion disentanglement for image-to-image translation
Abstract:
Image-to-image translation is affected by entanglement phenomena, which may occur in case of target data encompassing occlusions such as raindrops, dirt, etc. Our unsupervised model-based learning disentangles scene and occlusions, while benefiting from an adversarial pipeline to regress physical parameters of the occlusion model. The experiments demonstrate our method is able to handle varying types of occlusions and generate highly realistic translations, qualitatively and quantitatively outperforming the state-of-the-art on multiple datasets."



Paperid:899
Authors:Yu Zheng, Danyang Zhang, Sinan Xie, Jiwen Lu, Jie Zhou
Title: Rotation-robust Intersection over Union for 3D Object Detection
Abstract:
In this paper, we propose a Rotation-robust Intersection over Union ($\textit{RIoU}$) for 3D object detection, which aims to jointly learn the overlap of rotated bounding boxes. In most existing 3D object detection methods, a norm-based loss is adopted to individually regress the parameters of bounding boxes, which may suffer from a loss-metric mismatch due to the scaling problem. Motivated by the IoU loss in axis-aligned 2D object detection, which is invariant to scale, our method jointly optimizes the parameters via the $\textit{RIoU}$ loss. To tackle the uncertainty of the convex shape caused by rotation, a projection operation is defined to estimate the intersection area. The calculation of $\textit{RIoU}$ and its loss function is robust to rotation and feasible for back-propagation, as it comprises only basic numerical operations. By incorporating the $\textit{RIoU}$ loss with the conventional norm-based loss function, we enforce the network to directly optimize the $\textit{RIoU}$. Experimental results on the KITTI and nuScenes datasets validate the effectiveness of our proposed method. Moreover, we show that our method is suitable for the detection of 2D rotated objects, such as text boxes and cluttered targets in aerial images."
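For reference, the axis-aligned IoU loss that motivates RIoU can be written as below; this is the standard 2D formulation only and does not reproduce the paper's projection-based overlap for rotated boxes.

```python
# Sketch of the standard axis-aligned IoU loss that motivates RIoU. Boxes are (x1, y1, x2, y2).
# The paper's projection-based intersection estimate for rotated boxes is not reproduced here.
import torch

def iou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    # Intersection rectangle
    lt = torch.max(pred[:, :2], target[:, :2])   # (N, 2) top-left of the overlap
    rb = torch.min(pred[:, 2:], target[:, 2:])   # (N, 2) bottom-right of the overlap
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    # Union
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    return (1.0 - iou).mean()   # scale-invariant; optimizes all box parameters jointly
```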



Paperid:900
Authors:Yi Huang, Fan Wang, Adams Wai-Kin Kong, Kwok-Yan Lam
Title: New Threats against Object Detector with Non-local Block
Abstract:
The introduction of non-local blocks to the traditional CNN architecture enhances its performance on various computer vision tasks by improving its ability to capture long-range dependencies. However, the usage of non-local blocks may also introduce new threats to computer vision systems. Therefore, it is important to study the threats caused by non-local blocks before directly applying them to commercial systems. In this paper, two new threats named disappearing attack and appearing attack against object detectors with a non-local block are investigated. The former aims at misleading an object detector with a non-local block such that it is unable to detect a target object category, while the latter aims at misleading the object detector such that it detects a predefined object category which is not present in images. Different from existing attacks against object detectors, these threats can be performed in long-range cases, meaning that the target object and the universal adversarial patches learned from the proposed algorithms can have a long distance between them. To examine the threats, digital and physical experiments are conducted on Faster R-CNN with a non-local block and 6331 images from 56 videos. The experiments show that the universal patches are able to mislead the detector with greater probabilities. To explain the threats from non-local blocks, the receptive fields of CNN models with and without non-local blocks are studied empirically and theoretically."



Paperid:901
Authors:Xinpeng Xie, Jiawei Chen, Yuexiang Li, Linlin Shen, Kai Ma, Yefeng Zheng
Title: Self-Supervised CycleGAN for Object-Preserving Image-to-Image Domain Adaptation
Abstract:
Recent generative adversarial network (GAN) based methods (e.g., CycleGAN) are prone to fail at preserving image-objects in image-to-image translation, which reduces their practicality on tasks such as domain adaptation. Some frameworks have been proposed to adopt a segmentation network as the auxiliary regularization to prevent the content distortion. However, all of them require extra pixel-wise annotations, which is difficult to fulfill in practical applications. In this paper, we propose a novel GAN (namely OP-GAN) to address the problem, which involves a self-supervised module to enforce the image content consistency during image-to-image translations without any extra annotations. We evaluate the proposed OP-GAN on three publicly available datasets. The experimental results demonstrate that our OP-GAN can yield visually plausible translated images and significantly improve the semantic segmentation accuracy in different domain adaptation scenarios with off-the-shelf deep learning networks such as PSPNet and U-Net."



Paperid:902
Authors:Federica Arrigoni, Luca Magri, Tomas Pajdla
Title: On the Usage of the Trifocal Tensor in Motion Segmentation
Abstract:
Motion segmentation, i.e., the problem of clustering data in multiple images based on different 3D motions, is an important task for reconstructing and understanding dynamic scenes. In this paper we address motion segmentation in multiple images by combining partial results coming from triplets of images, which are obtained by fitting a number of trifocal tensors to correspondences. We exploit the fact that the trifocal tensor is a stronger model than the fundamental matrix, as it provides fewer but more reliable matches over three images than fundamental matrices provide over two. We also consider an alternative solution which merges partial results coming from both triplets and pairs of images, showing the strength of three-frame segmentation in combination with two-frame segmentation. Our real experiments on standard as well as new datasets demonstrate the superior accuracy of the proposed approaches when compared to previous techniques."



Paperid:903
Authors:Wen Shen, Binbin Zhang, Shikun Huang, Zhihua Wei, Quanshi Zhang
Title: 3D-Rotation-Equivariant Quaternion Neural Networks
Abstract:
This paper proposes a set of rules to revise various neural networks for 3D point cloud processing to rotation-equivariant quaternion neural networks (REQNNs). We find that when a neural network uses quaternion features, the network feature naturally has the rotation-equivariance property. Rotation equivariance means that applying a specific rotation transformation to the input point cloud is equivalent to applying the same rotation transformation to all intermediate-layer quaternion features. Besides, the REQNN also ensures that the intermediate-layer features are invariant to the permutation of input points. Compared with the original neural network, the REQNN exhibits higher rotation robustness."



Paperid:904
Authors:Gyeongsik Moon, Shoou-I Yu, He Wen, Takaaki Shiratori, Kyoung Mu Lee
Title: InterHand2.6M: A Dataset and Baseline for 3D Interacting Hand Pose Estimation from a Single RGB Image
Abstract:
Analysis of hand-hand interactions is a crucial step towards better understanding human behavior. However, most research in 3D hand pose estimation has focused on the isolated single-hand case. Therefore, we first propose (1) a large-scale dataset, InterHand2.6M, and (2) a baseline network, InterNet, for 3D interacting hand pose estimation from a single RGB image. The proposed InterHand2.6M consists of \textbf{2.6M labeled single and interacting hand frames} under various poses from multiple subjects. Our InterNet simultaneously performs 3D single and interacting hand pose estimation. In our experiments, we demonstrate big gains in 3D interacting hand pose estimation accuracy when leveraging the interacting hand data in InterHand2.6M. We also report the accuracy of InterNet on InterHand2.6M, which serves as a strong baseline for this new dataset. Finally, we show 3D interacting hand pose estimation results on general images. Our code and dataset are available\footnote{\url{https://mks0601.github.io/InterHand2.6M/}}."



Paperid:905
Authors:Zhen Zhao, Miaojing Shi, Xiaoxiao Zhao, Li Li
Title: Active Crowd Counting with Limited Supervision
Abstract:
To learn a reliable people counter from crowd images, head center annotations are normally required. Annotating head centers is, however, a laborious and tedious process in dense crowds. In this paper, we present an active learning framework which enables accurate crowd counting with limited supervision: given a small labeling budget, instead of randomly selecting images to annotate, we first introduce an active labeling strategy to annotate the most informative images in the dataset and learn the counting model upon them. The process is repeated such that in every cycle we select the samples that are diverse in crowd density and dissimilar to previous selections. In the last cycle, when the labeling budget is met, the large amount of unlabeled data is also utilized: a distribution classifier is introduced to align the labeled data with unlabeled data; furthermore, we propose to mix up the distribution labels and latent representations of data in the network to particularly improve the distribution alignment between training samples. We follow the popular density estimation pipeline for crowd counting. Extensive experiments are conducted on standard benchmarks, i.e. ShanghaiTech, UCF CC 50, Mall, TRANCOS, and DCC. By annotating a limited number of images (e.g. 10% of the dataset), our method reaches levels of performance not far from the state of the art, which utilizes full annotations of the dataset."



Paperid:906
Authors:Marvin Klingner, Jan-Aike Termöhlen, Jonas Mikolajczyk, Tim Fingscheidt
Title: Self-Supervised Monocular Depth Estimation: Solving the Dynamic Object Problem by Semantic Guidance
Abstract:
Self-supervised monocular depth estimation presents a powerful method to obtain 3D scene information from single camera images, which is trainable on arbitrary image sequences without requiring depth labels, e.g., from a LiDAR sensor. In this work we present a new self-supervised semantically-guided depth estimation (SGDepth) method to deal with moving dynamic-class (DC) objects, such as moving cars and pedestrians, which violate the static-world assumptions typically made during training of such models. Specifically, we propose (i) mutually beneficial cross-domain training of (supervised) semantic segmentation and self-supervised depth estimation with task-specific network heads, (ii) a semantic masking scheme providing guidance to prevent moving DC objects from contaminating the photometric loss, and (iii) a detection method for frames with non-moving DC objects, from which the depth of DC objects can be learned. We demonstrate the performance of our method on several benchmarks, in particular on the Eigen split, where we exceed all baselines without test-time refinement."
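A minimal sketch of contribution (ii), masking dynamic-class pixels out of the photometric loss, assuming hypothetical class IDs and tensor names rather than the paper's exact implementation.

```python
# Minimal sketch of masking dynamic-class (DC) pixels out of a per-pixel photometric loss,
# in the spirit of the semantic guidance described above. Class IDs and tensor names are
# illustrative assumptions, not the paper's exact implementation.
import torch

def masked_photometric_loss(photo_err: torch.Tensor,
                            seg_pred: torch.Tensor,
                            dc_class_ids=(11, 12, 13)) -> torch.Tensor:
    """photo_err: (B, H, W) per-pixel photometric error;
       seg_pred: (B, H, W) predicted semantic class indices."""
    dc_mask = torch.zeros_like(seg_pred, dtype=torch.bool)
    for c in dc_class_ids:                        # mark pixels belonging to moving-object classes
        dc_mask |= (seg_pred == c)
    static_mask = (~dc_mask).float()              # keep only (assumed) static pixels
    return (photo_err * static_mask).sum() / static_mask.sum().clamp(min=1.0)
```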



Paperid:907
Authors:Shaoxiang Chen, Yu-Gang Jiang
Title: Hierarchical Visual-Textual Graph for Temporal Activity Localization via Language
Abstract:
Temporal Activity Localization via Language (TALL) in video is a recently proposed, challenging vision task; tackling it requires a fine-grained understanding of the video content, which is overlooked by most existing works. In this paper, we propose a novel TALL method which builds a Hierarchical Visual-Textual Graph to model interactions between the objects and words, as well as among the objects, to jointly understand the video content and the language. We also design a convolutional network with a cross-channel communication mechanism to further encourage information passing between the visual and textual modalities. Finally, we propose a loss function that enforces alignment of the visual representation of the localized activity and the sentence representation, so that the model can predict more accurate temporal boundaries. We evaluated our proposed method on two popular benchmark datasets, Charades-STA and ActivityNet Captions, and achieved state-of-the-art performance on both. Code is available at \url{https://github.com/forwchen/HVTG}."



Paperid:908
Authors:Thibaut Issenhuth, Jérémie Mary, Clément Calauzènes
Title: Do Not Mask What You Do Not Need to Mask: a Parser-Free Virtual Try-On
Abstract:
The 2D virtual try-on task has recently attracted great interest from the research community, for its direct potential applications in online shopping as well as for its inherent and unaddressed scientific challenges. This task requires fitting an in-shop cloth image on the image of a person, which is highly challenging because it involves cloth warping, image compositing, and synthesizing. Casting virtual try-on as a supervised task faces a difficulty: available datasets are composed of pairs of pictures (cloth, person wearing the cloth). Thus, we have no access to ground truth when the cloth on the person changes. State-of-the-art models solve this by masking the cloth information on the person with both a human parser and a pose estimator. Then, image synthesis modules are trained to reconstruct the person image from a masked person image and a cloth image. This procedure has several caveats: first, human parsers are prone to errors; second, it is a costly pre-processing step, which also has to be applied at inference time; finally, it makes the task harder than it is since the mask covers information that should be kept, such as hands or accessories. In this paper, we propose a novel student-teacher paradigm where the teacher is trained in the standard way (reconstruction) before guiding the student to focus on the initial task (changing the cloth). The student additionally learns from an adversarial loss, which pushes it to follow the distribution of real images. Consequently, the student exploits information that is masked to the teacher. A student trained without the adversarial loss would not use this information. Also, getting rid of both the human parser and the pose estimator at inference time allows real-time virtual try-on."



Paperid:909
Authors:Yuren Cong, Hanno Ackermann, Wentong Liao, Michael Ying Yang, Bodo Rosenhahn
Title: NODIS: Neural Ordinary Differential Scene Understanding
Abstract:
Semantic image understanding is a challenging topic in computer vision. It requires not only detecting all objects in an image, but also identifying all the relations between them. Detected objects, their labels and the discovered relations can be used to construct a scene graph which provides an abstract semantic interpretation of an image. In previous works, relations were identified by solving an assignment problem formulated as a Mixed-Integer Linear Program. In this work, we interpret that formulation as an Ordinary Differential Equation (ODE). The proposed architecture performs scene graph inference by solving a neural variant of an ODE through end-to-end learning. It achieves state-of-the-art results on all three benchmark tasks: scene graph generation (SGGen), classification (SGCls) and visual relationship detection (PredCls) on the Visual Genome benchmark."



Paperid:910
Authors:Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova
Title: AssembleNet++: Assembling Modality Representations via Attention Connections
Abstract:
We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network. A new network component named peer-attention is introduced, which dynamically learns the attention weights using another block or input modality. Even without pre-training, our models outperform previous work on standard public activity recognition datasets with continuous videos, establishing a new state of the art. We also confirm that our findings of having neural connections from the object modality and using peer-attention are generally applicable to different existing architectures, improving their performance. We name our model explicitly AssembleNet++. The code will be available at: https://sites.google.com/corp/view/assemblenet/"



Paperid:911
Authors:Yiding Yang, Jiayan Qiu, Mingli Song, Dacheng Tao, Xinchao Wang
Title: Learning Propagation Rules for Attribution Map Generation
Abstract:
Existing gradient-based attribution-map methods rely on hand-crafted propagation rules for the non-linear/activation layers during the backward pass, so as to produce gradients of the input and then the attribution map. Despite the promising results achieved, such methods are sensitive to non-informative high-frequency components and lack adaptability for various models and samples. In this paper, we propose a dedicated method to generate attribution maps that allows us to learn the propagation rules automatically, overcoming the flaws of the hand-crafted ones. Specifically, we introduce a learnable plugin module, which enables adaptive propagation rules for each pixel, to the non-linear layers during the backward pass for mask generation. The masked input image is then fed into the model again to obtain a new output that can be used as guidance when combined with the original one. The introduced learnable module can be trained under any auto-grad framework with higher-order differential support. As demonstrated on several datasets and network architectures, the proposed method yields state-of-the-art results and gives cleaner and more visually plausible attribution maps."



Paperid:912
Authors:Menelaos Kanakis, David Bruggemann, Suman Saha, Stamatios Georgoulis , Anton Obukhov, Luc Van Gool
Title: Reparameterizing Convolutions for Incremental Multi-Task Learning without Task Interference
Abstract:
Multi-task networks are commonly utilized to alleviate the need for a large number of highly specialized single-task networks. However, two common challenges in developing multi-task models are often overlooked in the literature. First, enabling the model to be inherently incremental, continuously incorporating information from new tasks without forgetting the previously learned ones (incremental learning). Second, eliminating adverse interactions amongst tasks, which has been shown to significantly degrade the single-task performance in a multi-task setup (task interference). In this paper, we show that both can be achieved simply by reparameterizing the convolutions of standard neural network architectures into a non-trainable shared part (filter bank) and task-specific parts (modulators), where each modulator has a fraction of the filter bank parameters. Thus, our reparameterization enables the model to learn new tasks without adversely affecting the performance of existing ones. The results of our ablation study attest to the efficacy of the proposed reparameterization. Moreover, our method achieves state-of-the-art on two challenging multi-task learning benchmarks, PASCAL-Context and NYUD, and also demonstrates superior incremental learning capability as compared to its close competitors. The code and models are made publicly available."
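One way to read this reparameterization is as a frozen shared filter bank followed by a small task-specific 1x1 modulator per task; the sketch below illustrates that structure with assumed layer sizes, not the authors' exact parameterization.

```python
# Minimal sketch of the reparameterization idea: a non-trainable shared filter bank followed by a
# small task-specific 1x1 "modulator" per task. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ReparamConv(nn.Module):
    def __init__(self, in_ch, out_ch, bank_size=64, kernel_size=3, num_tasks=2):
        super().__init__()
        self.bank = nn.Conv2d(in_ch, bank_size, kernel_size, padding=kernel_size // 2, bias=False)
        self.bank.weight.requires_grad_(False)             # shared filter bank stays frozen
        self.modulators = nn.ModuleList(
            nn.Conv2d(bank_size, out_ch, kernel_size=1, bias=False) for _ in range(num_tasks)
        )                                                   # each task learns only its modulator

    def forward(self, x, task_id: int):
        return self.modulators[task_id](self.bank(x))

layer = ReparamConv(3, 32)
y = layer(torch.randn(1, 3, 64, 64), task_id=0)             # (1, 32, 64, 64)
```

Adding a new task then only appends a new modulator, leaving previously learned tasks untouched.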



Paperid:913
Authors:Karl Schmeckpeper, Annie Xie, Oleh Rybkin, Stephen Tian, Kostas Daniilidis, Sergey Levine, Chelsea Finn
Title: Learning Predictive Models from Observation and Interaction
Abstract:
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works, and then use this learned model to plan coordinated sequences of actions to bring about desired outcomes. However, learning a model that captures the dynamics of complex skills represents a major challenge: if the agent needs a good model to perform these skills, it might never be able to collect the experience on its own that is required to learn these delicate and complex behaviors. Instead, we can imagine augmenting the training set with observational data of other agents, such as humans. Such data is likely more plentiful, but cannot always be combined with data from the original agent. For example, videos of humans might show a robot how to use a tool, but (i) are not annotated with suitable robot actions, and (ii) contain a systematic distributional shift due to the embodiment differences between humans and robots. We address the first challenge by formulating the corresponding graphical model and treating the action as an observed variable for the interaction data and an unobserved variable for the observation data, and the second challenge by using a domain-dependent prior. In addition to interaction data, our method is able to leverage videos of passive observations in a driving dataset and a dataset of robotic manipulation videos to improve video prediction performance. We also demonstrate that, in a real-world tabletop robotic manipulation setting, we are able to significantly improve control performance by learning a model from both robot data and observations of humans."



Paperid:914
Authors:Bingyi Cao, André Araujo, Jack Sim
Title: Unifying Deep Local and Global Features for Image Search
Abstract:
Image retrieval is the problem of searching an image database for items that are similar to a query image. To address this task, two main types of image representations have been studied: global and local image features. In this work, our key contribution is to unify global and local features into a single deep model, enabling accurate retrieval with efficient feature extraction. We refer to the new model as DELG, standing for DEep Local and Global features. We leverage lessons from recent feature learning work and propose a model that combines generalized mean pooling for global features and attentive selection for local features. The entire network can be learned end-to-end by carefully balancing the gradient flow between two heads -- requiring only image-level labels. We also introduce an autoencoder-based dimensionality reduction technique for local features, which is integrated into the model, improving training efficiency and matching performance. Comprehensive experiments show that our model achieves state-of-the-art image retrieval on the Revisited Oxford and Paris datasets, and state-of-the-art single-model instance-level recognition on the Google Landmarks dataset v2. Code and models are available at https://github.com/tensorflow/models/tree/master/research/delf ."
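The global head mentioned above builds on generalized mean (GeM) pooling; a standard GeM layer is sketched below as a hedged illustration (the attentive local branch and the autoencoder-based reduction are not shown).

```python
# Sketch of generalized mean (GeM) pooling for a global-feature head; the exponent p is learnable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeM(nn.Module):
    def __init__(self, p: float = 3.0, eps: float = 1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, C, H, W)
        x = x.clamp(min=self.eps).pow(self.p)               # element-wise power
        return F.avg_pool2d(x, kernel_size=x.shape[-2:]).pow(1.0 / self.p).flatten(1)  # (B, C)

feat = GeM()(torch.randn(2, 512, 7, 7))                      # (2, 512) global descriptor
```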



Paperid:915
Authors:Jie Song, Xu Chen, Otmar Hilliges
Title: Human Body Model Fitting by Learned Gradient Descent
Abstract:
We propose a novel algorithm for fitting 3D human shape to images. Combining the accuracy and refinement capabilities of iterative gradient-based optimization techniques with the robustness of deep neural networks, we propose a gradient descent algorithm that leverages a neural network to predict the parameter update rule for each iteration. This per-parameter and state-aware update guides the optimizer towards a good solution, typically converging in very few steps. During training our approach only requires MoCap data of human poses, parametrized via SMPL. From this data the network learns a subspace of valid poses and shapes in which optimization is performed much more efficiently. The approach does not require any hard-to-acquire image-to-3D correspondences. At test time we only optimize the 2D joint re-projection error, without the need for any further priors or regularization terms. We show empirically that this algorithm is fast (avg. 120ms convergence), robust to initialization and dataset, and achieves state-of-the-art results on public evaluation datasets including the challenging 3DPW in-the-wild benchmark (a $45\%$ improvement over SMPLify, and also over approaches using image-to-3D correspondences)."
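A minimal sketch of the learned-update-rule idea, using a toy objective and parameter vector in place of the SMPL fitting setup; the update network and shapes are illustrative assumptions.

```python
# Minimal sketch of "learned gradient descent": a small network maps the current gradient and
# parameter value to a per-parameter update. The objective, parameter dimension, and network are
# toy stand-ins, not the paper's SMPL fitting setup.
import torch
import torch.nn as nn

class UpdateRule(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, grad, param):                          # both (D,)
        inp = torch.stack([grad, param], dim=-1)             # (D, 2): per-parameter, state-aware
        return self.net(inp).squeeze(-1)                     # (D,) predicted update

def fit(objective, params, update_rule, num_steps=5):
    for _ in range(num_steps):                               # very few steps once the rule is trained
        grad = torch.autograd.grad(objective(params), params, create_graph=True)[0]
        params = params + update_rule(grad, params)
    return params

# Toy usage with a quadratic objective (hypothetical).
rule = UpdateRule()
theta = torch.zeros(10, requires_grad=True)
theta_fit = fit(lambda p: ((p - 1.0) ** 2).sum(), theta, rule)
```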



Paperid:916
Authors:Matthew Korban, Xin Li
Title: DDGCN: A Dynamic Directed Graph Convolutional Network for Action Recognition
Abstract:
We propose a Dynamic Directed Graph Convolutional Network (DDGCN) to model spatial and temporal features of human actions from their skeletal representations. The DDGCN consists of three new feature modeling modules: (1) Dynamic Convolutional Sampling (DCS), (2) Dynamic Convolutional Weight (DCW) assignment, and (3) Directed Graph Spatial-Temporal (DGST) feature extraction. Comprehensive experiments show that the DDGCN outperforms existing state-of-the-art action recognition approaches in various testing datasets. Our source code and model will be released at http://www.ece.lsu.edu/xinli/ActionModeling/index.html ."



Paperid:917
Authors:Fei Ye, Adrian G. Bors
Title: Learning latent representations across multiple data domains using Lifelong VAEGAN
Abstract:
The problem of catastrophic forgetting occurs in deep learning models trained on multiple databases in a sequential manner. Recently, generative replay mechanisms (GRMs) have been proposed to reproduce previously learned knowledge, aiming to reduce forgetting. However, such approaches lack an appropriate inference model and therefore cannot provide latent representations of data. In this paper, we propose a novel lifelong learning approach, namely the Lifelong VAEGAN (L-VAEGAN), which not only induces a powerful generative replay network but also learns meaningful latent representations, benefiting representation learning. L-VAEGAN can automatically embed the information associated with different domains into several clusters in the latent space, while also capturing semantically meaningful shared latent variables across different data domains. The proposed model supports many downstream tasks that traditional generative replay methods cannot, including interpolation and inference across different data domains."



Paperid:918
Authors:Miao Liao, Feixiang Lu, Dingfu Zhou, Sibo Zhang, Wei Li, Ruigang Yang
Title: DVI: Depth Guided Video Inpainting for Autonomous Driving
Abstract:
To obtain clear street views and photo-realistic simulation in autonomous driving, we present an automatic video inpainting algorithm that can remove traffic agents from videos and synthesize missing regions with the guidance of depth/point clouds. By building a dense 3D map from stitched point clouds, frames within a video are geometrically correlated via this common 3D map. In order to fill a target inpainting area in a frame, it is straightforward to transform pixels from other frames into the current one with correct occlusion. Furthermore, we are able to fuse multiple videos through 3D point cloud registration, making it possible to inpaint a target video with multiple source videos. The motivation is to solve the long-time occlusion problem, where an occluded area has never been visible in the entire video. This happens quite often in street-view videos when there is a vehicle in front. In this case, we can capture the scene a second time when the desired area becomes visible and perform video fusion inpainting. To our knowledge, we are the first to fuse multiple videos for video inpainting. In order to evaluate our method, we collected 5 hours of synchronized LiDAR and camera data from autonomous driving cars on urban roads. To address long-time occlusion problems, we collected data for the same road on different days and at different times. The experimental results show that the proposed approach outperforms the state-of-the-art approaches on all criteria; in particular, the RMSE (Root Mean Squared Error) is reduced by about 13%."



Paperid:919
Authors:Kenan E. Ak, Ning Xu, Zhe Lin, Yilin Wang
Title: Incorporating Reinforced Adversarial Learning in Autoregressive Image Generation
Abstract:
Autoregressive models have recently achieved results comparable to state-of-the-art Generative Adversarial Networks (GANs) with the help of Vector Quantized Variational AutoEncoders (VQ-VAE). However, autoregressive models have several limitations, such as exposure bias, and their training objective does not guarantee visual fidelity. To address these limitations, we propose to use Reinforced Adversarial Learning (RAL) based on policy gradient optimization for autoregressive models. By applying RAL, we enable a similar process for training and testing to address the exposure bias issue. In addition, visual fidelity is further optimized with an adversarial loss inspired by their strong counterparts: GANs. Due to the slow sampling speed of autoregressive models, we propose to use partial generation for faster training. RAL also empowers collaboration between different modules of the VQ-VAE framework. To the best of our knowledge, the proposed method is the first to enable adversarial learning in autoregressive models for image generation. Experiments on synthetic and real-world datasets show improvements over MLE-trained models. The proposed method improves both negative log-likelihood (NLL) and Fréchet Inception Distance (FID), which indicates improvements in visual quality and diversity. The proposed method achieves state-of-the-art results on CelebA at 64x64 image resolution, showing promise for large-scale image generation."



Paperid:920
Authors:A. Braunegg, Amartya Chakraborty, Michael Krumdick, Nicole Lape, Sara Leary, Keith Manville, Elizabeth Merkhofer, Laura Strickhart, Matthew Walmer
Title: APRICOT: A Dataset of Physical Adversarial Attacks on Object Detection
Abstract:
Physical adversarial attacks threaten to fool object detection systems, but reproducible research on the real-world effectiveness of physical patches and how to defend against them requires a publicly available benchmark dataset. We present APRICOT, a collection of over 1,000 annotated photographs of printed adversarial patches in public locations. The patches target several object categories for three COCO-trained detection models, and the photos represent natural variation in position, distance, lighting conditions, and viewing angle. Our analysis suggests that maintaining adversarial robustness in uncontrolled settings is highly challenging, but it is still possible to produce targeted detections under white-box and sometimes black-box settings. We establish baselines for defending against adversarial patches through several methods, including a detector supervised with synthetic data and unsupervised methods such as kernel density estimation, Bayesian uncertainty, and reconstruction error. Our results suggest that adversarial patches can be effectively flagged, both in a high-knowledge, attack-specific scenario, and in an unsupervised setting where patches are detected as anomalies in natural images. This dataset and the described experiments provide a benchmark for future research on the effectiveness of and defenses against physical adversarial objects in the wild."



Paperid:921
Authors:Ankan Bansal, Yuting Zhang, Rama Chellappa
Title: Visual Question Answering on Image Sets
Abstract:
We introduce the task of Image-Set Visual Question Answering (ISVQA), which generalizes the commonly studied single-image VQA problem to multi-image settings. Taking a natural language question and a set of images as input, it aims to answer the question based on the content of the images. The questions can be about objects and relationships in one or more images or about the entire scene depicted by the image set. To enable research in this new topic, we introduce two ISVQA datasets - indoor and outdoor scenes. They simulate the real-world scenarios of indoor image collections and multiple car-mounted cameras, respectively. The indoor-scene dataset contains 91,479 human-annotated questions for 48,138 image sets, and the outdoor-scene dataset has 49,617 questions for 12,746 image sets. We analyze the properties of the two datasets, including question-and-answer distributions, types of questions, biases in the datasets, and question-image dependencies. We also build new baseline models to investigate new research challenges in ISVQA."



Paperid:922
Authors:Qi Chen, Lin Sun, Zhixin Wang, Kui Jia, Alan Yuille
Title: Object as Hotspots: An Anchor-Free 3D Object Detection Approach via Firing of Hotspots
Abstract:
Accurate 3D object detection in LiDAR based point clouds suffers from the challenges of data sparsity and irregularities. Existing methods strive to organize the points regularly, e.g. voxelize, pass them through a designed 2D/3D neural network, and then define object-level anchors that predict offsets of 3D bounding boxes using collective evidences from all the points on the objects of interest. Contrary to the state-of-the-art anchor-based methods, based on the very nature of data sparsity, we observe that even points on an individual object part are informative about semantic information of the object. We thus argue in this paper for an approach opposite to existing methods using object-level anchors. Inspired by compositional models, which represent an object as parts and their spatial relations, we propose to represent an object as composition of its interior non-empty voxels, termed hotspots, and the spatial relations of hotspots. This gives rise to the representation of Object as Hotspots (OHS). Based on OHS, we further propose an anchor-free detection head with a novel ground truth assignment strategy that deals with inter-object point-sparsity imbalance to prevent the network from biasing towards objects with more points. Experimental results show that our proposed method works remarkably well on objects with a small number of points. Notably, our approach ranked 1st on KITTI 3D Detection Benchmark for cyclist and pedestrian detection, and achieved state-of-the-art performance on NuScenes 3D Detection Benchmark."



Paperid:923
Authors:Huaiyi Huang, Yuqi Zhang, Qingqiu Huang, Zhengkui Guo, Ziwei Liu, Dahua Lin
Title: Placepedia: Comprehensive Place Understanding with Multi-Faceted Annotations
Abstract:
Place is an important element in visual understanding. Given a photo of a building, people can often tell its functionality, e.g. a restaurant or a shop, its cultural style, e.g. Asian or European, as well as its economic type, e.g. industry oriented or tourism oriented. While place recognition has been widely studied in previous work, there remains a long way towards comprehensive place understanding, which goes far beyond categorizing a place with an image and requires information of multiple aspects. In this work, we contribute Placepedia, a large-scale place dataset with more than 35M photos from 240K unique places. Besides the photos, each place also comes with massive multi-faceted information, e.g. GDP, population, etc., and labels at multiple levels, including function, city, country, etc. This dataset, with its large amount of data and rich annotations, allows various studies to be conducted. In particular, in our studies we develop 1) PlaceNet, a unified framework for multi-level place recognition, and 2) a method for city embedding, which can produce a vector representation for a city that captures both visual and multi-faceted side information. Such studies not only reveal key challenges in place understanding, but also establish connections between visual observations and the underlying socioeconomic/cultural implications."



Paperid:924
Authors:Ayan Sinha, Zak Murez, James Bartolozzi, Vijay Badrinarayanan, Andrew Rabinovich
Title: DELTAS: Depth Estimation by Learning Triangulation And densification of Sparse points
Abstract:
Multi-view stereo (MVS) is the golden mean between the accuracy of active depth sensing and the practicality of monocular depth estimation. Cost volume based approaches employing 3D convolutional neural networks (CNNs) have considerably improved the accuracy of MVS systems. However, this accuracy comes at a high computational cost which impedes practical adoption. Distinct from cost volume approaches, we propose an efficient depth estimation approach by first (a) detecting and evaluating descriptors for interest points, then (b) learning to match and triangulate a small set of interest points, and finally (c) densifying this sparse set of 3D points using CNNs. An end-to-end network efficiently performs all three steps within a deep learning framework and is trained with intermediate 2D image and 3D geometric supervision, along with depth supervision. Crucially, our first step complements pose estimation using interest point detection and descriptor learning. We demonstrate state-of-the-art results on depth estimation with lower compute for different scene lengths. Furthermore, our method generalizes to newer environments and the descriptors output by our network compare favorably to strong baselines."



Paperid:925
Authors:Yiheng Chi, Abhiram Gnanasambandam, Vladlen Koltun, Stanley H. Chan
Title: Dynamic Low-light Imaging with Quanta Image Sensors
Abstract:
Imaging in low light is difficult because the number of photons arriving at the sensor is low. Imaging dynamic scenes in low-light environments is even more difficult because as the scene moves, pixels in adjacent frames need to be aligned before they can be denoised. Conventional CMOS image sensors (CIS) are at a particular disadvantage in dynamic low-light settings because the exposure cannot be too short lest the read noise overwhelms the signal. We propose a solution using Quanta Image Sensors (QIS) and present a new image reconstruction algorithm. QIS are single-photon image sensors with photon counting capabilities. Studies over the past decade have confirmed the effectiveness of QIS for low-light imaging but reconstruction algorithms for dynamic scenes in low light remain an open problem. We fill the gap by proposing a student-teacher training protocol that transfers knowledge from a motion teacher and a denoising teacher to a student network. We show that dynamic scenes can be reconstructed from a burst of frames at a photon level of 1 photon per pixel per frame. Experimental results confirm the advantages of the proposed method compared to existing methods."



Paperid:926
Authors:Mark Nishimura, David B. Lindell, Christopher Metzler, Gordon Wetzstein
Title: Disambiguating Monocular Depth Estimation with a Single Transient
Abstract:
Monocular depth estimation algorithms successfully predict the relative depth order of objects in a scene. However, because of the fundamental scale ambiguity associated with monocular images, these algorithms fail at correctly predicting true metric depth. In this work, we demonstrate how a depth histogram of the scene, which can be readily captured using a single-pixel time-resolved detector, can be fused with the output of existing monocular depth estimation algorithms to resolve the depth ambiguity problem. We validate this novel sensor fusion technique experimentally and in extensive simulation. We show that it significantly improves the performance of several state-of-the-art monocular depth estimation algorithms."



Paperid:927
Authors:Wenyuan Zeng, Shenlong Wang, Renjie Liao, Yun Chen, Bin Yang, Raquel Urtasun
Title: DSDNet: Deep Structured self-Driving Network
Abstract:
In this paper, we propose the Deep Structured self-Driving Network (DSDNet), which performs object detection, motion prediction, and motion planning with a single neural network. Towards this goal, we develop a deep structured energy based model which considers the interactions between actors and produces socially consistent multimodal future predictions. Furthermore, DSDNet explicitly exploits the predicted future distributions of actors to plan a safe maneuver by using a structured planning cost. Our sample-based formulation allows us to overcome the difficulty in probabilistic inference of continuous random variables. Experiments on a number of large-scale self driving datasets demonstrate that our model significantly outperforms the state-of-the-art."



Paperid:928
Authors:Himalaya Jain, Spyros Gidaris, Nikos Komodakis, Patrick Pérez, Matthieu Cord
Title: QuEST: Quantized Embedding Space for Transferring Knowledge
Abstract:
Knowledge distillation refers to the process of training a student network to achieve better accuracy by learning from a pre-trained teacher network. Most of the existing knowledge distillation methods direct the student to follow the teacher by matching the teacher's output, feature maps or their distribution. In this work, we propose a novel way to achieve this goal: by distilling the knowledge through a quantized visual words space. According to our method, the teacher's feature maps are first quantized to represent the main visual concepts (i.e., visual words) encompassed in these maps and then the student is asked to predict those visual word representations. Despite its simplicity, we show that our approach is able to yield results that improve the state of the art on knowledge distillation for model compression and transfer learning scenarios. To that end, we provide an extensive evaluation across several network architectures and most commonly used benchmark datasets."
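A hedged sketch of distilling through a quantized visual-word space: teacher features are assigned to their nearest codebook entry and the student is supervised to predict that assignment. The codebook size, feature shapes, and the plain cross-entropy objective are assumptions for illustration.

```python
# Hedged sketch of quantized visual-word distillation: assign each teacher feature location to its
# nearest codebook entry ("visual word") and train the student to predict that assignment.
import torch
import torch.nn.functional as F

def quest_style_loss(teacher_feat: torch.Tensor,    # (B, C, H, W) teacher feature map (frozen)
                     student_logits: torch.Tensor,  # (B, K, H, W) student prediction over K words
                     codebook: torch.Tensor) -> torch.Tensor:   # (K, C) visual-word centers
    B, C, H, W = teacher_feat.shape
    t = teacher_feat.permute(0, 2, 3, 1).reshape(-1, C)          # (B*H*W, C)
    word_ids = torch.cdist(t, codebook).argmin(dim=1)            # nearest visual word per location
    logits = student_logits.permute(0, 2, 3, 1).reshape(-1, codebook.shape[0])
    return F.cross_entropy(logits, word_ids)

# Toy usage (a real codebook would come from, e.g., k-means on teacher features).
loss = quest_style_loss(torch.randn(2, 256, 8, 8), torch.randn(2, 128, 8, 8), torch.randn(128, 256))
```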



Paperid:929
Authors:Rongchang Zhao, Xuanlin Chen, Zailiang Chen, Shuo Li
Title: EGDCL: An Adaptive Curriculum Learning Framework for Unbiased Glaucoma Diagnosis
Abstract:
Today's computer-aided diagnosis (CAD) models are still far from the clinical practice of glaucoma detection, mainly due to training bias originating from 1) the normal-abnormal class imbalance and 2) the rare but significant hard samples in fundus images. However, debiasing in CAD is not trivial because existing methods cannot cure both types of bias when categorizing fundus images. In this paper, we propose a novel curriculum learning paradigm (EGDCL) to train an unbiased glaucoma diagnosis model with an adaptive dual-curriculum. Innovatively, the dual-curriculum is designed with the guidance of evidence maps to build a training criterion, which gradually cures the bias in the training data. In particular, the dual-curriculum emphasizes unbiased training contributions of data from easy to hard and normal to abnormal, and the dual-curriculum is optimized jointly with model parameters to obtain the optimal solution. In comparison to baselines, EGDCL significantly improves the convergence speed of the training process and obtains the top performance in the test procedure. Experimental results on challenging glaucoma datasets show that our EGDCL delivers unbiased diagnosis (sensitivity 0.9721, specificity 0.9707, AUC 0.993, F2-score 0.966) and outperforms the other methods. This endows EGDCL with a great advantage in handling unbiased CAD in clinical applications."



Paperid:930
Authors:Gukyeong Kwon, Mohit Prabhushankar, Dogancan Temel, Ghassan AlRegib
Title: Backpropagated Gradient Representations for Anomaly Detection
Abstract:
Learning representations that clearly distinguish between normal and abnormal data is key to the success of anomaly detection. Most existing anomaly detection algorithms use activation representations from forward propagation while not exploiting gradients from backpropagation to characterize data. Gradients capture the model updates required to represent data. Anomalies require more drastic model updates to fully represent them compared to normal data. Hence, we propose the utilization of backpropagated gradients as representations to characterize model behavior on anomalies and, consequently, detect such anomalies. We show that the proposed method using gradient-based representations achieves state-of-the-art anomaly detection performance on benchmark image recognition datasets. Also, we highlight the computational efficiency and the simplicity of the proposed method in comparison with other state-of-the-art methods relying on adversarial networks or autoregressive models, which require at least 27 times more model parameters than the proposed method."
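A minimal sketch of scoring anomalies with backpropagated gradients, assuming a toy autoencoder and a simple gradient-norm score; the paper's exact representation and detector may differ.

```python
# Minimal sketch: backpropagate a reconstruction loss and use the gradient magnitude as an anomaly
# score (larger model updates are needed to represent anomalous inputs). The autoencoder and the
# norm-based score are illustrative simplifications.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784))  # toy model

def gradient_anomaly_score(model: nn.Module, x: torch.Tensor) -> float:
    model.zero_grad()
    loss = nn.functional.mse_loss(model(x), x)     # reconstruction error (forward activations)
    loss.backward()                                # backpropagated gradients characterize the update
    sq = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
    return sq.sqrt().item()                        # gradient-norm score; anomalies score higher

score = gradient_anomaly_score(autoencoder, torch.rand(1, 784))
```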



Paperid:931
Authors:Ze Yang, Yinghao Xu, Han Xue, Zheng Zhang, Raquel Urtasun, Liwei Wang, Stephen Lin, Han Hu
Title: Dense RepPoints: Representing Visual Objects with Dense Point Sets
Abstract:
We present a new object representation, called Dense RepPoints, which utilizes a large number of points to describe a multi-grained object representation at both the box level and the pixel level. Techniques are proposed to efficiently process these dense points, maintaining near-constant complexity with an increasing number of points. Dense RepPoints is shown to represent and learn object segments well, via a novel distance transform sampling method combined with set-to-set supervision. The novel distance transform sampling method combines the strengths of contour and grid representations, significantly outperforming counterparts that use contour or grid representations alone. On COCO, it achieves 39.6 mask AP and 48.3 bbox AP."



Paperid:932
Authors:Xikun Zhang, Chang Xu, Dacheng Tao
Title: On Dropping Clusters to Regularize Graph Convolutional Neural Networks
Abstract:
Dropout has been widely adopted to regularize graph convolutional networks (GCNs) by randomly zeroing entries of the node feature vectors, and it obtains promising performance on various tasks. However, the information of individually zeroed entries could still be present in other correlated entries by propagating (1) spatially between entries of different node feature vectors and (2) depth-wise between different entries of each node feature vector, which essentially weakens the effectiveness of dropout. This is mainly because in a GCN, neighboring node feature vectors after linear transformations are aggregated to produce new node feature vectors in the subsequent layer. To effectively regularize GCNs, we devise DropCluster, which first randomly zeros some seed entries and then zeros entries that are spatially or depth-wise correlated to those seed entries. In this way, the information of the seed entries is thoroughly removed and cannot flow to subsequent layers via the correlated entries. We validate the effectiveness of the proposed DropCluster by comprehensively comparing it with dropout and its representative variants, such as SpatialDropout, Gaussian dropout and DropEdge, on skeleton-based action recognition."
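A hedged sketch of the cluster-dropping idea on GCN node features: random seed entries are zeroed together with spatially correlated entries (neighboring nodes, same channel) and depth-wise correlated entries (adjacent channels, same node). The exact correlation neighborhood used by DropCluster may differ; this is an illustrative approximation.

```python
# Hedged sketch of cluster dropping for GCN features: zero random seed entries plus entries
# correlated with them spatially (neighboring nodes, same channel) and depth-wise (adjacent
# channels, same node). An illustrative approximation, not DropCluster's exact rule.
import torch

def drop_cluster(x: torch.Tensor, adj: torch.Tensor, drop_prob=0.05, depth_radius=1, training=True):
    """x: (N, C) node features; adj: (N, N) binary adjacency (with self-loops)."""
    if not training:                                       # behave like dropout: identity at eval time
        return x
    seeds = (torch.rand_like(x) < drop_prob).float()       # (N, C) seed entries
    spatial = (adj.float() @ seeds).clamp(max=1.0)         # spread to neighboring nodes, same channel
    kernel = torch.ones(1, 1, 2 * depth_radius + 1)
    depth = torch.nn.functional.conv1d(seeds.unsqueeze(1), kernel,
                                       padding=depth_radius).squeeze(1).clamp(max=1.0)
    mask = 1.0 - (seeds + spatial + depth).clamp(max=1.0)  # union of seed + correlated entries
    keep = mask.mean().clamp(min=1e-6)
    return x * mask / keep                                 # rescale as in inverted dropout

x_reg = drop_cluster(torch.randn(10, 16), torch.eye(10))
```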



Paperid:933
Authors:Mrigank Rochan, Mahesh Kumar Krishna Reddy, Linwei Ye, Yang Wang
Title: Adaptive Video Highlight Detection by Learning from User History
Abstract:
Recently, there is an increasing interest in highlight detection research where the goal is to create a short duration video from a longer video by extracting its interesting moments. However, most existing methods ignore the fact that the definition of video highlight is highly subjective. Different users may have different preferences of highlight for the same input video. In this paper, we propose a simple yet effective framework that learns to adapt highlight detection to a user by exploiting the user's history in the form of highlights that the user has previously created. Our framework consists of two sub-networks: a fully temporal convolutional highlight detection network $H$ that predicts highlight for an input video and a history encoder network $M$ for user history. We introduce a newly designed temporal-adaptive instance normalization (T-AIN) layer to $H$ where the two sub-networks interact with each other. T-AIN has affine parameters that are predicted from $M$ based on the user history and is responsible for the user-adaptive signal to $H$. Extensive experiments on a large-scale dataset show that our framework can make more accurate and user-specific highlight predictions."
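A minimal sketch of a T-AIN-style layer: features are instance-normalized along time and re-scaled with affine parameters predicted from a user-history embedding. Tensor shapes and the history encoder are illustrative assumptions.

```python
# Minimal sketch of temporal-adaptive instance normalization (T-AIN): normalize features along time,
# then apply affine parameters predicted from a user-history embedding (shapes are assumptions).
import torch
import torch.nn as nn

class TAIN(nn.Module):
    def __init__(self, channels: int, history_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(history_dim, channels)
        self.to_beta = nn.Linear(history_dim, channels)

    def forward(self, feat: torch.Tensor, history_emb: torch.Tensor) -> torch.Tensor:
        """feat: (B, C, T) temporal features from H; history_emb: (B, D) from the history encoder M."""
        mean = feat.mean(dim=2, keepdim=True)
        std = feat.std(dim=2, keepdim=True) + 1e-5
        normed = (feat - mean) / std                          # instance normalization over time
        gamma = self.to_gamma(history_emb).unsqueeze(-1)      # (B, C, 1) user-adaptive scale
        beta = self.to_beta(history_emb).unsqueeze(-1)        # (B, C, 1) user-adaptive shift
        return gamma * normed + beta

out = TAIN(channels=256, history_dim=128)(torch.randn(2, 256, 64), torch.randn(2, 128))
```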



Paperid:934
Authors:Shuyang Cheng, Zhaoqi Leng, Ekin Dogus Cubuk, Barret Zoph, Chunyan Bai, Jiquan Ngiam, Yang Song, Benjamin Caine, Vijay Vasudevan, Congcong Li, Quoc V. Le, Jonathon Shlens, Dragomir Anguelov
Title: Improving 3D Object Detection through Progressive Population Based Augmentation
Abstract:
Data augmentation has been widely adopted for object detection in 3D point clouds. However, all previous related efforts have focused on manually designing specific data augmentation methods for individual architectures. In this work, we present the first attempt to automate the design of data augmentation policies for 3D object detection. We introduce the Progressive Population Based Augmentation (PPBA) algorithm, which learns to optimize augmentation strategies by narrowing down the search space and adopting the best parameters discovered in previous iterations. On the KITTI 3D detection test set, PPBA improves the StarNet detector by substantial margins on the moderate difficulty category of cars, pedestrians, and cyclists, outperforming all current state-of-the-art single-stage detection models. Additional experiments on the Waymo Open Dataset indicate that PPBA continues to effectively improve the StarNet and PointPillars detectors on a 20x larger dataset compared to KITTI. The magnitude of the improvements may be comparable to advances in 3D perception architectures and the gains come without an incurred cost at inference time. In subsequent experiments, we find that PPBA may be up to 10x more data efficient than baseline 3D detection models without augmentation, highlighting that 3D detection models may achieve competitive accuracy with far fewer labeled examples."



Paperid:935
Authors:Jiongchao Jin, Akshay Gadi Patil, Zhang Xiong, Hao Zhang
Title: DR-KFS: A Differentiable Visual Similarity Metric for 3D Shape Reconstruction
Abstract:
We introduce a differentiable visual similarity metric to train deep neural networks for 3D reconstruction, aimed at improving reconstruction quality. The metric compares two 3D shapes by measuring distances between multi-view images differentiably rendered from the shapes. Importantly, the image-space distance is also differentiable and measures visual similarity, rather than pixel-wise distortion. Specifically, the similarity is defined by mean-squared errors over HardNet features computed from probabilistic keypoint maps of the compared images. Our differentiable visual shape similarity metric can be easily plugged into various 3D reconstruction networks, replacing their distortion-based losses, such as Chamfer or Earth Mover distances, so as to optimize the network weights to produce reconstructions with better structural fidelity and visual quality. We demonstrate this both objectively, using well-known shape metrics for retrieval and classification tasks that are independent of our new metric, and subjectively through a perceptual study."



Paperid:936
Authors:Xuefeng Hu, Zhihan Zhang, Zhenye Jiang, Syomantak Chaudhuri, Zhenheng Yang, Ram Nevatia
Title: SPAN: Spatial Pyramid Attention Network for Image Manipulation Localization
Abstract:
Techniques for manipulating images are advancing rapidly; while these are helpful for many useful tasks, they also pose a threat to society with their ability to create believable misinformation. We present a novel Spatial Pyramid Attention Network (SPAN) for detection and localization of multiple types of image manipulations. The proposed architecture efficiently and effectively models the relationship between image patches at multiple scales by constructing a pyramid of local self-attention blocks. The design includes a novel position projection to encode the spatial positions of the patches. SPAN is trained on a synthetic dataset but can also be fine-tuned for specific datasets. The proposed method shows significant gains in performance on standard datasets over previous state-of-the-art methods."



Paperid:937
Authors:Jinghua Wang, Jianmin Jiang
Title: Adversarial Learning for Zero-shot Domain Adaptation
Abstract:
Zero-shot domain adaptation (ZSDA) is a category of domain adaptation problems where neither data samples nor labels are available for parameter learning in the target domain. With the hypothesis that the shift between a given pair of domains is shared across tasks, we propose a new method for ZSDA by transferring domain shift from an irrelevant task (IrT) to the task of interest (ToI). Specifically, we first identify an IrT, where dual-domain samples are available, and capture the domain shift with a coupled generative adversarial network (CoGAN) in this task. Then, we train a CoGAN for the ToI and restrict it to carry the same domain shift as the CoGAN for the IrT does. In addition, we introduce a pair of co-training classifiers to regularize the training procedure of the CoGAN in the ToI. The proposed method not only derives machine learning models for the unavailable target-domain data, but also synthesizes the data itself. We evaluate the proposed method on benchmark datasets and achieve state-of-the-art performance."



Paperid:938
Authors:Yukihiro Sasagawa, Hajime Nagahara
Title: YOLO in the Dark - Domain Adaptation Method for Merging Multiple Models -
Abstract:
Addressing new visual tasks requires additional datasets, which cost considerable effort to create. We propose a new method of domain adaptation for merging multiple models with less effort than creating an additional dataset. This method merges pre-trained models from different domains using glue layers and a generative model, which feeds latent features to train the glue layers without an additional dataset. We also propose creating the generative model by knowledge distillation from the pre-trained models, which also allows reusing their original dataset to create latent features for training the glue layers. We apply this method to object detection in a low-light situation. The “YOLO in the Dark” model contains two models, “Learning to See in the Dark” and YOLO. We report the new method and the result of domain adaptation that detects objects from raw short-exposure low-light images. The “YOLO in the Dark” requires fewer computing resources than the naive approach."



Paperid:939
Authors:Jae Sung Park, Trevor Darrell, Anna Rohrbach
Title: Identity-Aware Multi-Sentence Video Description
Abstract:
Standard video and movie description tasks abstract away from person identities, thus failing to link identities across sentences. We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires re-identifying persons locally within a set of consecutive clips. We introduce an auxiliary task, Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips when the video descriptions are given. Our proposed approach to this task leverages a Transformer architecture allowing for coherent joint prediction of multiple IDs. One of the key components is a gender-aware textual representation as well as an additional gender prediction objective in the main model. This auxiliary task allows us to propose a two-stage approach to Identity-Aware Video Description. We first generate multi-sentence video descriptions, and then apply our Fill-in the Identity model to establish links between the predicted person entities. To be able to tackle both tasks, we augment the Large Scale Movie Description Challenge (LSMDC) benchmark with new annotations suited for our problem statement. Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works, and allows us to generate descriptions with locally re-identified people."



Paperid:940
Authors:Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang
Title: VQA-LOL: Visual Question Answering under the Lens of Logic
Abstract:
Logical connectives and their implications on the meaning of a natural language sentence are a fundamental aspect of understanding. In this paper, we investigate whether visual question answering (VQA) systems trained to answer a question about an image, are able to answer the logical composition of multiple such questions. When put under this Lens of Logic, state-of-the-art VQA models have difficulty in correctly answering these logically composed questions. We construct an augmentation of the VQA dataset as a benchmark, with questions containing logical compositions and linguistic transformations (negation, disjunction, conjunction, and antonyms). We propose our Lens of Logic (LOL) model, which uses question-attention and logic-attention to understand logical connectives in the question, and a novel Fréchet-Compatibility Loss, which ensures that the answers of the component questions and the composed question are consistent with the inferred logical operation. Our model shows substantial improvement in learning logical compositions while retaining performance on VQA. We suggest this work as a move towards robustness by embedding logical connectives in visual understanding."



Paperid:941
Authors:Mengyao Zhai, Lei Chen, Jiawei He, Megha Nawhal, Frederick Tung, Greg Mori
Title: Piggyback GAN: Efficient Lifelong Learning for Image Conditioned Generation
Abstract:
Humans accumulate knowledge in a lifelong fashion. Modern deep neural networks, on the other hand, are susceptible to catastrophic forgetting: when adapted to perform new tasks, they often fail to preserve their performance on previously learned tasks. Given a sequence of tasks, a naive approach addressing catastrophic forgetting is to train a separate standalone model for each task, which scales the total number of parameters drastically without efficiently utilizing previous models. In contrast, we propose a parameter efficient framework, Piggyback GAN, which learns the current task by building a set of convolutional and deconvolutional filters that are factorized into filters of the models trained on previous tasks. For the current task, our model achieves high generation quality on par with a standalone model at a lower number of parameters. For previous tasks, our model can also preserve generation quality since the filters for previous tasks are not altered. We validate Piggyback GAN on various image-conditioned generation tasks across different domains, and provide qualitative and quantitative results to show that the proposed approach can address catastrophic forgetting effectively and efficiently."



Paperid:942
Authors:Xiaofeng Yang, Guosheng Lin, Fengmao Lv, Fayao Liu
Title: TRRNet: Tiered Relation Reasoning for Compositional Visual Question Answering
Abstract:
Compositional visual question answering requires reasoning over both semantic and geometry object relations. We propose a novel tiered reasoning method that dynamically selects object level candidates based on language representations and generates robust pairwise relations within the selected candidate objects. The proposed tiered relation reasoning method can be compatible with the majority of the existing visual reasoning frameworks, leading to significant performance improvement with very little extra computational cost. Moreover, we propose a policy network that decides the appropriate reasoning steps based on question complexity and current reasoning status. In experiments, our model achieves state-of-the-art performance on two VQA datasets. "



Paperid:943
Authors:Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao
Title: Mining Inter-Video Proposal Relations for Video Object Detection
Abstract:
Recent studies have shown that aggregating context information from proposals in different frames can clearly enhance the performance of video object detection. However, these approaches mainly exploit intra-video proposal relations within a single video, while ignoring inter-video proposal relations among different videos, which can provide important discriminative cues for recognizing confusing objects. To address this limitation, we propose a novel Inter-Video Proposal Relation module. Based on a concise multi-level triplet selection scheme, this module can learn effective object representations via modeling relations of hard proposals among different videos. Moreover, we design a Hierarchical Video Relation Network (HVR-Net), by integrating intra-video and inter-video proposal relations in a hierarchical fashion. This design can progressively exploit both intra and inter contexts to boost video object detection. We examine our method on the large-scale video object detection benchmark, i.e., ImageNet VID, where HVR-Net achieves state-of-the-art results. Code and models will be released."



Paperid:944
Authors:Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
Title: TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval
Abstract:
We introduce TV show Retrieval (TVR), a new multimodal retrieval dataset. TVR requires systems to understand both videos and their associated subtitle (dialogue) texts, making it more realistic. The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres, where each query is associated with a tight temporal window. The queries are also labeled with query types that indicate whether each of them is more related to video or subtitle or both, allowing for in-depth analysis of the dataset and the methods built on top of it. Strict qualification and post-annotation verification tests are applied to ensure the quality of the collected data. Additionally, we present several baselines and a novel Cross-modal Moment Localization (XML) network for multimodal moment retrieval tasks. The proposed XML model uses a late fusion design with a novel Convolutional StartEnd detector (ConvSE), surpassing baselines by a large margin and with better efficiency, providing a strong starting point for future work."
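As an illustration of the ConvSE idea named above, a hedged sketch of 1D convolutional start/end scorers applied to a temporal feature (or query-clip similarity) sequence; the exact filter design used in the paper may differ:

```python
import torch.nn as nn

class ConvStartEnd(nn.Module):
    """Sketch: convolutional detectors that score candidate start and end positions."""
    def __init__(self, channels, kernel=5):
        super().__init__()
        self.start = nn.Conv1d(channels, 1, kernel, padding=kernel // 2)
        self.end = nn.Conv1d(channels, 1, kernel, padding=kernel // 2)

    def forward(self, feats):                 # feats: (N, C, T)
        # returns per-position start and end scores of shape (N, T)
        return self.start(feats).squeeze(1), self.end(feats).squeeze(1)
```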



Paperid:945
Authors:Ying Jin, Ximei Wang, Mingsheng Long(), Jianmin Wang
Title: Minimum Class Confusion for Versatile Domain Adaptation
Abstract:
Domain Adaptation (DA) transfers a learning model from a labeled source domain to an unlabeled target domain that follows a different distribution. There are a variety of DA scenarios subject to label sets and domain configurations, including closed-set and partial-set DA, as well as multi-source and multi-target DA. It is notable that existing DA methods are generally designed only for a specific scenario and may underperform in scenarios they are not tailored to. In this paper, we propose Versatile Domain Adaptation (VDA), where one method can handle several different scenarios at the same time. To handle it, a more general inductive bias than domain alignment should be explored. We delve into a missing piece of existing methods: class confusion, the tendency that a classifier confuses the predictions between the correct and ambiguous classes for target examples, which exists in different scenarios. We unveil that reducing such pair-wise class confusion brings about significant transfer gains. Based on this, we propose a general loss function: Minimum Class Confusion (MCC). It can be characterized by (1) a non-adversarial DA method without explicitly deploying domain alignment, enjoying fast convergence speed; (2) a versatile approach that can handle the four existing scenarios: Closed-Set, Partial-Set, Multi-Source, and Multi-Target DA, outperforming the state-of-the-art methods in these scenarios, especially on the largest and hardest dataset to date (7.3% on DomainNet). Strong performance in the two scenarios proposed in this paper, Multi-Source Partial and Multi-Target Partial DA, further proves its versatility. In addition, it can also be used as a general regularizer that is orthogonal and complementary to a variety of existing DA methods, accelerating convergence and pushing those readily competitive methods to a stronger level. Code is released at https://github.com/thuml/MCC."
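A simplified sketch of a class-confusion penalty in the spirit of MCC, assuming temperature-scaled predictions on a target-domain batch; the paper's uncertainty-based example re-weighting is omitted here:

```python
import torch
import torch.nn.functional as F

def mcc_loss(logits, temperature=2.5):
    """Simplified Minimum-Class-Confusion-style loss on target-domain logits."""
    probs = F.softmax(logits / temperature, dim=1)         # (B, K) class probabilities
    confusion = probs.t() @ probs                           # (K, K) class confusion matrix
    confusion = confusion / confusion.sum(dim=1, keepdim=True)  # row-normalize
    K = logits.size(1)
    # penalize the off-diagonal (between-class) confusion mass
    return (confusion.sum() - confusion.diag().sum()) / K
```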



Paperid:946
Authors:Tong Wang, Yousong Zhu, Chaoyang Zhao, Wei Zeng, Yaowei Wang, Jinqiao Wang, Ming Tang
Title: Large Batch Optimization for Object Detection: Training COCO in 12 Minutes
Abstract:
Most existing object detectors adopt a small training batch size (~16), which severely hinders the whole community from exploring large-scale datasets due to the extremely long training procedure. In this paper, we propose a versatile large batch optimization framework for object detection, named LargeDet, which successfully scales the batch size to larger than 1K for the first time. Specifically, we present a novel Periodical Moments Decay LAMB (PMD-LAMB) algorithm to effectively reduce the negative effects of the lagging historical gradients. Additionally, Synchronized Batch Normalization (SyncBN) is utilized to help fast convergence. With LargeDet, we can not only prominently shorten the training period, but also significantly improve the detection accuracy of sparsely annotated large-scale datasets. For instance, we can finish the training of a ResNet50 FPN detector on COCO within 12 minutes. Moreover, we achieve a 12.2% mAP@0.5 absolute improvement for ResNet50 FPN on Open Images by training with batch size 640."



Paperid:947
Authors:K. Ram Prabhakar, Susmit Agrawal, Durgesh Kumar Singh, Balraj Ashwath , R. Venkatesh Babu
Title: Towards Practical and Efficient High-Resolution HDR Deghosting with CNN
Abstract:
Generating a High Dynamic Range (HDR) image in the presence of camera and object motion is a tedious task. If uncorrected, these motions will manifest as ghosting artifacts in the fused HDR image. On one end of the spectrum, there exist methods that generate high-quality results but are computationally demanding and slow. On the other end, there are a few faster methods that produce unsatisfactory results. With ever-increasing sensor and display resolutions, we are very much in need of faster methods that produce high-quality images. In this paper, we present a deep neural network based approach to generate high-quality ghost-free HDR for high-resolution images. Our proposed method is fast and fuses a sequence of three high-resolution images (16-megapixel resolution) in about 10 seconds. Through experiments and ablations on different publicly available datasets, we show that the proposed method achieves state-of-the-art performance in terms of accuracy and speed."



Paperid:948
Authors:Deniz Beker, Hiroharu Kato, Mihai Adrian Morariu, Takahiro Ando, Toru Matsuoka, Wadim Kehl, Adrien Gaidon
Title: Monocular Differentiable Rendering for Self-Supervised 3D Object Detection
Abstract:
3D object detection from monocular images is an ill-posed problem due to the projective entanglement of depth and scale. To overcome this ambiguity, we present a novel self-supervised method for textured 3D shape reconstruction and pose estimation of rigid objects with the help of strong shape priors and 2D instance masks. Our method predicts the 3D location and meshes of each object in an image using differentiable rendering and a self-supervised objective derived from a pretrained monocular depth estimation network. We use the KITTI 3D object detection dataset to evaluate the accuracy of the method. Experiments demonstrate that we can effectively use noisy monocular depth and differentiable rendering as an alternative to expensive 3D ground-truth labels or LiDAR information."



Paperid:949
Authors:Meng Tian, Marcelo H Ang Jr, Gim Hee Lee
Title: Shape Prior Deformation for Categorical 6D Object Pose and Size Estimation
Abstract:
We present a novel learning approach to recover the 6D poses and sizes of unseen object instances from an RGB-D image. To handle the intra-class shape variation, we propose a deep network to reconstruct the 3D object model by explicitly modeling the deformation from a pre-learned categorical shape prior. Additionally, our network infers the dense correspondences between the depth observation of the object instance and the reconstructed 3D model to jointly estimate the 6D object pose and size. We design an autoencoder that trains on a collection of object models and compute the mean latent embedding for each category to learn the categorical shape priors. Extensive experiments on both synthetic and real-world datasets demonstrate that our approach significantly outperforms the state of the art. Our code is available at https://github.com/mentian/object-deformnet."



Paperid:950
Authors:Chaofan Tao, Qinhong Jiang, Lixin Duan, Ping Luo
Title: Dynamic and Static Context-aware LSTM for Multi-agent Motion Prediction
Abstract:
Multi-agent motion prediction is challenging because it aims to foresee the future trajectories of multiple agents (e.g., pedestrians) simultaneously in a complicated scene. Existing work addressed this challenge either by learning social spatial interactions represented by the positions of a group of pedestrians, while ignoring their temporal coherence (i.e., dependencies between different long trajectories), or by understanding the complicated scene layout (e.g., scene segmentation) to ensure safe navigation. However, unlike previous work that isolated the spatial interaction, temporal coherence, and scene layout, this paper designs a new mechanism, i.e., the Dynamic and Static Context-aware Motion Predictor (DSCMP), to integrate this rich information into long short-term memory (LSTM). It has three appealing benefits. (1) DSCMP models the dynamic interactions between agents by learning both their spatial positions and temporal coherence, as well as understanding the contextual scene layout. (2) Different from previous LSTM models that predict motions by propagating hidden features frame by frame, limiting the capacity to learn correlations between long trajectories, we carefully design a differentiable queue mechanism in DSCMP, which is able to explicitly memorize and learn the correlations between long trajectories. (3) DSCMP captures the scene context by inferring a latent variable, which enables multimodal predictions with meaningful semantic scene layout. Extensive experiments show that DSCMP outperforms state-of-the-art methods by large margins, with 9.05% and 7.62% relative improvements on the ETH-UCY and SDD datasets, respectively."



Paperid:951
Authors:Xu Zhong, Elaheh ShafieiBavani, Antonio Jimeno Yepes
Title: Image-based table recognition: data, model, and evaluation
Abstract:
Important information that relates to a specific topic in a document is often organized in tabular format to assist readers with information retrieval and comparison, which may be difficult to provide in natural language. However, tabular data in unstructured digital documents, e.g., Portable Document Format (PDF) and images, are difficult to parse into structured machine-readable format, due to complexity and diversity in their structure and style. To facilitate image-based table recognition with deep learning, we develop and release the largest publicly available table recognition dataset PubTabNet, containing 568k table images with corresponding structured HTML representation. PubTabNet is automatically generated by matching the XML and PDF representations of the scientific articles in PubMed Central Open Access Subset (PMCOA). We also propose a novel attention-based encoder-dual-decoder (EDD) architecture that converts images of tables into HTML code. The model has a structure decoder which reconstructs the table structure and helps the cell decoder to recognize cell content. In addition, we propose a new Tree-Edit-Distance-based Similarity (TEDS) metric for table recognition, which more appropriately captures multi-hop cell misalignment and OCR errors than the pre-established metric. The experiments demonstrate that the EDD model can accurately recognize complex tables solely relying on the image representation, outperforming the state-of-the-art by 9.7% absolute TEDS score."
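The TEDS metric named in the abstract normalizes a tree edit distance between the two table trees by the larger tree size; a minimal sketch of that formula (computing the tree edit distance itself is assumed to be done elsewhere):

```python
def teds(edit_distance: float, size_a: int, size_b: int) -> float:
    """TEDS(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|).

    edit_distance: tree edit distance between the HTML table trees Ta and Tb
    size_a, size_b: number of nodes in Ta and Tb
    """
    return 1.0 - edit_distance / max(size_a, size_b)
```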



Paperid:952
Authors:Junwen Chen, Wentao Bao,, Yu Kong
Title: Group Activity Prediction with Sequential Relational Anticipation Model
Abstract:
In this paper, we propose a novel approach to predict group activities given the beginning frames with incomplete activity executions. Existing action prediction approaches learn to enhance the representation power of the partial observation. However, for group activity prediction, the relation evolution of people’s activity and their positions over time is an important cue for predicting group activity. To this end, we propose a sequential relational anticipation model (SRAM) that summarizes the relational dynamics in the partial observation and progressively anticipates the group representations with rich discriminative information. Our model explicitly anticipates both activity features and positions by two graph auto-encoders, aiming to learn a discriminative group representation for group activity prediction. Experimental results on two popularly used datasets demonstrate that our approach significantly outperforms the state-of-the-art activity prediction methods."



Paperid:953
Authors:Haoran Song, Wenchao Ding, Yuxuan Chen, Shaojie Shen, Michael Yu Wang, Qifeng Chen
Title: PiP: Planning-informed Trajectory Prediction for Autonomous Driving
Abstract:
It is critical to predict the motion of surrounding vehicles for self-driving planning, especially in a socially compliant and flexible way. However, future prediction is challenging due to the interaction and uncertainty in driving behaviors. We propose planning-informed trajectory prediction (PiP) to tackle the prediction problem in the multi-agent setting. Our approach is differentiated from the traditional manner of prediction, which is based only on historical information and decoupled from planning. By informing the prediction process with the planning of the ego vehicle, our method achieves state-of-the-art performance of multi-agent forecasting on highway datasets. Moreover, our approach enables a novel pipeline which couples the prediction and planning, by conditioning PiP on multiple candidate trajectories of the ego vehicle, which is highly beneficial for autonomous driving in interactive scenarios."



Paperid:954
Authors:Duo Li, Anbang Yao, Qifeng Chen
Title: PSConv: Squeezing Feature Pyramid into One Compact Poly-Scale Convolutional Layer
Abstract:
Despite their strong modeling capacities, Convolutional Neural Networks (CNNs) are often scale-sensitive. To enhance the robustness of CNNs to scale variance, multi-scale feature fusion from different layers or filters attracts great attention among existing solutions, while the more granular kernel space is overlooked. We bridge this gap by exploiting multi-scale features at a finer granularity. The proposed convolution operation, named Poly-Scale Convolution (PSConv), mixes up a spectrum of dilation rates and tactfully allocates them to the individual convolutional kernels of each filter within a single convolutional layer. Specifically, dilation rates vary cyclically along the axes of the input and output channels of the filters, aggregating features over a wide range of scales in a neat style. PSConv could be a drop-in replacement of the vanilla convolution in many prevailing CNN backbones, allowing better representation learning without introducing additional parameters or computational complexity. Comprehensive experiments on the ImageNet and MS COCO benchmarks validate the superior performance of PSConv. Code and models are available at https://github.com/d-li14/PSConv"
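A hedged, coarse-grained sketch of the poly-scale idea: channel groups of one convolutional layer use different dilation rates, so features at several scales are aggregated inside a single layer. The paper's allocation is cyclic at the level of individual kernels, so this group-wise variant is only an approximation:

```python
import torch
import torch.nn as nn

class PolyScaleConv2d(nn.Module):
    """Sketch: one layer whose channel groups use different dilation rates."""
    def __init__(self, in_ch, out_ch, k=3, dilations=(1, 2, 4, 1)):
        super().__init__()
        assert in_ch % len(dilations) == 0 and out_ch % len(dilations) == 0
        self.split = in_ch // len(dilations)
        self.convs = nn.ModuleList(
            nn.Conv2d(self.split, out_ch // len(dilations), k,
                      padding=d * (k // 2), dilation=d, bias=False)
            for d in dilations)

    def forward(self, x):
        chunks = torch.split(x, self.split, dim=1)
        return torch.cat([conv(c) for conv, c in zip(self.convs, chunks)], dim=1)
```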



Paperid:955
Authors:Zhao-Min Chen, Xin Jin, Borui Zhao, Xiu-Shen Wei, Yanwen Guo
Title: Hierarchical Context Embedding for Region-based Object Detection
Abstract:
State-of-the-art two-stage object detectors apply a classifier to a sparse set of object proposals, relying on region-wise features extracted by RoIPool or RoIAlign as inputs. The region-wise features, in spite of aligning well with the proposal locations, may still lack the crucial context information which is necessary for filtering out noisy background detections, as well as recognizing objects possessing no distinctive appearances. To address this issue, we present a simple but effective Hierarchical Context Embedding (HCE) framework, which can be applied as a plug-and-play component, to facilitate the classification ability of a series of region-based detectors by mining contextual cues. Specifically, to advance the recognition of context-dependent object categories, we propose an image-level categorical embedding module which leverages the holistic image-level context to learn object-level concepts. Then, novel RoI features are generated by exploiting hierarchically embedded context information beneath both whole images and regions of interest, which are also complementary to conventional RoI features. Moreover, to make full use of our hierarchical contextual RoI features, we propose the early-and-late fusion strategies (i.e., feature fusion and confidence fusion), which can be combined to boost the classification accuracy of region-based detectors. Comprehensive experiments demonstrate that our HCE framework is flexible and generalizable, leading to significant and consistent improvements upon various region-based detectors, including FPN, Cascade R-CNN and Mask R-CNN."



Paperid:956
Authors:Jin Ye, Junjun He, Xiaojiang Peng, Wenhao Wu, Yu Qiao
Title: Attention-Driven Dynamic Graph Convolutional Network for Multi-Label Image Recognition
Abstract:
Recent studies often exploit Graph Convolutional Networks (GCNs) to model label dependencies to improve recognition accuracy for multi-label image recognition. However, constructing a graph by counting the label co-occurrence possibilities of the training data may degrade model generalizability, especially when there exist occasional co-occurring objects in test images. Our goal is to eliminate such bias and enhance the robustness of the learnt features. To this end, we propose an Attention-Driven Dynamic Graph Convolutional Network (ADD-GCN) to dynamically generate a specific graph for each image. ADD-GCN adopts a Dynamic Graph Convolutional Network (D-GCN) to model the relation of content-aware category representations that are generated by a Semantic Attention Module (SAM). Extensive experiments on public multi-label benchmarks demonstrate the effectiveness of our method, which achieves mAPs of 85.2%, 96.0%, and 95.5% on MS-COCO, VOC2007, and VOC2012, respectively, and outperforms current state-of-the-art methods by a clear margin."
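A hedged sketch of a dynamic graph convolution layer in which the adjacency is computed per image from the content-aware category representations themselves instead of a fixed co-occurrence graph; the class name and attention form are assumptions, and ADD-GCN's actual D-GCN design may differ:

```python
import torch
import torch.nn as nn

class DynamicGCNLayer(nn.Module):
    """Sketch: graph convolution with an image-specific, content-derived adjacency."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.update = nn.Linear(dim, dim)

    def forward(self, v):                       # v: (N, num_classes, dim) category features
        # dynamic adjacency from pairwise similarity of category representations
        adj = torch.softmax(self.q(v) @ self.k(v).transpose(1, 2) / v.size(-1) ** 0.5, dim=-1)
        return torch.relu(self.update(adj @ v)) # message passing over the dynamic graph
```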



Paperid:957
Authors:Yuliang Guo, Guang Chen, Peitao Zhao, Weide Zhang, Jinghao Miao, Jingao Wang, Tae Eun Choe
Title: Gen-LaneNet: A Generalized and Scalable Approach for 3D Lane Detection
Abstract:
We present a generalized and scalable method, called Gen-LaneNet, to detect 3D lanes from a single image. The method, inspired by the latest state-of-the-art 3D-LaneNet, is a unified framework solving image encoding, spatial transform of features and 3D lane prediction in a single network. However, we propose two unique designs for Gen-LaneNet. First, we introduce a new geometry-guided lane anchor representation in a new coordinate frame and apply a specific geometric transformation to directly calculate real 3D lane points from the network output. We demonstrate that aligning the lane points with the underlying top-view features in the new coordinate frame is critical towards a generalized method in handling unfamiliar scenes. Second, we present a scalable two-stage framework that decouples the learning of the image segmentation subnetwork and the geometry encoding subnetwork. Compared to 3D-LaneNet, the proposed Gen-LaneNet drastically reduces the amount of 3D lane labels required to achieve a robust solution in real-world applications. Moreover, we release a new synthetic dataset and its construction strategy to encourage the development and evaluation of 3D lane detection methods. In experiments, we conduct an extensive ablation study to substantiate that the proposed Gen-LaneNet significantly outperforms 3D-LaneNet in average precision (AP) and F-measure."



Paperid:958
Authors:Xin Xiong, Haipeng Xiong, Ke Xian, Chen Zhao, Zhiguo Cao, Xin Li
Title: Sparse-to-Dense Depth Completion Revisited: Sampling Strategy and Graph Construction
Abstract:
Depth completion is a widely studied problem of predicting a dense depth map from a sparse set of measurements and a single RGB image. In this work, we approach this problem by addressing two issues that have been under-researched in the open literature: sampling strategy (data term) and graph construction (prior term). First, instead of the popular random sampling strategy, we suggest that Poisson disk sampling is a much more effective solution to create a sparse depth map from a dense version. We experimentally compare a class of quasi-random sampling strategies and demonstrate that an optimized sampling strategy can significantly improve the performance of depth completion for the same number of sparse samples. Second, instead of the traditional square kernel, we suggest that dynamic construction of the local neighborhood is a better choice for interpolating the missing values. More specifically, we propose an end-to-end network with a graph convolution module. Since the neighborhood relationship of 3D points is more effectively exploited by our novel graph convolution module, our approach has achieved not only state-of-the-art results for depth completion of indoor scenes but also better generalization ability than other competing methods."
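To illustrate the sampling-strategy point, a minimal dart-throwing sketch that builds a Poisson-disk-style sampling mask over a dense depth map; a grid-accelerated Bridson sampler would be the practical choice, and the parameter names here are illustrative:

```python
import numpy as np

def poisson_disk_mask(h, w, radius, n_trials=30000, seed=0):
    """Sketch: boolean mask of sparse samples that are at least `radius` pixels apart."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((h, w), dtype=bool)
    accepted = []
    for _ in range(n_trials):
        y, x = rng.integers(0, h), rng.integers(0, w)
        # accept the candidate only if it is far enough from all accepted samples
        if all((y - ay) ** 2 + (x - ax) ** 2 >= radius ** 2 for ay, ax in accepted):
            accepted.append((y, x))
            mask[y, x] = True
    return mask

# usage sketch: sparse_depth = dense_depth * poisson_disk_mask(*dense_depth.shape, radius=8)
```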



Paperid:959
Authors:Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wayne Wu, Chen Qian, Ran He, Yu Qiao, Chen Change Loy
Title: MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation
Abstract:
The synthesis of natural emotional reactions is an essential criterion in vivid talking-face video generation. This criterion is nevertheless seldom taken into consideration in previous works due to the absence of a large-scale, high-quality emotional audio-visual dataset. To address this issue, we build the Multi-view Emotional Audio-visual Dataset (MEAD), which is a talking-face video corpus featuring 60 actors and actresses talking with 8 different emotions at 3 different intensity levels. High-quality audio-visual clips are captured at 7 different view angles in a strictly-controlled environment. Together with the dataset, we release an emotional talking-face generation baseline which enables the manipulation of both emotion and its intensity. Our dataset will be made public and could benefit a number of different research fields including conditional generation, cross-modal understanding and expression recognition."



Paperid:960
Authors:Dong-Jin Kim, Xiao Sun, Jinsoo Choi, Stephen Lin, In So Kweon
Title: Detecting Human-Object Interactions with Action Co-occurrence Priors
Abstract:
A common problem in the human-object interaction (HOI) detection task is that numerous HOI classes have only a small number of labeled examples, resulting in training sets with a long-tailed distribution. The lack of positive labels can lead to low classification accuracy for these classes. To address this issue, we observe that there exist natural correlations and anti-correlations among human-object interactions. In this paper, we model the correlations as action co-occurrence matrices and present techniques to learn these priors and leverage them for more effective training, especially for rare classes. The utility of our approach is demonstrated experimentally, where the performance of our approach exceeds state-of-the-art methods on both leading HOI detection benchmark datasets, HICO-Det and V-COCO."
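A minimal sketch of estimating an action co-occurrence prior from binary HOI action annotations as a conditional probability matrix; how such priors are injected into training follows the paper and is not reproduced here:

```python
import numpy as np

def cooccurrence_matrix(labels, eps=1e-8):
    """Sketch: P(action j | action i) estimated from training labels.

    labels: (N, num_actions) binary matrix of action labels per human-object pair
    """
    labels = np.asarray(labels, dtype=np.float64)
    joint = labels.T @ labels                # joint[i, j] = # pairs with both actions i and j
    marginal = labels.sum(axis=0) + eps      # marginal[i]  = # pairs with action i
    return joint / marginal[:, None]         # row i holds probabilities conditioned on action i
```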



Paperid:961
Authors:Kun Yuan, Quanquan Li, Jing Shao, Junjie Yan
Title: Learning Connectivity of Neural Networks from a Topological Perspective
Abstract:
Seeking effective neural networks is a critical and practical field in deep learning. Besides designing the depth, type of convolution, normalization, and nonlinearities, the topological connectivity of neural networks is also important. Previous principles of rule-based modular design simplify the difficulty of building an effective architecture, but constrain the possible topologies to limited spaces. In this paper, we attempt to optimize the connectivity in neural networks. We propose a topological perspective that represents a network as a complete graph for analysis, where nodes carry out aggregation and transformation of features, and edges determine the flow of information. By assigning learnable parameters to the edges, which reflect the magnitude of connections, the learning process can be performed in a differentiable manner. We further attach an auxiliary sparsity constraint to the distribution of connectedness, which encourages the learned topology to focus on critical connections. This learning process is compatible with existing networks and adapts to larger search spaces and different tasks. Quantitative experimental results show that the learned connectivity is superior to traditional rule-based ones, such as random, residual and complete. In addition, it obtains significant improvements in image classification and object detection without introducing excessive computation burden."



Paperid:962
Authors:Wei-Ting Chen, Hao-Yu Fang, Jian-Jiun Ding, Cheng-Che Tsai, Sy-Yen Kuo
Title: JSTASR: Joint Size and Transparency-Aware Snow Removal Algorithm Based on Modified Partial Convolution and Veiling Effect Removal
Abstract:
Snow usually affects the performance of computer vision systems. Compared with other atmospheric phenomena (e.g., haze and rain), snow is more complicated due to its transparency, various sizes, and the accumulation of the veiling effect, which makes single-image de-snowing more challenging. In this paper, first, we reformulate the snow model. Different from previous works, the proposed snow model includes the veiling effect. Second, a novel joint size and transparency-aware snow removal algorithm called JSTASR is proposed. It can classify snow particles according to their sizes and conduct snow removal at different scales. Moreover, to remove snow with different transparency, transparency-aware snow removal is developed. It can address both transparent and non-transparent snow particles by applying the modified partial convolution. Experiments show that the proposed method achieves significant improvement on both synthetic and real-world datasets and is very helpful for object detection on snow images."



Paperid:963
Authors:Zhipeng Zhang, Houwen Peng, Jianlong Fu, Bing Li, Weiming Hu
Title: Ocean: Object-aware Anchor-free Tracking
Abstract:
Anchor-based Siamese trackers have achieved remarkable advancements in accuracy, yet further improvement is restricted by lagging tracking robustness. We find the underlying reason is that the regression network in anchor-based methods is only trained on the positive anchor boxes ($IoU \geq 0.6$). This mechanism makes it difficult to refine the anchors whose overlap with the target objects is small. In this paper, we propose a novel object-aware anchor-free network to address this issue. First, instead of refining the reference anchor boxes, we directly predict the position and scale of target objects in an anchor-free fashion. Since each pixel in the ground-truth box is well trained, the tracker is capable of rectifying the weak predictions of target objects during inference. Second, we introduce a feature alignment module to learn an object-aware feature which corresponds with the predicted bounding box. The object-aware feature can further contribute to the classification of the target object and background. Moreover, we present a novel Siamese tracking framework based on the anchor-free model. The experiments show that our anchor-free tracker achieves state-of-the-art performance on five benchmarks, including VOT-2018, VOT-2019, OTB-100, GOT-10k and LaSOT. The source code is available at https://github.com/researchmm/TracKit."



Paperid:964
Authors:Yuan Liu, Ruoteng Li, Yu Cheng, Robby T. Tan, Xiubao Sui
Title: Object Tracking using Spatio-Temporal Networks for Future Prediction Location
Abstract:
We introduce an object tracking algorithm that predicts the future locations of the target object and assists the tracker to handle object occlusion. Given a few frames of an object that are extracted from a complete input sequence, we aim to predict the object’s location in the future frames. To facilitate the future prediction ability, we follow three key observations: 1) object motion trajectory is affected significantly by camera motion; 2) the past trajectory of an object can act as a salient cue to estimate the object motion in the spatial domain; 3) previous frames contain the surroundings and appearance of the target object, which is useful for predicting the target object’s future locations. We incorporate these three observations into our method that employs a multi-stream convolutional-LSTM network. By combining the heatmap scores from our tracker (that utilises appearance inference) and the locations of the target object from our trajectory inference, we predict the final target’s location in each frame. Comprehensive evaluations show that our method sets new state-of-the-art performance on a few commonly used tracking benchmarks."



Paperid:965
Authors:Yue Wang, Alireza Fathi, Abhijit Kundu, David A. Ross, Caroline Pantofaru, Tom Funkhouser, Justin Solomon
Title: Pillar-based Object Detection for Autonomous Driving
Abstract:
We present a simple and flexible object detection framework optimized for autonomous driving. Building on the observation that point clouds in this application are extremely sparse, we propose a practical pillar-based approach to fix the imbalance issue caused by anchors. In particular, our algorithm incorporates a cylindrical projection into multi-view feature learning, predicts bounding box parameters per pillar rather than per point or per anchor, and includes an aligned pillar-to-point projection module to improve the final prediction. Our anchor-free approach avoids hyperparameter search associated with past methods, simplifying 3D object detection while significantly improving upon state-of-the-art. "



Paperid:966
Authors:Yanbo Fan, Baoyuan Wu, Tuanhui Li, Yong Zhang, Mingyang Li, Zhifeng Li, Yujiu Yang
Title: Sparse Adversarial Attack via Perturbation Factorization
Abstract:
This work studies the sparse adversarial attack, which aims to generate adversarial perturbations onto partial positions of one benign image, such that the perturbed image is incorrectly predicted by a deep neural network (DNN) model. The sparse adversarial attack involves two challenges, i.e., where to perturb and how to determine the perturbation magnitude. Many existing works determined the perturbed positions manually or heuristically, and then optimized the magnitude using a proper algorithm designed for the dense adversarial attack. In this work, we propose to factorize the perturbation at each pixel into the product of two variables, including the perturbation magnitude and a binary selection factor (i.e., 0 or 1). One pixel is perturbed if its selection factor is 1, otherwise it is not perturbed. Based on this factorization, we formulate the sparse attack problem as a mixed integer programming (MIP) problem to jointly optimize the binary selection factors and continuous perturbation magnitudes of all pixels, with a cardinality constraint on the selection factors to explicitly control the degree of sparsity. Besides, the perturbation factorization provides the extra flexibility to incorporate other meaningful constraints on selection factors or magnitudes to achieve desired performance, such as group-wise sparsity or enhanced visual imperceptibility. We develop an efficient algorithm by equivalently reformulating the MIP problem as a continuous optimization problem. Experiments on benchmark databases demonstrate the superiority of the proposed method over several state-of-the-art sparse attack methods."
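A hedged sketch of the factorization itself: the per-pixel perturbation is the product of a continuous magnitude and a binary selection factor, and a hard top-k projection illustrates the cardinality constraint; the joint MIP optimization from the paper is not reproduced:

```python
import torch

def factorized_perturbation(x, magnitude, selection):
    """Apply delta = magnitude * selection to an image x in [0, 1]."""
    # magnitude: continuous tensor, same shape as x; selection: {0,1}-valued tensor
    delta = magnitude * selection
    return (x + delta).clamp(0.0, 1.0)

def project_topk(selection_scores, k):
    """Hard projection onto the cardinality constraint: keep the k largest scores."""
    flat = selection_scores.flatten()
    hard = torch.zeros_like(flat)
    hard[torch.topk(flat, k).indices] = 1.0
    return hard.view_as(selection_scores)
```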



Paperid:967
Authors:Maximilian Denninger, Rudolph Triebel
Title: 3D Scene Reconstruction from a Single Viewport
Abstract:
We present a novel approach to infer volumetric reconstructions from a single viewport, based only on an RGB image and a reconstructed normal image. To overcome the problem of reconstructing regions in 3D that are occluded in the 2D image, we propose to learn this information from synthetically generated high-resolution data. To do this, we introduce a deep network architecture that is specifically designed for volumetric TSDF data by featuring a specific tree net architecture. Our framework can handle a 3D resolution of $512^3$ by introducing a dedicated compression technique based on a modified autoencoder. Furthermore, we introduce a novel loss shaping technique for 3D data that guides the learning process towards regions where free and occupied space are close to each other. As we show in experiments on synthetic and realistic benchmark data, this leads to very good reconstruction results, both visually and in terms of quantitative measures. "



Paperid:968
Authors:Seonguk Seo, Yumin Suh, Dongwan Kim, Geeho Kim, Jongwoo Han, Bohyung Han
Title: Learning to Optimize Domain Specific Normalization for Domain Generalization
Abstract:
We propose a simple but effective multi-source domain generalization technique based on deep neural networks by incorporating optimized normalization layers that are specific to individual domains. Our approach employs multiple normalization methods while learning separate affine parameters per domain. For each domain, the activations are normalized by a weighted average of multiple normalization statistics. The normalization statistics are tracked separately for each normalization type when necessary. Specifically, we employ batch and instance normalization in our implementation to identify the best combination of these two normalization methods in each domain. The optimized normalization layers are effective in enhancing the generalizability of the learned model. We demonstrate the state-of-the-art accuracy of our algorithm on the standard domain generalization benchmarks, as well as its viability for further tasks such as multi-source domain adaptation and domain generalization in the presence of label noise."
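A hedged sketch of a domain-specific optimized-normalization layer: per-domain affine parameters and a learnable mixture between batch- and instance-normalized activations; the names and the exact mixing scheme are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class DomainSpecificNorm2d(nn.Module):
    """Sketch: each domain mixes BN and IN statistics with its own affine parameters."""
    def __init__(self, channels, num_domains):
        super().__init__()
        self.bn = nn.ModuleList(nn.BatchNorm2d(channels, affine=False)
                                for _ in range(num_domains))
        self.inorm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Parameter(torch.ones(num_domains, channels))
        self.beta = nn.Parameter(torch.zeros(num_domains, channels))
        self.mix = nn.Parameter(torch.zeros(num_domains))   # sigmoid -> mixture weight

    def forward(self, x, domain):
        w = torch.sigmoid(self.mix[domain])                  # weight of the BN branch
        normed = w * self.bn[domain](x) + (1.0 - w) * self.inorm(x)
        g = self.gamma[domain].view(1, -1, 1, 1)
        b = self.beta[domain].view(1, -1, 1, 1)
        return normed * g + b
```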



Paperid:969
Authors:Ye Yu, Abhimitra Meka, Mohamed Elgharib, Hans-Peter Seidel, Christian Theobalt, William A. P. Smith
Title: Self-supervised Outdoor Scene Relighting
Abstract:
Outdoor scene relighting is a challenging problem that requires good understanding of the scene geometry, illumination and albedo. Current techniques are completely supervised, requiring high-quality synthetic renderings to train a solution. Such renderings are synthesized using priors learned from limited data. In contrast, we propose a self-supervised approach for relighting. Our approach is trained only on corpora of images collected from the internet without any user supervision. This virtually endless source of training data allows training a general relighting solution. Our approach first decomposes an image into its albedo, geometry and illumination. A novel relighting is then produced by modifying the illumination parameters. Our solution captures shadows using a dedicated shadow prediction map, and does not rely on accurate geometry estimation. We evaluate our technique subjectively and objectively using a new dataset with ground-truth relighting. Results show the ability of our technique to produce photo-realistic and physically plausible results that generalize to unseen scenes."



Paperid:970
Authors:Mikiya Shibuya, Shinya Sumikura, Ken Sakurada
Title: Privacy Preserving Visual SLAM
Abstract:
This study proposes a privacy-preserving Visual SLAM framework for estimating camera poses and performing bundle adjustment with mixed line and point clouds in real time. Previous studies have proposed localization methods that estimate a camera pose using a line-cloud map for a single image or a reconstructed point cloud. These methods protect scene privacy against inversion attacks, which reconstruct the scene images from the point cloud, by converting the point cloud to a line cloud. However, they are not directly applicable to a video sequence because they do not address computational efficiency. This is a critical issue to solve for estimating camera poses and performing bundle adjustment with mixed line and point clouds in real time. Moreover, there has been no study on a method to optimize a line-cloud map on a server with a point cloud reconstructed from a client video, because no observation points in image coordinates are made available, so as to prevent inversion attacks, namely the reversibility of the 3D lines. The experimental results with synthetic and real data show that our Visual SLAM framework achieves the intended privacy preservation and real-time performance using a line-cloud map."



Paperid:971
Authors:Valentina Sanguineti, Pietro Morerio, Niccolò Pozzetti, Danilo Greco, Marco Cristani, Vittorio Murino
Title: Leveraging Acoustic Images for Effective Self-Supervised Audio Representation Learning
Abstract:
In this paper, we propose the use of a new modality characterized by a richer information content, namely acoustic images, for the sake of audio-visual scene understanding. Each pixel in such images is characterized by a spectral signature, associated with a specific direction in space and obtained by processing the audio signals coming from an array of microphones. By coupling such an array with a video camera, we obtain spatio-temporal alignment of acoustic images and video frames. This constitutes a powerful source of self-supervision, which can be exploited in the learning pipeline we are proposing, without resorting to expensive data annotations. However, since 2D planar arrays are cumbersome and not as widespread as ordinary microphones, we propose that the richer information content of acoustic images can be distilled, through a self-supervised learning scheme, into more powerful audio and visual feature representations. The learnt feature representations can then be employed for downstream tasks such as classification and cross-modal retrieval, without the need of a microphone array. To prove that, we introduce a novel multimodal dataset consisting of RGB videos, raw audio signals and acoustic images, aligned in space and synchronized in time. Experimental results demonstrate the validity of our hypothesis and the effectiveness of the proposed pipeline, also when tested for tasks and datasets different from those used for training."



Paperid:972
Authors:Yanbei Chen, Loris Bazzani
Title: Learning Joint Visual Semantic Matching Embeddings for Language-guided Retrieval
Abstract:
Interactive image retrieval is an emerging research topic with the objective of integrating inputs from multiple modalities as query for retrieval, e.g., textual feedback from users to guide, modify or refine image retrieval. In this work, we study the problem of composing images and textual modifications for language-guided retrieval in the context of fashion applications. We propose a unified Joint Visual Semantic Matching (JVSM) model that learns image-text compositional embeddings by jointly associating visual and textual modalities in a shared discriminative embedding space via compositional losses. JVSM has been designed with versatility and flexibility in mind, being able to perform multiple image and text tasks in a single model, such as text-image matching and language-guided retrieval. We show the effectiveness of our approach in the fashion domain, where it is difficult to express keyword-based queries given the complex specificity of fashion terms. Our experiments on three datasets (Fashion-200k, UT-Zap50k, and Fashion-iq) show that JVSM achieves state-of-the-art results on language-guided retrieval and additionally we show its capabilities to perform image and text retrieval. "



Paperid:973
Authors:Haoang Li, Pyojin Kim, Ji Zhao, Kyungdon Joo, Zhipeng Cai, Zhe Liu , Yun-Hui Liu
Title: Globally Optimal and Efficient Vanishing Point Estimation in Atlanta World
Abstract:
Atlanta world holds for scenes composed of a vertical dominant direction and several horizontal dominant directions. A vanishing point (VP) is the intersection of image lines projected from parallel 3D lines. In Atlanta world, given a set of image lines, we aim to cluster them by the unknown-but-sought VPs, whose number is unknown. Existing approaches are prone to missing partial inliers, rely on prior knowledge of the number of VPs, and/or lead to low efficiency. To overcome these limitations, we propose the novel mine-and-stab (MnS) algorithm and embed it in the branch-and-bound (BnB) algorithm. Different from BnB, which iteratively branches the full parameter intervals, our MnS directly mines the narrow sub-intervals and then stabs them by probes. We simultaneously search for the vertical VP by BnB and the horizontal VPs by MnS. The proposed collaboration between BnB and MnS guarantees global optimality in terms of maximizing the number of inliers. It can also automatically determine the number of VPs. Moreover, its efficiency is suitable for practical applications. Experiments on synthetic and real-world datasets showed that our method outperforms state-of-the-art approaches in terms of accuracy and/or efficiency."



Paperid:974
Authors:Yuri Viazovetskyi, Vladimir Ivashkin, Evgeny Kashin
Title: StyleGAN2 Distillation for Feed-forward Image Manipulation
Abstract:
StyleGAN2 is a state-of-the-art network for generating realistic images. Moreover, it was explicitly trained to have disentangled directions in latent space, which allows efficient image manipulation by varying latent factors. Editing an existing image requires embedding it into the latent space of StyleGAN2. Latent code optimization via backpropagation is commonly used for qualitative embedding of real-world images, although it is prohibitively slow for many applications. We propose a way to distill a particular image manipulation of StyleGAN2 into an image-to-image network trained in a paired way. The resulting pipeline is an alternative to existing GANs that are trained on unpaired data. We provide results for human face transformations: gender swap, aging/rejuvenation, style transfer and image morphing. We show that the quality of generation using our method is comparable to StyleGAN2 backpropagation and current state-of-the-art methods on these particular tasks."



Paperid:975
Authors:Jinxian Liu, Minghui Yu, Bingbing Ni, Ye Chen
Title: Self-Prediction for Joint Instance and Semantic Segmentation of Point Clouds
Abstract:
We develop a novel learning scheme named Self-Prediction for 3D instance and semantic segmentation of point clouds. Distinct from most existing methods that focus on designing convolutional operators, our method designs a new learning scheme to enhance point relation exploring for better segmentation. More specifically, we divide a point cloud sample into two subsets and construct a complete graph based on their representations. Then we use a label propagation algorithm to predict labels of one subset when given labels of the other subset. By training with this Self-Prediction task, the backbone network is constrained to fully explore relational context/geometric/shape information and learn more discriminative features for segmentation. Moreover, a general associated framework equipped with our Self-Prediction scheme is designed for enhancing instance and semantic segmentation simultaneously, where instance and semantic representations are combined to perform Self-Prediction. In this way, instance and semantic segmentation collaborate and mutually reinforce each other. Significant performance improvements on instance and semantic segmentation over the baseline are achieved on S3DIS and ShapeNet. Our method achieves state-of-the-art instance segmentation results on S3DIS and semantic segmentation results comparable with the state of the art on S3DIS and ShapeNet when taking only PointNet++ as the backbone network."
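A minimal sketch of the Self-Prediction step as described: given features of a point cloud split into subsets A and B, propagate the known labels of A to B over a feature-similarity graph; the particular graph construction used here (softmax over negative distances) is an assumption:

```python
import torch

def label_propagation(features, labels_a, idx_a, idx_b, sigma=1.0):
    """Sketch: predict soft labels of subset B from the one-hot labels of subset A.

    features: (P, D) per-point features of one cloud
    labels_a: (|A|, num_classes) one-hot labels of subset A
    idx_a, idx_b: long tensors indexing the two subsets
    """
    d = torch.cdist(features[idx_b], features[idx_a])   # (|B|, |A|) feature distances
    w = torch.softmax(-d / sigma, dim=1)                 # affinity from B to A
    return w @ labels_a                                  # (|B|, num_classes) soft labels
```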



Paperid:976
Authors:Eduardo Hugo Sanchez, Mathieu Serrurier, Mathias Ortner
Title: Learning Disentangled Representations via Mutual Information Estimation
Abstract:
In this paper, we investigate the problem of learning disentangled representations. Given a pair of images sharing some attributes, we aim to create a low-dimensional representation which is split into two parts: a shared representation that captures the common information between the images and an exclusive representation that contains the specific information of each image. To address this issue, we propose a model based on mutual information estimation without relying on image reconstruction or image generation. Mutual information maximization is performed to capture the attributes of data in the shared and exclusive representations, while we minimize the mutual information between the shared and exclusive representations to enforce representation disentanglement. We show that these representations are useful to perform downstream tasks such as image classification and image retrieval based on the shared or exclusive component. Moreover, classification results show that our model outperforms the state-of-the-art models based on VAE/GAN approaches in representation disentanglement."
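As an illustration of the "maximize mutual information on the shared part" objective, a hedged InfoNCE-style lower bound between the shared representations of paired images; the paper's actual estimator and the MI-minimization term for the exclusive part are not reproduced here:

```python
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(z1, z2, temperature=0.1):
    """Sketch: InfoNCE-style MI lower bound; z1[i] and z2[i] are shared codes of an image pair."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)  # matching pairs sit on the diagonal
    return -F.cross_entropy(logits, targets)              # higher value = more shared information
```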



Paperid:977
Authors:Chenglong Li, Lei Liu, Andong Lu, Qing Ji, Jin Tang
Title: Challenge-Aware RGBT Tracking
Abstract:
RGB and thermal source data suffer from both shared and specific challenges, and how to explore and exploit them plays a critical role in representing the target appearance in RGBT tracking. In this paper, we propose a novel challenge-aware neural network to handle the modality-shared challenges (e.g., fast motion, scale variation and occlusion) and the modality-specific ones (e.g., illumination variation and thermal crossover) for RGBT tracking. In particular, we design several parameter-shared branches in each layer to model the target appearance under the modality-shared challenges, and several parameter-independent branches under the modality-specific ones. Based on the observation that the modality-specific cues of different modalities usually contain complementary advantages, we propose a guidance module to transfer discriminative features from one modality to another, which could enhance the discriminative ability of a weak modality. Moreover, all branches are aggregated together in an adaptive manner and embedded in parallel in the backbone network to efficiently form more discriminative target representations. These challenge-aware branches are able to model the target appearance under certain challenges, so that the target representations can be learnt with a few parameters even when training data are insufficient. Experimental results show that our method operates at real-time speed while performing well against state-of-the-art methods on three benchmark datasets."



Paperid:978
Authors:Bruno Lecouat, Jean Ponce, Julien Mairal
Title: Fully Trainable and Interpretable Non-Local Sparse Models for Image Restoration
Abstract:
Non-local self-similarity and sparsity principles have proven to be powerful priors for natural image modeling. We propose a novel differentiable relaxation of joint sparsity that exploits both principles and leads to a general framework for image restoration which is (1) trainable end to end, (2) fully interpretable, and (3) much more compact than competing deep learning architectures. We apply this approach to denoising, JPEG deblocking, and demosaicking, and show that, with as few as 100K parameters, its performance on several standard benchmarks is on par with or better than state-of-the-art methods that may have an order of magnitude or more parameters."



Paperid:979
Authors:Harkirat Singh Behl, Atilim Güneş Baydin, Ran Gal, Philip H.S. Torr, Vibhav Vineet
Title: AutoSimulate: (Quickly) Learning Synthetic Data Generation
Abstract:
Simulation is increasingly being used for generating large labelled datasets in many machine learning problems. Recent methods have focused on adjusting simulator parameters with the goal of maximising accuracy on a validation task, usually relying on REINFORCE-like gradient estimators. However, these approaches are very expensive as they treat the entire data generation, model training, and validation pipeline as a black-box and require multiple costly objective evaluations at each iteration. We propose an efficient alternative for optimal synthetic data generation, based on a novel differentiable approximation of the objective. This allows us to optimize the simulator, which may be non-differentiable, requiring only one objective evaluation at each iteration with little overhead. We demonstrate on a state-of-the-art photorealistic renderer that the proposed method finds the optimal data distribution faster (up to 50 times), with significantly reduced training data generation and better accuracy than previous methods."



Paperid:980
Authors:Xiaotong Luo, Yuan Xie, Yulun Zhang, Yanyun Qu, Cuihua Li, Yun Fu
Title: LatticeNet: Towards Lightweight Image Super-resolution with Lattice Block
Abstract:
Deep neural networks with a massive number of layers have made remarkable breakthroughs on single image super-resolution (SR), but at the cost of computational complexity and memory storage. To address this problem, we focus on lightweight models for fast and accurate image SR. Due to the frequent use of the residual block (RB) in SR models, we pursue an economical structure to adaptively combine RBs. Drawing lessons from the lattice filter bank, we design the lattice block (LB), in which two butterfly structures are applied to combine two RBs. The LB can realize various linear combinations of two RBs, with each case determined by combination coefficients produced by an attention mechanism. The LB favors lightweight SR models, reducing the number of parameters by about half while keeping similar SR performance. Moreover, we propose a lightweight SR model, LatticeNet, which uses a series connection of LBs and backward feature fusion. Extensive experiments demonstrate that our proposal achieves superior accuracy on four available benchmark datasets against other state-of-the-art methods, while maintaining relatively low computation and memory requirements."



Paperid:981
Authors:M.Naseer Subhani, Mohsen Ali
Title: Learning from Scale-Invariant Examples for Domain Adaptation in Semantic Segmentation
Abstract:
Self-supervised learning approaches for unsupervised domain adaptation (UDA) of semantic segmentation models suffer from the challenge of predicting and selecting reasonably good-quality pseudo-labels. In this paper, we propose a novel approach that exploits the scale-invariance property of the semantic segmentation model for self-supervised domain adaptation. Our algorithm is based on the reasonable assumption that, in general, regardless of the size of the object and stuff (given context), the semantic labeling should be unchanged. We show that this constraint is violated over the images of the target domain, and hence could be used to transfer labels between differently scaled patches. Specifically, we show that the semantic segmentation model produces output with high entropy when presented with scaled-up patches of the target domain, in comparison to when presented with original-size images. These scale-invariant examples are extracted from the most confident images of the target domain. A dynamic class-specific entropy thresholding mechanism is presented to filter out unreliable pseudo-labels. Furthermore, we also incorporate the focal loss to tackle the problem of class imbalance in self-supervised learning. Extensive experiments have been performed, and the results indicate that, by exploiting scale-invariant labeling, we outperform existing self-supervised state-of-the-art domain adaptation methods. Specifically, we achieve leads of 1.3% and 3.8% for GTA5 to Cityscapes and SYNTHIA to Cityscapes with the VGG16-FCN8 baseline network."
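Two of the ingredients named above, entropy-based pseudo-label filtering and the focal loss, are standard and can be sketched as follows; the fixed threshold here is a placeholder (the paper describes a dynamic, class-specific one), and the tensor shapes and ignore index are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Standard focal loss for (N, C, H, W) segmentation logits and valid (N, H, W) targets."""
    logp = F.log_softmax(logits, dim=1)
    p = logp.exp()
    ce = F.nll_loss(logp, targets, reduction="none")            # per-pixel cross-entropy
    pt = p.gather(1, targets.unsqueeze(1)).squeeze(1)           # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()

def entropy_filtered_pseudo_labels(logits, threshold):
    """Keep pseudo-labels only where the prediction entropy is below a threshold."""
    prob = F.softmax(logits, dim=1)
    entropy = -(prob * prob.clamp_min(1e-12).log()).sum(dim=1)  # (N, H, W) pixel-wise entropy
    pseudo = prob.argmax(dim=1)
    pseudo[entropy > threshold] = 255                           # assumed ignore index
    return pseudo
```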



Paperid:982
Authors:Hanqing Wang, Wenguan Wang, Tianmin Shu, Wei Liang, Jianbing Shen
Title: Active Visual Information Gathering for Vision-Language Navigation
Abstract:
Vision-language navigation (VLN) is the task of requiring an agent to carry out navigational instructions inside photo-realistic environments. One of the key challenges in VLN is how to conduct robust navigation by mitigating the uncertainty caused by ambiguous instructions and insufficient observation of the environment. Agents trained by current approaches typically suffer from this and consequently struggle to avoid random and inefficient actions at every step. In contrast, when humans face such a challenge, they can still maintain robust navigation by actively exploring the surroundings to gather more information and thus make more confident navigation decisions. This work draws inspiration from human navigation behavior and endows an agent with an active information gathering ability for a more intelligent vision-language navigation policy. To achieve this, we propose an end-to-end framework for learning an exploration policy that decides i) when and where to explore, ii) what information is worth gathering during exploration, and iii) how to adjust the navigation decision after the exploration. The experimental results show that promising exploration strategies emerge from training, which leads to a significant boost in navigation performance. On the R2R challenge leaderboard, our agent achieves promising results in all three VLN settings, i.e., single run, pre-exploration, and beam search."



Paperid:983
Authors:Yancong Lin, Silvia L. Pintea, Jan C. van Gemert
Title: Deep Hough-Transform Line Priors
Abstract:
Classical work on line segment detection is knowledge-based; it uses carefully designed geometric priors using either image gradients, pixel groupings, or Hough transform variants. Instead, current deep learning methods do away with all prior knowledge and replace priors by training deep networks on large manually annotated datasets. Here, we reduce the dependency on labeled data by building on the classic knowledge-based priors while using deep networks to learn features. We add line priors through a trainable Hough transform block into a deep network. Hough transform provides the prior knowledge about global line parameterizations, while the convolutional layers can learn the local gradient-like line features. On the Wireframe (ShanghaiTech) and York Urban datasets we show that adding prior knowledge improves data efficiency as line priors no longer need to be learned from data."
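For reference, the classical Hough voting that the trainable block builds on maps edge pixels into a (rho, theta) accumulator; the sketch below shows only that non-trainable baseline step (the discretisation sizes are arbitrary choices), not the differentiable block proposed in the paper.

```python
import numpy as np

def hough_line_accumulator(edge_map, num_rho=200, num_theta=180):
    """Vote edge pixels into a (rho, theta) accumulator for the line x*cos(t) + y*sin(t) = rho.

    edge_map: (H, W) binary array of detected edge/line pixels.
    """
    H, W = edge_map.shape
    diag = np.hypot(H, W)
    thetas = np.linspace(0.0, np.pi, num_theta, endpoint=False)
    rhos = np.linspace(-diag, diag, num_rho)
    acc = np.zeros((num_rho, num_theta), dtype=np.int64)
    ys, xs = np.nonzero(edge_map)
    for theta_idx, theta in enumerate(thetas):
        rho_vals = xs * np.cos(theta) + ys * np.sin(theta)
        rho_idx = np.clip(np.digitize(rho_vals, rhos) - 1, 0, num_rho - 1)
        np.add.at(acc, (rho_idx, theta_idx), 1)   # accumulate votes, allowing repeats
    return acc, rhos, thetas
```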



Paperid:984
Authors:Keyang Zhou, Bharat Lal Bhatnagar, Gerard Pons-Moll
Title: Unsupervised Shape and Pose Disentanglement for 3D Meshes
Abstract:
Parametric models of humans, faces, hands and animals have been widely used for a range of tasks such as image-based reconstruction, shape correspondence estimation, and animation. Their key strength is the ability to factor surface variations into shape and pose dependent components. Learning such models requires lots of expert knowledge and hand-defined object-specific constraints, making the learning approach unscalable to novel objects. In this paper, we present a simple yet effective approach to learn disentangled shape and pose representations in an unsupervised setting. We use a combination of self-consistency and cross-consistency constraints to learn pose and shape space from registered meshes. We additionally incorporate as-rigid-as-possible deformation (ARAP) into the training loop to avoid degenerate solutions. We demonstrate the usefulness of learned representations through a number of tasks including pose transfer and shape retrieval. The experiments on datasets of 3D humans, faces, hands and animals demonstrate the generality of our approach. Code is made available at https://virtualhumans.mpi-inf.mpg.de/unsup_shape_pose/."



Paperid:985
Authors:Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, Seung-Ik Lee
Title: CLAWS: Clustering Assisted Weakly Supervised Learning with Normalcy Suppression for Anomalous Event Detection
Abstract:
Learning to detect real-world anomalous events through video-level labels is a challenging task due to the rare occurrence of anomalies as well as noise in the labels. In this work, we propose a weakly supervised anomaly detection method which has manifold contributions including 1) a random batch based training procedure to reduce inter-batch correlation, 2) a normalcy suppression mechanism to minimize anomaly scores of the normal regions of a video by taking into account the overall information available in one training batch, and 3) a clustering distance based loss to contribute towards mitigating the label noise and to produce better anomaly representations by encouraging our model to generate distinct normal and anomalous clusters. The proposed method obtains 83.03% and 89.67% frame-level AUC performance on the UCF-Crime and ShanghaiTech datasets respectively, demonstrating its superiority over the existing state-of-the-art algorithms."



Paperid:986
Authors:Ning Yu, Ke Li, Peng Zhou, Jitendra Malik, Larry Davis, Mario Fritz
Title: Inclusive GAN: Improving Data and Minority Coverage in Generative Models
Abstract:
Generative Adversarial Networks (GANs) have brought about rapid progress towards generating photorealistic images. Yet the equitable allocation of their modeling capacity among subgroups has received less attention, which could lead to potential biases against underrepresented minorities if left uncontrolled. In this work, we first formalize the problem of minority inclusion as one of data coverage, and then propose to improve data coverage by harmonizing adversarial training with reconstructive generation. The experiments show that our method outperforms the existing state-of-the-art methods in terms of data coverage on both seen and unseen data. We develop an extension that allows explicit control over the minority subgroups that the model should ensure to include, and validate its effectiveness at little compromise from the overall performance on the entire dataset. Code, models, and supplemental videos are available at GitHub."



Paperid:987
Authors:Evangelos Ntavelis, Andrés Romero, Iason Kastanis, Luc Van Gool, Radu Timofte
Title: SESAME: Semantic Editing of Scenes by Adding, Manipulating or Erasing Objects
Abstract:
Recent advances in image generation gave rise to powerful tools for semantic image editing. However, existing approaches can either operate on a single image or require an abundance of additional information. They are not capable of handling the complete set of editing operations, that is addition, manipulation or removal of semantic concepts. To address these limitations, we propose SESAME, a novel generator-discriminator pair for Semantic Editing of Scenes by Adding, Manipulating or Erasing objects. In our setup, the user provides the semantic labels of the areas to be edited and the generator synthesizes the corresponding pixels. In contrast to previous methods that employ a discriminator that concatenates semantics and image as an input, the SESAME discriminator is composed of two input streams that independently process the image and its semantics, using the latter to manipulate the results of the former. We evaluate our model on a diverse set of datasets and report state-of-the-art performance on two tasks: (a) image manipulation and (b) image generation conditioned on semantic labels."



Paperid:988
Authors:Ran Chen, Yong Liu, Mengdan Zhang, Shu Liu, Bei Yu, Yu-Wing Tai
Title: Dive Deeper Into Box for Object Detection
Abstract:
Anchor-free methods have defined the new frontier of state-of-the-art research in object detection, in which accurate bounding box estimation is the key to their success. However, even the bounding box with the highest confidence score can still be far from perfect in localization. This motivates us to investigate a box reorganization method (DDBNet), which dives deeper into the box to strive for more accurate localization. Specifically, boxes are manipulated via a surgical operation named D&R, which performs box decomposition and recombination to tighten instances more precisely. It should be noted that this D&R operation is carried out at the level of the IoU loss. Experimental results show that our method is effective, leading to state-of-the-art performance for one-stage object detection."



Paperid:989
Authors:Bingyan Liao, Chenye Wang, Yayun Wang, Yaonong Wang, Jun Yin
Title: PG-Net: Pixel to Global Matching Network for Visual Tracking
Abstract:
Siamese neural networks have been well investigated in tracking frameworks due to their fast speed and high accuracy. However, very little effort has been spent by those approaches on suppressing the background. In this paper, a Pixel to Global Matching Network (PG-Net) is proposed to suppress the influence of the background in the search image while achieving state-of-the-art tracking performance. To achieve this purpose, each pixel of the search feature is used to calculate its similarity with the global template feature. This calculation appropriately reduces the matching area, thus introducing less background interference. In addition, we propose a new tracking framework that performs correlation-shared tracking with multiple training losses, which not only reduces the computational burden but also improves performance. We conduct comparison experiments on various public tracking datasets, obtaining state-of-the-art performance while running at a fast speed."
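A minimal sketch of the pixel-to-global matching idea follows, assuming the global template descriptor is obtained by average pooling the template feature and that cosine similarity is the matching function; both choices are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def pixel_to_global_similarity(search_feat, template_feat):
    """Similarity between every search-feature pixel and a single global template descriptor.

    search_feat   : (B, C, H, W) features of the search image.
    template_feat : (B, C, h, w) features of the template image.
    Returns a (B, 1, H, W) similarity map.
    """
    # Global template vector, broadcast to the spatial size of the search feature.
    g = template_feat.mean(dim=(2, 3), keepdim=True).expand_as(search_feat)
    sim = F.cosine_similarity(search_feat, g, dim=1)   # per-pixel similarity
    return sim.unsqueeze(1)
```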



Paperid:990
Authors:Taimoor Tariq, Okan Tarhan Tursun, Munchurl Kim, Piotr Didyk
Title: Why Are Deep Representations Good Perceptual Quality Features?
Abstract:
Recently, intermediate feature maps of pre-trained convolutional neural networks have shown significant perceptual quality improvements, when they are used in the loss function for training new networks. It is believed that these features are better at encoding the perceptual quality and provide more efficient representations of input images compared to other perceptual metrics such as SSIM and PSNR. However, there have been no systematic studies to determine the underlying reason. Due to the lack of such an analysis, it is not possible to evaluate the performance of a particular set of features or to improve the perceptual quality even more by carefully selecting a subset of features from a pre-trained CNN. This work shows that the capabilities of pre-trained deep CNN features in optimizing the perceptual quality are correlated with their success in capturing basic human visual perception characteristics. In particular, we focus our analysis on fundamental aspects of human perception, such as the contrast sensitivity and orientation selectivity. We introduce two new formulations to measure the frequency and orientation selectivity of the features learned by convolutional layers for evaluating deep features learned by widely-used deep CNNs such as VGG-16. We demonstrate that the pre-trained CNN features which receive higher scores are better at predicting human quality judgment. Furthermore, we show the possibility of using our method to select deep features to form a new loss function, which improves the image reconstruction quality for the well-known single-image super-resolution problem. "



Paperid:991
Authors:Aoxiang Fan, Xingyu Jiang, Yang Wang, Junjun Jiang, Jiayi Ma
Title: Geometric Estimation via Robust Subspace Recovery
Abstract:
Geometric estimation from image point correspondences is the core procedure of many 3D vision problems, which is prevalently accomplished by random sampling techniques. In this paper, we consider the problem from an optimization perspective, to exploit the intrinsic linear structure of point correspondences to assist estimation. We generalize the conventional method to a robust one and extend the previous analysis for linear structure to develop several new algorithms. The proposed solutions essentially address the estimation problem by solving a subspace recovery problem to identify the inliers. Experiments on real-world image datasets for both fundamental matrix and homography estimation demonstrate the superiority of our method over the state-of-the-art in terms of both robustness and accuracy."



Paperid:992
Authors:Sanath Narayan, Akshita Gupta, Fahad Shahbaz Khan, Cees G. M. Snoek, Ling Shao
Title: Latent Embedding Feedback and Discriminative Features for Zero-Shot Classification
Abstract:
Zero-shot learning strives to classify unseen categories for which no data is available during training. In the generalized variant, the test samples can further belong to seen or unseen categories. The state-of-the-art relies on Generative Adversarial Networks that synthesize unseen class features by leveraging class-specific semantic embeddings. During training, they generate semantically consistent features, but discard this constraint during feature synthesis and classification. We propose to enforce semantic consistency at all stages of (generalized) zero-shot learning: training, feature synthesis and classification. We first introduce a feedback loop, from a semantic embedding decoder, that iteratively refines the generated features during both the training and feature synthesis stages. The synthesized features together with their corresponding latent embeddings from the decoder are then transformed into discriminative features and utilized during classification to reduce ambiguities among categories. Experiments on (generalized) zero-shot object and action classification reveal the benefit of semantic consistency and iterative feedback, outperforming existing methods on six zero-shot learning benchmarks. Source code at https://github.com/akshitac8/tfvaegan."



Paperid:993
Authors:Yujing Lou, Yang You, Chengkun Li, Zhoujun Cheng, Liangwei Li, Lizhuang Ma, Weiming Wang, Cewu Lu
Title: Human Correspondence Consensus for 3D Object Semantic Understanding
Abstract:
Semantic understanding of 3D objects is crucial in many applications such as object manipulation. However, it is hard to give a universal definition of point-level semantics that everyone would agree on. We observe that people have a consensus on semantic correspondences between two areas from different objects, but are less certain about the exact semantic meaning of each area. Therefore, we argue that by providing human labeled correspondences between different objects from the same category instead of explicit semantic labels, one can recover rich semantic information of an object. In this paper, we introduce a new dataset named CorresPondenceNet. Based on this dataset, we are able to learn dense semantic embeddings with a novel geodesic consistency loss. Accordingly, several state-of-the-art networks are evaluated on this correspondence benchmark. We further show that CorresPondenceNet could not only boost fine-grained understanding of heterogeneous objects but also cross-object registration and partial object matching."



Paperid:994
Authors:Jiwei Chen, Yubao Sun, Qingshan Liu, Rui Huang
Title: Learning Memory Augmented Cascading Network for Compressed Sensing of Images
Abstract:
In this paper, we propose a cascading network for compressed sensing of images with progressive reconstruction. Specifically, we decompose the complex reconstruction mapping into the cascade of incremental detail reconstruction (IDR) modules and measurement residual updating (MRU) modules. The IDR module is designed to reconstruct the remaining details from the residual measurement vector, and MRU is employed to update the residual measurement vector and feed it into the next IDR module. The contextual memory module is introduced to augment the capacity of IDR modules, therefore facilitating the information interaction among all the IDR modules. The final reconstruction is calculated by accumulating the outputs of all the IDR modules. Extensive experiments on natural images and magnetic resonance images demonstrate the proposed method achieves better performance against the state-of-the-art methods."
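The measurement-residual recursion described above can be written compactly; the sketch below uses plain fully connected layers as placeholders for the IDR modules and omits the contextual memory module, so it only illustrates the cascade and residual update under those assumptions, not the authors' network.

```python
import torch
import torch.nn as nn

class CascadedCSReconstruction(nn.Module):
    """Skeleton of an IDR/MRU-style cascade for compressed sensing (placeholder modules).

    Phi : (M, N) sensing matrix, with measurements y = Phi @ x for a vectorised image x.
    """

    def __init__(self, Phi, num_stages=4, hidden=256):
        super().__init__()
        M, N = Phi.shape
        self.register_buffer("Phi", Phi)
        self.idr = nn.ModuleList([
            nn.Sequential(nn.Linear(M, hidden), nn.ReLU(), nn.Linear(hidden, N))
            for _ in range(num_stages)])

    def forward(self, y):
        x = torch.zeros(y.size(0), self.Phi.size(1), device=y.device)
        residual = y
        for idr in self.idr:
            x = x + idr(residual)               # incremental detail reconstruction (IDR)
            residual = y - x @ self.Phi.t()     # measurement residual updating (MRU)
        return x
```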



Paperid:995
Authors:Dizhong Zhu, William A. P. Smith
Title: Least squares surface reconstruction on arbitrary domains
Abstract:
Almost universally in computer vision, when surface derivatives are required, they are computed using only first-order-accurate finite difference approximations. We propose a new method for computing numerical derivatives based on 2D Savitzky-Golay filters and K-nearest neighbour kernels. The resulting derivative matrices can be used for least squares surface reconstruction over arbitrary (even disconnected) domains in the presence of large noise and allowing for higher order polynomial local surface approximations. They are useful for a range of tasks including normal-from-depth (i.e. surface differentiation), height-from-normals (i.e. surface integration) and shape-from-x. We show how to write both orthographic and perspective height-from-normals as a linear least squares problem using the same formulation and avoiding a nonlinear change of variables in the perspective case. We demonstrate improved performance relative to state-of-the-art across these tasks on both synthetic and real data and make available an open source implementation of our method."
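The core idea of fitting a local polynomial to a K-nearest-neighbour patch and reading derivatives off the fitted coefficients can be sketched as below; this is a simple stand-in under stated assumptions (quadratic model, hard K-NN selection, dense least squares), not the paper's derivative-matrix construction.

```python
import numpy as np

def local_quadratic_derivatives(points, values, query_idx, k=12):
    """Estimate (dz/dx, dz/dy) at one point by least-squares fitting a quadratic
    z = a + b*x + c*y + d*x^2 + e*x*y + f*y^2 to its k nearest neighbours.

    points : (N, 2) domain coordinates.
    values : (N,)   heights z at those coordinates.
    """
    q = points[query_idx]
    d2 = ((points - q) ** 2).sum(axis=1)
    nn = np.argsort(d2)[:k]                       # K nearest neighbours of the query
    dx, dy = (points[nn] - q).T                   # local coordinates centred at the query
    A = np.stack([np.ones_like(dx), dx, dy, dx**2, dx*dy, dy**2], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, values[nn], rcond=None)
    return coeffs[1], coeffs[2]                   # first derivatives at the query point
```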



Paperid:996
Authors:My Kieu, Andrew D. Bagdanov, Marco Bertini, Alberto del Bimbo
Title: Task-conditioned Domain Adaptation for Pedestrian Detection in Thermal Imagery
Abstract:
Pedestrian detection is a core problem in computer vision that sees broad application in video surveillance and, more recently, in advanced driving assistance systems. Despite its broad application and interest, it remains a challenging problem in part due to the vast range of conditions under which it must be robust. Pedestrian detection at nighttime and during adverse weather conditions is particularly challenging, which is one of the reasons why thermal and multispectral approaches have become popular in recent years. In this paper, we propose a novel approach to domain adaptation that significantly improves pedestrian detection performance in the thermal domain. The key idea behind our technique is to adapt an RGB-trained detection network to simultaneously solve two related tasks. An auxiliary classification task that distinguishes between daytime and nighttime thermal images is added to the main detection task during domain adaptation. The internal representation learned to perform this classification task is used to condition a YOLOv3 detector at multiple points in order to improve its adaptation to the thermal domain. We validate the effectiveness of task-conditioned domain adaptation by comparing with the state-of-the-art on the KAIST Multispectral Pedestrian Detection Benchmark. To the best of our knowledge, our proposed task-conditioned approach achieves the best single-modality detection results."



Paperid:997
Authors:Junhua Zou, Zhisong Pan, Junyang Qiu, Xin Liu, Ting Rui, Wei Li
Title: Improving the Transferability of Adversarial Examples with Resized-Diverse-Inputs, Diversity-Ensemble and Region Fitting
Abstract:
We introduce a three stage pipeline: resized-diverse-inputs (RDIM), diversity-ensemble (DEM) and region fitting, that work together to generate transferable adversarial examples. We first explore the internal relationship between existing attacks, and propose RDIM that is capable of exploiting this relationship. Then we propose DEM, the multi-scale version of RDIM, to generate multi-scale gradients. After the first two steps we transform value fitting into region fitting across iterations. RDIM and region fitting do not require extra running time and these three steps can be well integrated into other attacks. Our best attack fools six black-box defenses with a 93% success rate on average, which is higher than the state-of-the-art gradient-based attacks. Besides, we rethink existing attacks rather than simply stacking new methods on the old ones to get better performance. It is expected that our findings will serve as the beginning of exploring the internal relationship between attack methods."
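As background, resized-input transforms of this family typically apply a random resize followed by random padding back to the original resolution before each gradient step; the sketch below shows that generic step only (the scale range, probability, and padding scheme are assumptions), not the specific RDIM/DEM configuration of the paper.

```python
import torch
import torch.nn.functional as F

def resized_diverse_input(x, low=300, high=330, prob=0.7):
    """Randomly resize and zero-pad an image batch back to its original resolution.

    x: (B, C, H, W) batch with H == W == high (an assumption of this sketch).
    """
    if torch.rand(1).item() > prob:
        return x
    size = int(torch.randint(low, high, (1,)).item())
    resized = F.interpolate(x, size=(size, size), mode="nearest")
    pad = high - size
    top = int(torch.randint(0, pad + 1, (1,)).item())
    left = int(torch.randint(0, pad + 1, (1,)).item())
    # Pad (left, right, top, bottom) so the output is high x high again.
    return F.pad(resized, (left, pad - left, top, pad - top), value=0.0)
```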



Paperid:998
Authors:Yonggang Li, Guosheng Hu, Yongtao Wang, Timothy Hospedales, Neil M. Robertson, Yongxin Yang
Title: DADA: Differentiable Automatic Data Augmentation
Abstract:
Data augmentation (DA) techniques aim to increase data variability, and thus train deep networks with better generalisation. The pioneering AutoAugment automated the search for optimal DA policies with reinforcement learning. However, AutoAugment is extremely computationally expensive, limiting its wide applicability. Followup works such as Population Based Augmentation (PBA) and Fast AutoAugment improved efficiency, but their optimization speed remains a bottleneck. In this paper, we propose Differentiable Automatic Data Augmentation (DADA) which dramatically reduces the cost. DADA relaxes the discrete DA policy selection to a differentiable optimization problem via Gumbel-Softmax. In addition, we introduce an unbiased gradient estimator, RELAX, leading to an efficient and effective one-pass optimization strategy to learn an efficient and accurate DA policy. We conduct extensive experiments on CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. Furthermore, we demonstrate the value of Auto DA in pre-training for downstream detection problems. Results show our DADA is at least one order of magnitude faster than the state-of-the-art while achieving very comparable accuracy. The code is available at https://github.com/VDIGPKU/DADA."
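The core relaxation, sampling a soft, differentiable choice over candidate augmentation operations with Gumbel-Softmax, can be sketched as follows; the logits, temperature, and the way the soft weights are applied to images are placeholders rather than DADA's full bi-level formulation with the RELAX estimator.

```python
import torch
import torch.nn.functional as F

def sample_augmentation_weights(op_logits, tau=1.0):
    """Relax the discrete choice of an augmentation operation via Gumbel-Softmax,
    so the selection parameters receive gradients.

    op_logits: (num_ops,) learnable logits over candidate operations.
    Returns soft one-hot weights over the operations.
    """
    return F.gumbel_softmax(op_logits, tau=tau, hard=False)

# Example usage (candidate_ops is a hypothetical list of image transforms):
#   weights = sample_augmentation_weights(op_logits)
#   augmented = sum(w * op(img) for w, op in zip(weights, candidate_ops))
```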



Paperid:999
Authors:Armen Avetisyan, Tatiana Khanova, Christopher Choy, Denver Dash, Angela Dai, Matthias Nießner
Title: SceneCAD: Predicting Object Alignments and Layouts in RGB-D Scans
Abstract:
We present a novel approach to reconstructing lightweight, CAD-based representations of scanned 3D environments from commodity RGB-D sensors. Our key idea is to jointly optimize for both CAD model alignments as well as layout estimations of the scanned scene, explicitly modeling inter-relationships between objects as well as between objects and the layout. Since object arrangement and scene layout are intrinsically coupled, we show that treating the problem jointly significantly helps to produce globally-consistent representations of a scene. Object CAD models are aligned to the scene by establishing dense correspondences between geometry, and we introduce a hierarchical layout prediction approach to estimate layout planes from corners and edges of the scene. To this end, we propose a message-passing graph neural network to model the inter-relationships between objects and layout, guiding the generation of a globally consistent object alignment in a scene. By considering the global scene layout, we achieve significantly improved CAD alignments compared to state-of-the-art methods, improving from 41.83% to 58.41% alignment accuracy on SUNCG and from 50.05% to 61.24% on ScanNet, respectively. The resulting CAD-based representations make our method well-suited for applications in content creation such as augmented- or virtual reality."



Paperid:1000
Authors:Wei Wang, Shaodi You, Theo Gevers
Title: Kinship Identification through Joint Learning using Kinship Verification Ensembles
Abstract:
Kinship verification is a well-explored task: identifying whether or not two persons are kin. In contrast, kinship identification, which aims to further identify the particular type of kinship, has been largely ignored so far. A simple extension of kinship verification falls short of proper identification, because existing verification networks are individually trained on specific kinships and do not consider the context between different kinship types. Also, existing kinship verification datasets have biased positive-negative distributions that differ from real-world distributions. To this end, we propose a novel kinship identification approach based on joint training of kinship verification ensembles and classification modules. We propose to rebalance the training dataset to make it more realistic. Large-scale experiments demonstrate appealing performance on kinship identification. The experiments further show a significant performance improvement in kinship verification when trained on the same dataset with more realistic distributions."



Paperid:1001
Authors:Hongje Seong, Junhyuk Hyun, Euntai Kim
Title: Kernelized Memory Network for Video Object Segmentation
Abstract:
Semi-supervised video object segmentation (VOS) is a task that involves predicting a target object in a video when the ground truth segmentation mask of the target object is given in the first frame. Recently, space-time memory networks (STM) have received significant attention as a promising solution for semi-supervised VOS. However, an important point is overlooked when applying STM to VOS. The solution (STM) is non-local, but the problem (VOS) is predominantly local. To solve the mismatch between STM and VOS, we propose a kernelized memory network (KMN). Before being trained on real videos, our KMN is pre-trained on static images, as in previous works. Unlike in previous works, we use the Hide-and-Seek strategy in pre-training to obtain the best possible results in handling occlusions and segment boundary extraction. The proposed KMN surpasses the state-of-the-art on standard benchmarks by a significant margin (+5% on DAVIS 2017 test-dev set). In addition, the runtime of KMN is 0.12 seconds per frame on the DAVIS 2016 validation set, and the KMN rarely requires extra computation, when compared with STM."



Paperid:1002
Authors:Xiaoqi Zhao, Lihe Zhang, Youwei Pang, Huchuan Lu, Lei Zhang
Title: A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection
Abstract:
Existing RGB-D salient object detection (SOD) approaches concentrate on the cross-modal fusion between the RGB stream and the depth stream. They do not deeply explore the effect of the depth map itself. In this work, we design a single stream network to directly use the depth map to guide early fusion and middle fusion between RGB and depth, which saves the feature encoder of the depth stream and achieves a lightweight and real-time model. We tactfully utilize depth information from two perspectives: (1) Overcoming the incompatibility problem caused by the great difference between modalities, we build a single stream encoder to achieve the early fusion, which can take full advantage of ImageNet pre-trained backbone model to extract rich and discriminative features. (2) We design a novel depth-enhanced dual attention module (DEDA) to efficiently provide the fore-/back-ground branches with the spatially filtered features, which enables the decoder to optimally perform the middle fusion. Besides, we put forward a pyramidally attended feature extraction module (PAFE) to accurately localize the objects of different scales. Extensive experiments demonstrate that the proposed model performs favorably against most state-of-the-art methods under different evaluation metrics. Furthermore, this model is 55.5% lighter than the current lightest model and runs at a real-time speed of 32 FPS when processing a $384 \times 384$ image."



Paperid:1003
Authors:Tianyi Zhang, Guosheng Lin, Weide Liu, Jianfei Cai, Alex Kot
Title: Splitting vs. Merging: Mining Object Regions with Discrepancy and Intersection Loss for Weakly Supervised Semantic Segmentation
Abstract:
In this paper, we focus on the task of weakly-supervised semantic segmentation supervised with image-level labels. Since the pixel-level annotation is not available in the training process, we rely on region mining models to estimate the pseudo-masks from the image-level labels. Thus, in order to improve the final segmentation results, we aim to train a region-mining model which could accurately and completely highlight the target object regions for generating high-quality pseudo-masks. However, the region mining models are likely to only highlight the most discriminative regions instead of the entire objects. In this paper, we aim to tackle this problem from the novel perspective of the optimization process. We propose a Splitting vs. Merging optimization strategy, which is mainly composed of the Discrepancy loss and the Intersection loss. The proposed Discrepancy loss aims at mining out regions of different spatial patterns instead of only the most discriminative region, which leads to the splitting effect. The Intersection loss aims at mining the common regions of the different maps, which leads to the merging effect. Our Splitting vs. Merging strategy helps to expand the output heatmap of the region mining model to the object scale. Finally, by training the segmentation model with the masks generated by our Splitting vs. Merging strategy, we achieve the state-of-the-art weakly-supervised segmentation results on the Pascal VOC 2012 benchmark."



Paperid:1004
Authors:Chunluan Zhou, Zhou Ren, Gang Hua
Title: Temporal Keypoint Matching and Refinement Network for Pose Estimation and Tracking
Abstract:
Multi-person pose estimation and tracking in realistic videos is very challenging due to factors such as occlusions, fast motion and pose variations. Top-down approaches are commonly used for this task, which involves three stages: person detection, single-person pose estimation, and pose association across time. Recently, significant progress has been made in person detection and single-person pose estimation. In this paper, we mainly focus on improving pose association and estimation in a video to build a strong pose estimator and tracker. To this end, we propose a novel temporal keypoint matching and refinement network. Specifically, we propose two network modules, temporal keypoint matching and temporal keypoint refinement, which are incorporated into a single-person pose estimation network. The temporal keypoint matching module learns a similarity metric for matching keypoints across frames. Pose matching is performed by aggregating keypoint similarities between poses in adjacent frames. The temporal keypoint refinement module serves to correct individual poses by utilizing their associated poses in neighboring frames as temporal context. We validate the effectiveness of our proposed network on two benchmark datasets: PoseTrack 2017 and PoseTrack 2018. Experimental results show that our approach achieves state-of-the-art performance on both datasets."



Paperid:1005
Authors:Kara-Ali Aliev, Artem Sevastopolsky, Maria Kolos, Dmitry Ulyanov, Victor Lempitsky
Title: Neural Point-Based Graphics
Abstract:
We present a new point-based approach for modeling the appearance of real scenes. The approach uses a raw point cloud as the geometric representation of a scene, and augments each point with a learnable neural descriptor that encodes local geometry and appearance. A deep rendering network is learned in parallel with the descriptors, so that new views of the scene can be obtained by passing the rasterizations of a point cloud from new viewpoints through this network. The input rasterizations use the learned descriptors as point pseudo-colors. We show that the proposed approach can be used for modeling complex scenes and obtaining their photorealistic views, while avoiding explicit surface estimation and meshing. In particular, compelling results are obtained for scenes scanned using hand-held commodity RGB-D sensors as well as standard RGB cameras even in the presence of objects that are challenging for standard mesh-based modeling."



Paperid:1006
Authors:Bin He, Ce Wang, Boxin Shi, Ling-Yu Duan
Title: FHDe²Net: Full High Definition Demoireing Network
Abstract:
Frequency aliasing in the digital capture of display screens leads to the moiré pattern, appearing as stripe-shaped distortions in images. Efforts at demoiréing have been made recently in a learning fashion due to the complexity and diversity of the pattern appearance. However, existing methods cannot satisfy the practical demand of demoiréing on camera phones, which capture more pixels than a full high definition (FHD) image, posing additional challenges of a wider pattern scale range and fine detail preservation. We propose the Full High Definition Demoiréing Network (FHDe$^2$Net) to solve such problems. The framework consists of a global to local cascaded removal branch to eradicate multi-scale moiré patterns and a frequency based high resolution content separation branch to retain fine details. We further collect an FHD moiré image dataset as a new benchmark for training and evaluation. Comparison experiments and ablation studies have verified the effectiveness of the proposed framework and each functional module both quantitatively and qualitatively in practical application scenarios."



Paperid:1007
Authors:Dipu Manandhar, Dan Ruta, John Collomosse
Title: Learning Structural Similarity of User Interface Layouts using Graph Networks
Abstract:
We propose a novel representation learning technique for measuring the similarity of user interface designs. A triplet network is used to learn a search embedding for layout similarity, with a hybrid encoder-decoder backbone comprising a graph convolutional network (GCN) and convolutional decoder (CNN). The properties of interface components and their spatial relationships are encoded via a graph which also models the containment (nesting) relationships of interface components. We supervise the training of a dual reconstruction and pair-wise loss using an auxiliary measure of layout similarity based on intersection over union (IoU) distance. The resulting embedding is shown to exceed state of the art performance for visual search of user interface layouts over the public Rico dataset, and an auto-annotated dataset of interface layouts collected from the web."
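The supervision described above centres on a triplet objective over layout embeddings; the sketch below shows that generic triplet step only (the encoder, margin value, and how positives/negatives are mined from the IoU-based layout similarity are assumptions), not the full GCN-CNN reconstruction pipeline of the paper.

```python
import torch
import torch.nn as nn

# Hypothetical encoder mapping one UI layout representation to an embedding vector.
triplet_loss = nn.TripletMarginLoss(margin=0.2)

def layout_triplet_step(encoder, anchor, positive, negative):
    """One triplet step: pull layouts deemed similar (e.g. by IoU-based similarity)
    together in the search embedding and push dissimilar layouts apart."""
    za, zp, zn = encoder(anchor), encoder(positive), encoder(negative)
    return triplet_loss(za, zp, zn)
```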



Paperid:1008
Authors:Yutao Hu, Xiaolong Jiang, Xuhui Liu, Baochang Zhang, Jungong Han, Xianbin Cao, David Doermann
Title: NAS-Count: Counting-by-Density with Neural Architecture Search
Abstract:
Most of the recent advances in crowd counting have evolved from hand-designed density estimation networks, where multi-scale features are leveraged to address the scale variation problem, but at the expense of demanding design efforts. In this work, we automate the design of counting models with Neural Architecture Search (NAS) and introduce an end-to-end searched encoder-decoder architecture, Automatic Multi-Scale Network (AMSNet). Specifically, we utilize a counting-specific two-level search space. The encoder and decoder in AMSNet are composed of different cells discovered from micro-level search, while the multi-path architecture is explored through macro-level search. To solve the pixel-level isolation issue in MSE loss, AMSNet is optimized with an auto-searched Scale Pyramid Pooling Loss (SPPLoss) that supervises the multi-scale structural information. Extensive experiments on four datasets show AMSNet produces state-of-the-art results that outperform hand-designed models, fully demonstrating the efficacy of NAS-Count."



Paperid:1009
Authors:Andrea Simonelli, Samuel Rota Buló, Lorenzo Porzi, Elisa Ricci, Peter Kontschieder
Title: Towards Generalization Across Depth for Monocular 3D Object Detection
Abstract:
While expensive LiDAR and stereo camera rigs have enabled the development of successful 3D object detection methods, monocular RGB-only approaches lag much behind. This work advances the state of the art by introducing MoVi-3D, a novel, single-stage deep architecture for monocular 3D object detection. MoVi-3D builds upon a novel approach which leverages geometrical information to generate, both at training and test time, virtual views where the object appearance is normalized with respect to distance. These virtually generated views facilitate the detection task as they significantly reduce the visual appearance variability associated to objects placed at different distances from the camera. As a consequence, the deep model is relieved from learning depth-specific representations and its complexity can be significantly reduced. In particular, in this work we show that, thanks to our virtual views generation process, a lightweight, single-stage architecture suffices to set new state-of-the-art results on the popular KITTI3D benchmark."



Paperid:1010
Authors:Corneliu Florea, Mihai Badea, Laura Florea, Andrei Racoviteanu, Constantin Vertan
Title: Margin-Mix: Semi–Supervised Learning for Face Expression Recognition
Abstract:
In this paper, aiming to construct a semi-supervised learning algorithm, we exploit the characteristics of deep convolutional networks to provide, for an input image, both an embedding descriptor and a prediction. The unlabeled data are combined with the labeled data to provide synthetic data that better describe the input space. The network is asked to provide a large margin between clusters, while new data are self-labeled by their distance to the class centroids in the embedding space. The method is tested on standard benchmarks for semi-supervised learning, where it matches state-of-the-art performance, and on the problem of face expression recognition, where it increases the accuracy by a noticeable margin."
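The self-labeling rule described above, assigning each unlabeled sample to its nearest class centroid in the embedding space, can be sketched as follows; the optional rejection threshold and the omission of the margin terms are simplifications of this sketch, not the authors' full Margin-Mix procedure.

```python
import torch

def centroid_self_label(embeddings, centroids, threshold=None):
    """Assign pseudo-labels to unlabeled samples by the nearest class centroid.

    embeddings : (N, D) embeddings of unlabeled samples.
    centroids  : (C, D) per-class centroids computed from labeled data.
    """
    dists = torch.cdist(embeddings, centroids)      # (N, C) pairwise distances
    min_dist, labels = dists.min(dim=1)
    if threshold is not None:
        labels[min_dist > threshold] = -1           # leave uncertain samples unlabeled
    return labels
```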



Paperid:1011
Authors:Marianne Bakken, Johannes Kvam, Alexey A. Stepanov, Asbjørn Berge
Title: Principal Feature Visualisation in Convolutional Neural Networks
Abstract:
We introduce a new visualisation technique for CNNs called Principal Feature Visualisation (PFV). It uses a single forward pass of the original network to map principal features from the final convolutional layer to the original image space as RGB channels. By working on a batch of images we can extract contrasting features, not just the most dominant ones with respect to the classification. This allows us to differentiate between several features in one image in an unsupervised manner. This enables us to assess the feasibility of transfer learning and to debug a pre-trained classifier by localising misleading or missing features."
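One way to read the mapping of principal features to RGB is a batch-wide PCA over the final convolutional activations; the sketch below follows that reading (the use of torch.pca_lowrank and the min-max normalisation to [0, 1] are assumptions), and upsampling back to the input resolution is left out.

```python
import torch

def principal_feature_rgb(feature_maps):
    """Project final-layer conv features onto their top-3 principal components
    and use them as RGB channels.

    feature_maps: (B, C, H, W) activations from a single forward pass over a batch.
    """
    B, C, H, W = feature_maps.shape
    flat = feature_maps.permute(0, 2, 3, 1).reshape(-1, C)     # (B*H*W, C)
    flat = flat - flat.mean(dim=0, keepdim=True)               # centre over the whole batch
    _, _, V = torch.pca_lowrank(flat, q=3, center=False)       # top-3 principal directions
    proj = flat @ V                                            # (B*H*W, 3)
    proj = (proj - proj.min(0).values) / (proj.max(0).values - proj.min(0).values + 1e-8)
    return proj.reshape(B, H, W, 3).permute(0, 3, 1, 2)        # (B, 3, H, W) pseudo-RGB
```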



Paperid:1012
Authors:Xiaolin Song, Kaili Zhao, Wen-Sheng Chu, Honggang Zhang, Jun Guo
Title: Progressive Refinement Network for Occluded Pedestrian Detection
Abstract:
We present Progressive Refinement Network (PRNet), a novel single-stage detector that tackles occluded pedestrian detection. Motivated by humans' progressive process of annotating occluded pedestrians, PRNet achieves sequential refinement by three phases: finding high-confidence anchors of visible parts, calibrating such anchors to a full-body template derived from occlusion statistics, and then adjusting the calibrated anchors to final full-body regions. Unlike conventional methods that exploit predefined anchors, the confidence-aware calibration offers adaptive anchor initialization for detection with occlusions, and helps reduce the gap between visible-part and full-body detection. In addition, we introduce an occlusion loss to up-weight hard examples, and a Receptive Field Backfeed (RFB) module to diversify receptive fields in early layers that commonly fire only on visible parts or small-size full-body regions. Experiments were performed within and across CityPersons, ETH, and Caltech datasets. Results show that PRNet can match the speed of existing single-stage detectors, consistently outperforms alternatives in terms of overall miss rate, and offers significantly better cross-dataset generalization. Code is available."



Paperid:1013
Authors:Ruilong Li, Yuliang Xiu, Shunsuke Saito, Zeng Huang, Kyle Olsewski, Hao Li
Title: Monocular Real-Time Volumetric Performance Capture
Abstract:
We present the first approach to volumetric performance capture and novel-view rendering at real-time speed from monocular video, eliminating the need for expensive multi-view systems or cumbersome pre-acquisition of a personalized template model. Our system reconstructs a fully textured 3D human from each frame by leveraging Pixel-Aligned Implicit Function (PIFu). While PIFu achieves high-resolution reconstruction in a memory-efficient manner, its computationally expensive inference prevents us from deploying such a system for real-time applications. To this end, we propose a novel hierarchical surface localization algorithm and a direct rendering method without explicitly extracting surface meshes. By culling unnecessary regions for evaluation in a coarse-to-fine manner, we successfully accelerate the reconstruction by two orders of magnitude from the baseline without compromising the quality. Furthermore, we introduce an Online Hard Example Mining (OHEM) technique that effectively suppresses failure modes due to the rare occurrence of challenging examples. We adaptively update the sampling probability of the training data based on the current reconstruction accuracy, which effectively alleviates reconstruction artifacts. Our experiments and evaluations demonstrate the robustness of our system to various challenging angles, illuminations, poses, and clothing styles. We also show that our approach compares favorably with state-of-the-art monocular performance capture. Our proposed approach removes the need for multi-view studio settings and enables a consumer-accessible solution for volumetric capture."
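The adaptive sampling described above, drawing hard examples more often based on their current reconstruction accuracy, can be sketched with a simple softmax weighting of per-sample errors; the temperature and the exact update rule are assumptions of this sketch, not the paper's OHEM formulation.

```python
import numpy as np

def update_sampling_probs(errors, temperature=1.0):
    """Recompute per-sample sampling probabilities from the latest reconstruction errors,
    so harder examples are drawn more often.

    errors: (N,) latest per-sample reconstruction error.
    """
    logits = np.asarray(errors, dtype=np.float64) / temperature
    logits -= logits.max()              # numerical stability before exponentiation
    p = np.exp(logits)
    return p / p.sum()

# Example usage:
#   probs = update_sampling_probs(errors)
#   indices = np.random.choice(len(errors), size=batch_size, p=probs)
```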



Paperid:1014
Authors:Christian Ertler, Jerneja Mislej, Tobias Ollmann, Lorenzo Porzi, Gerhard Neuhold, Yubin Kuang
Title: The Mapillary Traffic Sign Dataset for Detection and Classification on a Global Scale
Abstract:
Traffic signs are essential map features for smart cities and navigation. To develop accurate and robust algorithms for traffic sign detection and classification, a large-scale and diverse benchmark dataset is required. In this paper, we introduce a new traffic sign dataset of 105K street-level images around the world covering 400 manually annotated traffic sign classes in diverse scenes, wide range of geographical locations, and varying weather and lighting conditions. The dataset includes 52K fully annotated images. Additionally, we show how to augment the dataset with 53K semi-supervised, partially annotated images. This is the largest and the most diverse traffic sign dataset consisting of images from all over the world with fine-grained annotations of traffic sign classes. We run extensive experiments to establish strong baselines for both detection and classification tasks. In addition, we verify that the diversity of this dataset enables effective transfer learning for existing large-scale benchmark datasets on traffic sign detection and classification. The dataset is freely available for academic research at https://www.mapillary.com/dataset/trafficsign."



Paperid:1015
Authors:Anil Armagan, Guillermo Garcia-Hernando, Seungryul Baek, Shreyas Hampali, Mahdi Rad, Zhaohui Zhang, Shipeng Xie, MingXiu Chen, Boshen Zhang, Fu Xiong, Yang Xiao, Zhiguo Cao, Junsong Yuan, Pengfei Ren, Weiting Huang, Haifeng Sun, Marek Hrúz, Jakub Kanis, Zdeněk Krňoul, Qingfu Wan, Shile Li, Linlin Yang, Dongheui Lee, Angela Yao, Weiguo Zhou, Sijia Mei, Yunhui Liu, Adrian Spurr, Umar Iqbal, Pavlo Molchanov, Philippe Weinzaepfel, Romain Brégier, Grégory Rogez, Vincent Lepetit, Tae-Kyun Kim
Title: Measuring Generalisation to Unseen Viewpoints, Articulations, Shapes and Objects for 3D Hand Pose Estimation under Hand-Object Interaction
Abstract:
We study how well different types of approaches generalise in the task of 3D hand pose estimation under single hand scenarios and hand-object interaction. We show that the accuracy of state-of-the-art methods can drop, and that they fail mostly on poses absent from the training set. Unfortunately, since the space of hand poses is high-dimensional, it is inherently not feasible to cover the whole space densely, despite recent efforts in collecting large-scale training datasets. This sampling problem is even more severe when hands are interacting with objects and/or inputs are RGB rather than depth images, as RGB images also vary with lighting conditions and colors. To address these issues, we designed a public challenge (HANDS'19) to evaluate the abilities of current 3D hand pose estimators (HPEs) to interpolate and extrapolate the poses of a training set. More exactly, HANDS'19 is designed (a) to evaluate the influence of both depth and color modalities on 3D hand pose estimation, under the presence or absence of objects; (b) to assess the generalisation abilities w.r.t. four main axes: shapes, articulations, viewpoints, and objects; (c) to explore the use of synthetic hand models to fill the gaps of current datasets. Through the challenge, the overall accuracy has dramatically improved over the baseline, especially on extrapolation tasks, from 27mm to 13mm mean joint error. Our analyses highlight the impacts of: ensemble approaches, the use of a parametric 3D hand model (MANO), and different HPE methods/backbones."



Paperid:1016
Authors:Sarthak Bhagat, Shagun Uppal, Zhuyun Yin, Nengli Lim
Title: Disentangling Multiple Features in Video Sequences using Gaussian Processes in Variational Autoencoders
Abstract:
We introduce MGP-VAE (Multi-disentangled-features Gaussian Processes Variational AutoEncoder), a variational autoencoder which uses Gaussian processes (GP) to model the latent space for the unsupervised learning of disentangled representations in video sequences. We improve upon previous work by establishing a framework by which multiple features, static or dynamic, can be disentangled. Specifically we use fractional Brownian motions (fBM) and Brownian bridges (BB) to enforce an inter-frame correlation structure in each independent channel, and show that varying this structure enables one to capture different factors of variation in the data. We demonstrate the quality of our representations with experiments on three publicly available datasets, and also quantify the improvement using a video prediction task. Moreover, we introduce a novel geodesic loss function which takes into account the curvature of the data manifold to improve learning. Our experiments show that the combination of the improved representations with the novel loss function enable MGP-VAE to outperform the baselines in video prediction."
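The two prior processes named above have standard covariance functions, which is all a Gaussian-process prior over frame indices needs; the sketch below gives those kernels (the time grid and Hurst parameter are illustrative choices), not the MGP-VAE model itself.

```python
import numpy as np

def fbm_covariance(times, hurst=0.7):
    """Fractional Brownian motion covariance K(s, t) = 0.5 * (s^{2H} + t^{2H} - |s - t|^{2H})."""
    s = np.asarray(times, dtype=float)[:, None]
    t = np.asarray(times, dtype=float)[None, :]
    h2 = 2.0 * hurst
    return 0.5 * (s ** h2 + t ** h2 - np.abs(s - t) ** h2)

def brownian_bridge_covariance(times, total_time=1.0):
    """Brownian bridge covariance K(s, t) = min(s, t) - s * t / T on [0, T]."""
    s = np.asarray(times, dtype=float)[:, None]
    t = np.asarray(times, dtype=float)[None, :]
    return np.minimum(s, t) - s * t / total_time

# Example: sample one latent channel over T frames from the fBM prior.
#   t = np.linspace(1e-3, 1.0, T)
#   z = np.random.multivariate_normal(np.zeros(T), fbm_covariance(t))
```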



Paperid:1017
Authors:Van Nhan Nguyen, Sigurd Løkse, Kristoffer Wickstrøm, Michael Kampffmeyer, Davide Roverso, Robert Jenssen
Title: SEN: A Novel Feature Normalization Dissimilarity Measure for Prototypical Few-Shot Learning Networks
Abstract:
In this paper, we equip Prototypical Networks (PNs) with a novel dissimilarity measure to enable discriminative feature normalization for few-shot learning. The embedding onto the hypersphere requires no direct normalization and is easy to optimize. Our theoretical analysis shows that the proposed dissimilarity measure, denoted the Squared root of the Euclidean distance and the Norm distance (SEN), forces embedding points to be attracted to its correct prototype, while being repelled from all other prototypes, keeping the norm of all points the same. The resulting SEN PN outperforms the regular PN with a considerable margin, with no additional parameters as well as with negligible computational overhead."



Paperid:1018
Authors:Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, Bernt Schiele
Title: Kinematic 3D Object Detection in Monocular Video
Abstract:
Perceiving the physical world in 3D is fundamental for self-driving applications. Although temporal motion is an invaluable resource to human vision for detection, tracking, and depth perception, such features have not been thoroughly utilized in modern 3D object detectors. In this work, we propose a novel method for monocular video-based 3D object detection which carefully leverages kinematic motion to improve precision of 3D localization. Specifically, we first propose a novel decomposition of object orientation as well as a self-balancing 3D confidence. We show that both components are critical to enable our kinematic model to work effectively. Collectively, using only a single model, we efficiently leverage 3D kinematics from monocular videos to improve the overall localization precision in 3D object detection while also producing useful by-products of scene dynamics (ego-motion and per-object velocity). We achieve state-of-the-art performance on monocular 3D object detection and the Bird's Eye View tasks within the KITTI self-driving dataset. "



Paperid:1019
Authors:Ye Zhu, Yu Wu, Yi Yang, Yan Yan
Title: Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents
Abstract:
With the arising concerns for the AI systems provided with direct access to abundant sensitive information, researchers seek to develop more reliable AI with implicit information sources. To this end, in this paper, we introduce a new task called video description via two multi-modal cooperative dialog agents, whose ultimate goal is for one conversational agent to describe an unseen video based on the dialog and two static frames. Specifically, one of the intelligent agents - Q-BOT - is given two static frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent who has already seen the entire video, assists Q-BOT to accomplish the goal by providing answers to those questions. We propose a QA-Cooperative Network with a dynamic dialog history update learning mechanism to transfer knowledge from A-BOT to Q-BOT, thus helping Q-BOT to better describe the video. Extensive experiments demonstrate that Q-BOT can effectively learn to describe an unseen video by the proposed model and the cooperative learning method, achieving the promising performance where Q-BOT is given the full ground truth history dialog."



Paperid:1020
Authors:Sangmin Lee, Jung Uk Kim, Hak Gu Kim, Seongyeop Kim, Yong Man Ro
Title: SACA Net: Cybersickness Assessment of Individual Viewers for VR Content via Graph-based Symptom Relation Embedding
Abstract:
Recently, cybersickness assessment for VR content is required to deal with viewing safety issues. Assessing physical symptoms of individual viewers is challenging but important to provide detailed and personalized guides for viewing safety. In this paper, we propose a novel symptom-aware cybersickness assessment network (SACA Net) that quantifies physical symptom levels for assessing cybersickness of individual viewers. The SACA Net is designed to utilize the relational characteristics of symptoms for complementary effects among relevant symptoms. The proposed network consists of three main parts: a stimulus symptom context guider, a physiological symptom guider, and a symptom relation embedder. The stimulus symptom context guider and the physiological symptom guider extract symptom features from VR content and human physiology, respectively. The symptom relation embedder refines the stimulus-response symptom features to effectively predict cybersickness by embedding relational characteristics with graph formulation. For validation, we utilize two public 360-degree video datasets that contain cybersickness scores and physiological signals. Experimental results show that the proposed method is effective in predicting human cybersickness with physical symptoms. Further, latent relations among symptoms are interpretable by analyzing relational weights in the proposed network."



Paperid:1021
Authors:Ziyi Meng, Jiawei Ma, Xin Yuan
Title: End-to-End Low Cost Compressive Spectral Imaging with Spatial-Spectral Self-Attention
Abstract:
Coded aperture snapshot spectral imaging (CASSI) is an effective tool to capture real-world 3D hyperspectral images. While a considerable amount of existing work addresses hardware and algorithm design, we take a step towards a low-cost solution that enjoys video-rate, high-quality reconstruction. To make solid progress on this challenging yet under-investigated task, we reproduce a stable single disperser (SD) CASSI system to gather large-scale real-world CASSI data and propose a novel deep convolutional network to carry out real-time reconstruction using self-attention. In order to jointly capture the self-attention across different dimensions in hyperspectral images (i.e., channel-wise spectral correlation and non-local spatial regions), we propose Spatial-Spectral Self-Attention (TSA) to process each dimension sequentially, yet in an order-independent manner. We employ TSA in an encoder-decoder network, dubbed TSA-Net, to reconstruct the desired 3D cube. Furthermore, we investigate how noise affects the results and propose to add shot noise in model training, which improves the real data results significantly. We hope our large-scale CASSI data serve as a benchmark in future research and our TSA model as a baseline in deep learning based reconstruction algorithms."



Paperid:1022
Authors:Goutam Bhat, Martin Danelljan, Luc Van Gool, Radu Timofte
Title: Know Your Surroundings: Exploiting Scene Information for Object Tracking
Abstract:
Current state-of-the-art trackers rely only on a target appearance model in order to localize the object in each frame. Such approaches are, however, prone to fail in cases of, e.g., fast appearance changes or the presence of distractor objects, where a target appearance model alone is insufficient for robust tracking. Having knowledge about the presence and locations of other objects in the surrounding scene can be highly beneficial in such cases. This scene information can be propagated through the sequence and used to, for instance, explicitly avoid distractor objects and eliminate target candidate regions. In this work, we propose a novel tracking architecture which can utilize scene information for tracking. Our tracker represents such information as dense localized state vectors, which can encode, for example, whether a local region is target, background, or distractor. These state vectors are propagated through the sequence and combined with the appearance model output to localize the target. Our network is learned to effectively utilize the scene information by directly maximizing tracking performance on video segments. The proposed approach sets a new state of the art on 3 tracking benchmarks, achieving an AO score of 63.6% on the recent GOT-10k dataset."



Paperid:1023
Authors:Ren Wang, Gaoyuan Zhang, Sijia Liu, Pin-Yu Chen, Jinjun Xiong, Meng Wang
Title: Practical Detection of Trojan Neural Networks: Data-Limited and Data-Free Cases
Abstract:
When the training data are maliciously tampered with, the predictions of the acquired deep neural network (DNN) can be manipulated by an adversary; this is known as the Trojan attack (or poisoning backdoor attack). The lack of robustness of DNNs against Trojan attacks could significantly harm real-life machine learning (ML) systems in downstream applications, thereby raising widespread concerns about their trustworthiness. In this paper, we study the problem of Trojan network (TrojanNet) detection in the data-scarce regime, where only the weights of a trained DNN are accessed by the detector. We first propose a data-limited TrojanNet detector (TND), when only a few data samples are available for TrojanNet detection. We show that an effective data-limited TND can be established by exploring connections between Trojan attacks and prediction-evasion adversarial attacks, including per-sample attacks as well as all-sample universal attacks. In addition, we propose a data-free TND, which can detect a TrojanNet without accessing any data samples. We show that such a TND can be built by leveraging the internal response of hidden neurons, which exhibits the Trojan behavior even at random noise inputs. The effectiveness of our proposals is evaluated by extensive experiments under different model architectures and datasets including CIFAR-10, GTSRB, and ImageNet."



Paperid:1024
Authors:Haomin Chen, Yirui Wang, Kang Zheng, Weijian Li, Chi-Tung Chang, Adam P. Harrison, Jing Xiao, Gregory D. Hager, Le Lu, Chien-Hung Liao, Shun Miao
Title: Anatomy-Aware Siamese Network: Exploiting Semantic Asymmetry for Accurate Pelvic Fracture Detection in X-ray Images
Abstract:
Trauma pelvic X-rays (PXRs) are essential for instantaneous pelvic bone fracture detection. However, small, pathologically critical fractures can be missed, even by experienced clinicians, under the very limited diagnosis times allowed in urgent care. As a result, computer-aided diagnosis (CAD) of fractures is in high demand to save time and to help physicians detect otherwise missed fractures more accurately and reliably. In this work, we present a new approach to fracture detection that uses a Siamese network to take advantage of the anatomical symmetry of pelvic structures to improve fracture detection. We show that symmetric alignment at the network feature level makes Siamese learning more spatially accurate and also reduces the influence of imaging artifacts. Metric learning on Siamese deep features further improves CAD differentiation power with the anatomical symmetry cue. We evaluate our method on 2,359 PXR patients, reporting an area under the ROC curve value of 0.9771, the highest among state-of-the-art fracture detection methods. "



Paperid:1025
Authors:Elizaveta Logacheva, Roman Suvorov, Oleg Khomenko, Anton Mashikhin, Victor Lempitsky
Title: DeepLandscape: Adversarial Modeling of Landscape Videos
Abstract:
We build a new model of landscape videos that can be trained on a mixture of static landscape images as well as landscape animations. Our architecture extends the StyleGAN model by augmenting it with components that model dynamic changes in a scene. Once trained, our model can be used to generate realistic time-lapse landscape videos with moving objects and time-of-the-day changes. Furthermore, by fitting the learned models to a static landscape image, the latter can be reenacted in a realistic way. We propose simple but necessary modifications to the StyleGAN inversion procedure, which lead to in-domain latent codes and allow manipulation of real images. Quantitative comparisons and user studies suggest that our model produces more compelling animations of given photographs than previously proposed methods. The results of our approach, including comparisons with prior art, can be seen in the supplementary materials and on the project page https://saic-mdal.github.io/deep-landscape/."



Paperid:1026
Authors:Lei Kang, Pau Riba, Yaxing Wang, Marçal Rusiñol, Alicia Fornés, Mauricio Villegas
Title: GANwriting: Content-Conditioned Generation of Styled Handwritten Word Images
Abstract:
Although current image generation methods have reached impressive quality levels, they are still unable to produce plausible yet diverse images of handwritten words. On the contrary, when writing by hand, a great variability is observed across different writers, and even when analyzing words scribbled by the same individual, involuntary variations are conspicuous. In this work, we take a step closer to producing realistic and varied artificially rendered handwritten words. We propose a novel method that is able to produce credible handwritten word images by conditioning the generative process with both calligraphic style features and textual content. Our generator is guided by three complementary learning objectives: to produce realistic images, to imitate a certain handwriting style and to convey a specific textual content. Our model is not constrained to any predefined vocabulary and can render any input word. Given a sample writer, it is also able to mimic that writer's calligraphic features in a few-shot setup. We significantly advance over prior art and demonstrate with qualitative, quantitative and human-based evaluations the realistic aspect of our synthetically produced images."



Paperid:1027
Authors:Yingqian Wang,  Longguang Wang,  Jungang Yang, Wei An,  Jingyi Yu,  Yulan Guo
Title: Spatial-Angular Interaction for Light Field Image Super-Resolution
Abstract:
Light field (LF) cameras record both intensity and directions of light rays, and capture scenes from a number of viewpoints. Both the information within each perspective (i.e., spatial information) and that among different perspectives (i.e., angular information) are beneficial to image super-resolution (SR). In this paper, we propose a spatial-angular interactive network (namely, LF-InterNet) for LF image SR. Specifically, spatial and angular features are first separately extracted from input LFs, and then repetitively interacted to progressively incorporate spatial and angular information. Finally, the interacted features are fused to super-resolve each sub-aperture image. Experimental results demonstrate the superiority of LF-InterNet over the state-of-the-art methods, i.e., our method can achieve high PSNR and SSIM scores with low computational cost, and recover faithful details in the reconstructed images."



Paperid:1028
Authors:Adrian Bulat, Brais Martinez, Georgios Tzimiropoulos
Title: BATS: Binary ArchitecTure Search
Abstract:
This paper proposes Binary ArchitecTure Search (BATS), a framework that drastically reduces the accuracy gap between binary neural networks and their real-valued counterparts by means of Neural Architecture Search (NAS). We show that directly applying NAS to the binary domain provides very poor results. To alleviate this, we describe, to our knowledge, for the first time, the 3 key ingredients for successfully applying NAS to the binary domain. Specifically, we (1) introduce and design a novel binary-oriented search space, (2) propose a new mechanism for controlling and stabilising the resulting searched topologies, (3) propose and validate a series of new search strategies for binary networks that lead to faster convergence and lower search times. Experimental results demonstrate the effectiveness of the proposed approach and the necessity of searching in the binary space directly. Moreover, (4) we set a new state-of-the-art for binary neural networks on CIFAR10, CIFAR100 and ImageNet datasets. Code will be made available."



Paperid:1029
Authors:Ze Liu, Han Hu, Yue Cao, Zheng Zhang, Xin Tong
Title: A Closer Look at Local Aggregation Operators in Point Cloud Analysis
Abstract:
Recent advances in network architectures for point cloud processing are mainly driven by new designs of local aggregation operators. However, the impact of these operators on network performance has not been carefully investigated due to the different overall network architectures and implementation details in each solution. Meanwhile, most operators are only applied in shallow architectures. In this paper, we revisit the representative local aggregation operators and study their performance using the same deep residual architecture. Our investigation reveals that despite the different designs of these operators, all of them make surprisingly similar contributions to network performance under the same network input and feature numbers, and result in state-of-the-art accuracy on standard benchmarks. This finding stimulates us to rethink the necessity of sophisticated designs of local aggregation operators for point cloud processing. To this end, we propose a simple local aggregation operator without learnable weights, named Position Pooling (PosPool), which performs similarly to or slightly better than existing sophisticated operators. In particular, a simple deep residual network with PosPool layers exhibits outstanding performance on all benchmarks, outperforming the previous state-of-the-art methods on the challenging PartNet dataset by a large margin (6.4 mIoU). "



Paperid:1030
Authors:Youssef A. Mejjati, Celso F. Gomez, Kwang In Kim, Eli Shechtman, Zoya Bylinskii
Title: Look here! A parametric learning based approach to redirect visual attention
Abstract:
Across photography, marketing, and website design, being able to direct the viewer's attention is a powerful tool. Motivated by professional workflows, we introduce an automatic method to make an image region more attention-capturing via subtle image edits that maintain realism and fidelity to the original. From an input image and a user-provided mask, our GazeShiftNet model predicts a distinct set of global parametric transformations to be applied to the foreground and background image regions separately. We present the results of quantitative and qualitative experiments that demonstrate improvements over prior state-of-the-art. In contrast to existing attention shifting algorithms, our global parametric approach better preserves image semantics and avoids typical generative artifacts. Our edits enable inference at interactive rates on any image size, and easily generalize to videos. Extensions of our model allow for multi-style edits and the ability to both increase and attenuate attention in an image region. Furthermore, users can customize the edited images by dialing the edits up or down via interpolations in parameter space. This paper presents a practical tool that can simplify future image editing pipelines."



Paperid:1031
Authors:Henry Li, Ofir Lindenbaum, Xiuyuan Cheng, Alexander Cloninger
Title: Variational Diffusion Autoencoders with Random Walk Sampling
Abstract:
Variational autoencoders (VAEs) and generative adversarial networks (GANs) enjoy an intuitive connection to manifold learning: in training, the decoder/generator is optimized to approximate a homeomorphism between the data distribution and the sampling space. This is a construction that strives to define the data manifold. A major obstacle to VAEs and GANs, however, is choosing a suitable prior that matches the data topology. Well-known consequences of poorly picked priors are posterior and mode collapse. To our knowledge, no existing method sidesteps this user choice. Conversely, diffusion maps automatically infer the data topology and enjoy a rigorous connection to manifold learning, but do not scale easily or provide the inverse homeomorphism (i.e., decoder/generator). We propose a method that combines these approaches into a generative model that inherits the asymptotic guarantees of diffusion maps while preserving the scalability of deep models. We prove approximation-theoretic results for the dimension dependence of our proposed method. Finally, we demonstrate the effectiveness of our method with various real and synthetic datasets."



Paperid:1032
Authors:Xin Wen, Biying Li, Haiyun Guo, Zhiwei Liu, Guosheng Hu, Ming Tang, Jinqiao Wang
Title: Adaptive Variance Based Label Distribution Learning For Facial Age Estimation
Abstract:
Estimating age from a single facial image is a classic and challenging topic in computer vision. One of its most intractable issues is label ambiguity, i.e., face images of the same person at adjacent ages are often indistinguishable. Some existing methods adopt distribution learning to tackle this issue by exploiting the semantic correlation between age labels. In practice, most of them set a fixed value for the variance of the Gaussian label distribution for all images. However, the variance is closely related to the correlation between adjacent ages and should vary across ages and identities. To model a sample-specific variance, in this paper, we propose an adaptive variance based distribution learning (AVDL) method for facial age estimation. AVDL introduces a data-driven optimization framework, meta-learning, to achieve this. Specifically, AVDL performs a meta gradient descent step on the variable (i.e., the variance) to minimize the loss on a clean, unbiased validation set. By adaptively learning a proper variance for each sample, our method can approximate the true age probability distribution more effectively. Extensive experiments on the FG-NET and MORPH II datasets show the superiority of our proposed approach over existing state-of-the-art methods."
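To make the fixed-versus-adaptive variance distinction concrete, the sketch below shows how a discretized Gaussian label distribution over ages with a per-sample standard deviation can be built and compared against the predicted age distribution with a KL-divergence loss. This is a minimal illustration under assumed conventions (PyTorch, ages 0-100), not the authors' code; the meta-gradient update of the variance on a clean validation set is omitted.

```python
import torch
import torch.nn.functional as F

# Assumed discrete age range 0..100 (illustrative, not from the paper).
AGES = torch.arange(0, 101, dtype=torch.float32)

def gaussian_label_distribution(age, sigma):
    """Discretized Gaussian over integer ages, centered at the ground-truth age.

    age:   (B,) ground-truth ages
    sigma: (B,) per-sample standard deviations (the 'adaptive variance')
    """
    diff = AGES[None, :] - age[:, None]                 # (B, K) distance to each age bin
    logits = -0.5 * (diff / sigma[:, None]) ** 2        # unnormalized Gaussian log-density
    return F.softmax(logits, dim=1)                     # normalized label distribution

def label_distribution_loss(pred_logits, age, sigma):
    """KL divergence between the target Gaussian distribution and the prediction."""
    target = gaussian_label_distribution(age, sigma)
    log_pred = F.log_softmax(pred_logits, dim=1)
    return F.kl_div(log_pred, target, reduction="batchmean")
```

With a fixed-variance method, `sigma` would be a single constant for every sample; the adaptive setting instead treats it as a learnable, per-sample quantity.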



Paperid:1033
Authors:Shasha Li, Shitong Zhu, Sudipta Paul, Amit Roy-Chowdhury, Chengyu Song, Srikanth Krishnamurthy, Ananthram Swami, Kevin S Chan
Title: Connecting the Dots: Detecting Adversarial Perturbations Using Context Inconsistency
Abstract:
There has been a recent surge in research on adversarial perturbations that defeat Deep Neural Networks (DNNs); most of these attacks target object classifiers. Inspired by the observation that humans are able to recognize objects that appear out of place in a scene or along with other unlikely objects, we augment the DNN with a system that learns context consistency rules during training and checks for violations of those rules during testing. In brief, our approach builds a set of autoencoders, one for each object class, trained so that they output a discrepancy between the input and output if a perturbation was added to the sample, triggering a context violation. Experiments on PASCAL VOC and MS COCO show that our method effectively detects various adversarial attacks and achieves high ROC-AUC (over 0.95 in most cases); this corresponds to a 20-45% improvement over a baseline context-agnostic method."



Paperid:1034
Authors:Abbas Sadat, Sergio Casas, Mengye Ren, Xinyu Wu, Pranaab Dhawan, Raquel Urtasun
Title: Perceive, Predict, and Plan: Safe Motion Planning Through Interpretable Semantic Representations
Abstract:
In this paper we propose a novel end-to-end learnable network that performs joint perception, prediction and motion planning for self-driving vehicles and produces interpretable intermediate representations. Unlike existing neural motion planners, our motion planning costs are consistent with our perception and prediction estimates. This is achieved by a novel differentiable semantic occupancy representation that is explicitly used as cost by the motion planning process. Our network is learned end-to-end from human demonstration. Our experiments in a large-scale manual-driving dataset and closed-loop simulation show that the proposed model significantly outperforms state-of-the-art planners in imitating the human behaviors while producing much safer trajectories."



Paperid:1035
Authors:Sangeek Hyun, Jae-Pil Heo
Title: VarSR: Variational Super-Resolution Network for Very Low Resolution Images
Abstract:
As is well known, single image super-resolution (SR) is an ill-posed problem where multiple high resolution (HR) images can be matched to one low resolution (LR) image due to the difference in their representation capabilities. Such many-to-one nature is particularly magnified when super-resolving with large upscaling factors from very low dimensional domains such as 8×8 resolution, where detailed HR information is hardly discoverable. Most existing methods are optimized for deterministic generation of SR images under pre-defined objectives such as pixel-level reconstruction and are thus limited to a one-to-one correspondence between LR and SR images, contrary to this nature. In this paper, we propose VarSR, a Variational Super-Resolution Network that matches latent distributions of LR and HR images to recover the missing details. Specifically, we draw samples from the learned common latent distribution of LR and HR to generate diverse SR images reflecting the many-to-one relationship. Experimental results validate that our method can produce more accurate and perceptually plausible SR images from very low resolutions compared to deterministic techniques."



Paperid:1036
Authors:Ashwin Raju, Chi-Tung Cheng, Yuankai Huo, Jinzheng Cai, Junzhou Huang, Jing Xiao, Le Lu, ChienHung Liao, Adam P. Harrison
Title: Co-Heterogeneous and Adaptive Segmentation from Multi-Source and Multi-Phase CT Imaging Data: A Study on Pathological Liver and Lesion Segmentation
Abstract:
Within medical imaging, organ/pathology segmentation models trained on current publicly available and fully-annotated datasets usually do not well represent the heterogeneous modalities, phases, pathologies, and clinical scenarios encountered in real environments. On the other hand, there are tremendous amounts of unlabelled patient imaging scans stored by many modern clinical centers. In this work, we present a novel segmentation strategy, co-heterogeneous and adaptive segmentation (CHASe), which only requires a small labeled cohort of single-phase data to adapt to any unlabeled cohort of heterogeneous multi-phase data with possibly new clinical scenarios and pathologies. To do this, we develop a versatile framework that fuses appearance-based semi-supervision, mask-based adversarial domain adaptation, and pseudo-labeling. We also introduce co-heterogeneous training, which is a novel integration of co-training and hetero-modality learning. We evaluate CHASe using a clinically comprehensive and challenging dataset of multi-phase computed tomography (CT) imaging studies (1147 patients and 4577 3D volumes), with a test set of 100 patients. Compared to previous state-of-the-art baselines, CHASe can further improve pathological liver mask Dice-Sørensen coefficients by 4.2% to 9.4%, depending on the phase combination, e.g., from 84.6% to 94.0% on non-contrast CTs."



Paperid:1037
Authors:Massimiliano Mancini, Zeynep Akata, Elisa Ricci, Barbara Caputo
Title: Towards Recognizing Unseen Categories in Unseen Domains
Abstract:
Current deep visual recognition systems suffer from severe performance degradation when they encounter new images from classes and scenarios unseen during training. Hence, the core challenge of Zero-Shot Learning (ZSL) is to cope with the semantic shift, whereas the main challenge of Domain Adaptation and Domain Generalization (DG) is the domain shift. While ZSL and DG have historically been tackled in isolation, this work pursues the ambitious goal of solving them jointly, i.e. by recognizing unseen visual concepts in unseen domains. We present CuMix (Curriculum Mixup for recognizing unseen categories in unseen domains), a holistic algorithm to tackle ZSL, DG and ZSL+DG. The key idea of CuMix is to simulate the test-time domain and semantic shift using images and features from unseen domains and categories generated by mixing up the multiple source domains and categories available during training. Moreover, a curriculum-based mixing policy is devised to generate increasingly complex training samples. Results on standard ZSL and DG datasets and on ZSL+DG using the DomainNet benchmark demonstrate the effectiveness of our approach."
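As a rough illustration of the mixing operation (a sketch under assumptions, not the paper's exact training recipe), the core step is a mixup of samples drawn from different source domains and categories, with the mixing strength ramped up over training; `curriculum_alpha` below is a hypothetical schedule.

```python
import torch

def mix_sources(x_a, y_a, x_b, y_b, alpha):
    """Mixup of two batches coming from different source domains/categories.

    Returns the mixed inputs together with both label sets and the mixing
    coefficient, so a loss such as
        lam * CE(out, y_a) + (1 - lam) * CE(out, y_b)
    can be applied to the network output on x_mix.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_mix = lam * x_a + (1.0 - lam) * x_b
    return x_mix, y_a, y_b, lam

def curriculum_alpha(epoch, total_epochs, alpha_min=0.1, alpha_max=2.0):
    """Hypothetical curriculum: mixing strength ramps up as training progresses."""
    t = min(1.0, epoch / max(1, total_epochs // 2))
    return alpha_min + t * (alpha_max - alpha_min)
```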



Paperid:1038
Authors:Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion, Matthias Hein
Title: Square Attack: a query-efficient black-box adversarial attack via random search
Abstract:
We propose the Square Attack, a score-based black-box $l_2$- and $l_\infty$- adversarial attack that does not rely on local gradient information and thus is not affected by gradient masking. Square Attack is based on a randomized search scheme which selects localized square-shaped updates at random positions so that at each iteration the perturbation is situated approximately at the boundary of the feasible set. Our method is significantly more query efficient and achieves a higher success rate compared to the state-of-the-art methods, especially in the untargeted setting. In particular, on ImageNet we improve the average query efficiency in the untargeted setting for various deep networks by a factor of at least $1.8$ and up to $3$ compared to the recent state-of-the-art $l_\infty$-attack of Al-Dujaili & O’Reilly [2]. Moreover, although our attack is black-box, it can also outperform gradient-based white-box attacks on the standard benchmarks achieving a new state-of-the-art in terms of the success rate."
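For readers unfamiliar with this attack family, the sketch below gives a minimal NumPy version of an untargeted l_inf random search with square-shaped updates in the spirit of the abstract. The stripe initialization, the square-size schedule, and `loss_fn` (a score-only oracle, e.g. a margin loss to be maximized) are simplified stand-ins rather than the authors' exact settings.

```python
import numpy as np

def square_attack_linf(x, y, loss_fn, eps=8/255, n_iters=1000, p_init=0.1, rng=None):
    """Score-based l_inf random search with square-shaped updates (simplified sketch).

    x: input image, shape (H, W, C), values in [0, 1]
    y: true label (untargeted setting)
    loss_fn: callable(images, label) -> scalar loss to maximize; only its value is
             queried, no gradients are used.
    """
    rng = rng or np.random.default_rng(0)
    h, w, c = x.shape
    # Initialize with vertical +-eps stripes (a common starting point).
    delta = rng.choice([-eps, eps], size=(1, w, c)) * np.ones((h, 1, 1))
    x_adv = np.clip(x + delta, 0, 1)
    best_loss = loss_fn(x_adv[None], y)

    for it in range(n_iters):
        # Fraction of pixels modified per step shrinks over the run.
        p = p_init * max(0.05, 1 - it / n_iters)
        s = min(h, w, max(1, int(round(np.sqrt(p * h * w)))))   # square side length
        r = rng.integers(0, h - s + 1)
        cpos = rng.integers(0, w - s + 1)

        cand = x_adv.copy()
        # Overwrite one square window with a fresh +-eps perturbation per channel,
        # so the candidate stays near the boundary of the feasible l_inf ball.
        new_patch = x[r:r+s, cpos:cpos+s] + rng.choice([-eps, eps], size=(1, 1, c))
        cand[r:r+s, cpos:cpos+s] = np.clip(new_patch, 0, 1)

        cand_loss = loss_fn(cand[None], y)
        if cand_loss > best_loss:                               # greedy acceptance
            x_adv, best_loss = cand, cand_loss
    return x_adv
```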



Paperid:1039
Authors:Noe Samano, Mengjie Zhou, Andrew Calway
Title: You Are Here: Geolocation by Embedding Maps and Images
Abstract:
We present a novel approach to geolocalising panoramic images on a 2-D cartographic map based on learning a low dimensional embedded space, which allows a comparison between an image captured at a location and local neighbourhoods of the map. The representation is not sufficiently discriminatory to allow localisation from a single image, but when concatenated along a route, localisation converges quickly, with over 90% accuracy being achieved for routes of around 200m in length when using Google Street View and Open Street Map data. The method generalises a previous fixed semantic feature based approach and achieves significantly higher localisation accuracy and faster convergence."



Paperid:1040
Authors:Yang He, Shadi Rahimian, Bernt Schiele, Mario Fritz
Title: Segmentations-Leak: Membership Inference Attacks and Defenses in Semantic Image Segmentation
Abstract:
Today's success of state-of-the-art methods for semantic segmentation is driven by large datasets. Data is considered an important asset that needs to be protected, as the collection and annotation of such datasets comes at significant effort and associated cost. In addition, visual data might contain private or sensitive information, which makes it equally unsuited for public release. Unfortunately, recent work on membership inference in the broader area of adversarial machine learning and inference attacks on machine learning models has shown that even black-box classifiers leak information about the dataset they were trained on. We present the first attacks and defenses for complex, state-of-the-art models for semantic segmentation. In order to mitigate the associated risks, we also study a series of defenses against such membership inference attacks and find effective countermeasures against the existing risks. Finally, we extensively evaluate our attacks and defenses on a range of relevant real-world datasets: Cityscapes, BDD100K, and Mapillary Vistas."



Paperid:1041
Authors:Jesse Scott, Bharadwaj Ravichandran, Christopher Funk, Robert T. Collins, Yanxi Liu
Title: From Image to Stability: Learning Dynamics from Human Pose
Abstract:
We propose and validate two end-to-end deep learning architectures to learn foot pressure distribution maps (dynamics) from 2D or 3D human pose (kinematics). The networks are trained using 1.36 million synchronized pose+pressure data pairs from 10 subjects performing multiple takes of a 5-minute long choreographed Taiji sequence. Using leave-one-subject-out cross validation, we demonstrate reliable and repeatable foot pressure prediction, setting the first baseline for solving a non-obvious pose to pressure cross-modality mapping problem in computer vision. Furthermore, we compute and quantitatively validate Center of Pressure (CoP) and Base of Support (BoS), two key components for stability analysis, from the predicted foot pressure distributions."
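Of the two stability quantities, the Center of Pressure has a particularly simple definition: the pressure-weighted centroid of the foot-pressure map. A minimal sketch, assuming the predicted pressure is a non-negative 2D array on a regular grid with a placeholder cell size:

```python
import numpy as np

def center_of_pressure(pressure, cell_size=1.0):
    """Center of Pressure as the pressure-weighted centroid of a 2D pressure map.

    pressure:  (H, W) non-negative pressure values on a regular grid
    cell_size: physical size of one grid cell (same unit for both axes)
    """
    total = pressure.sum()
    if total <= 0:
        return None  # no contact detected
    ys, xs = np.mgrid[0:pressure.shape[0], 0:pressure.shape[1]]
    cop_y = (ys * pressure).sum() / total * cell_size
    cop_x = (xs * pressure).sum() / total * cell_size
    return cop_x, cop_y
```

Base of Support requires the contact region's convex hull and is therefore a little more involved, but it is likewise computed directly from the predicted pressure map.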



Paperid:1042
Authors:Namdar Homayounfar, Yuwen Xiong, Justin Liang, Wei-Chiu Ma, Raquel Urtasun
Title: LevelSet R-CNN: A Deep Variational Method for Instance Segmentation
Abstract:
Obtaining precise instance segmentation masks is of high importance in many modern applications such as robotic manipulation and autonomous driving. Currently, many state of the art models are based on the Mask R-CNN framework which, while very powerful, outputs masks at low resolutions which could result in imprecise boundaries. On the other hand, classic variational methods for segmentation impose desirable global and local data and geometry constraints on the masks by optimizing an energy functional. While mathematically elegant, their direct dependence on good initialization, non-robust image cues and manual setting of hyperparameters renders them unsuitable for modern applications. We propose LevelSet R-CNN, which combines the best of both worlds by obtaining powerful feature representations that are combined in an end-to-end manner with a variational segmentation framework. We demonstrate the effectiveness of our approach on COCO and Cityscapes datasets. "



Paperid:1043
Authors:Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Yin Cui, Mingxing Tan, Quoc Le, Xiaodan Song
Title: Efficient Scale-Permuted Backbone with Learned Resource Distribution
Abstract:
Recently, SpineNet has demonstrated promising results on object detection and image classification over the ResNet model. However, it is unclear whether the improvement adds up when combining a scale-permuted backbone with advanced efficient operations and compound scaling. Furthermore, SpineNet is built with a uniform resource distribution over operations. While this strategy seems to be prevalent for scale-decreased models, it may not be an optimal design for scale-permuted models. In this work, we propose a simple technique to combine efficient operations and compound scaling with a previously learned scale-permuted architecture. We demonstrate that the efficiency of scale-permuted models can be further improved by learning a resource distribution over the entire network. The resulting efficient scale-permuted models outperform state-of-the-art EfficientNet-based models on object detection and achieve competitive performance on image classification and semantic segmentation."



Paperid:1044
Authors:Jian Gao, Yang Hua, Guosheng Hu, Chi Wang, Neil M. Robertson
Title: Reducing Distributional Uncertainty by Mutual Information Maximisation and Transferable Feature Learning
Abstract:
Distributional uncertainty exists broadly in many real-world applications, one form of which is domain discrepancy. Yet in the existing literature, a mathematical definition of it is missing. In this paper, we propose to formulate the distributional uncertainty both between the source(s) and target domain(s) and within each domain using mutual information. Further, to reduce distributional uncertainty (e.g., domain discrepancy), we (1) maximise the mutual information between source and target domains and (2) propose a transferable feature learning scheme, balancing two complementary and discriminative feature learning processes (general texture learning and self-supervised transferable shape learning) according to the uncertainty. We conduct extensive experiments on both domain adaptation and domain generalisation using challenging common benchmarks: Office-Home and DomainNet. Results show the great effectiveness of the proposed method and its superiority over state-of-the-art methods. "



Paperid:1045
Authors:Alireza Zareian, Svebor Karaman, Shih-Fu Chang
Title: Bridging Knowledge Graphs to Generate Scene Graphs
Abstract:
Scene graphs are powerful representations that parse images into their abstract semantic elements, i.e., objects and their interactions, which facilitates visual comprehension and explainable reasoning. On the other hand, commonsense knowledge graphs are rich repositories that encode how the world is structured, and how general concepts interact. In this paper, we present a unified formulation of these two constructs, where a scene graph is seen as an image-conditioned instantiation of a commonsense knowledge graph. Based on this new perspective, we re-formulate scene graph generation as the inference of a bridge between the scene and commonsense graphs, where each entity or predicate instance in the scene graph has to be linked to its corresponding entity or predicate class in the commonsense graph. To this end, we propose a novel graph-based neural network that iteratively propagates information between the two graphs, as well as within each of them, while gradually refining their bridge in each iteration. Our Graph Bridging Network, GB-Net, successively infers edges and nodes, allowing to simultaneously exploit and refine the rich, heterogeneous structure of the interconnected scene and commonsense graphs. Through extensive experimentation, we showcase the superior accuracy of GB-Net compared to the most recent methods, resulting in a new state of the art. We publicly release the source code of our method."



Paperid:1046
Authors:Sergio Casas, Cole Gulino, Simon Suo, Katie Luo, Renjie Liao, Raquel Urtasun
Title: Implicit Latent Variable Model for Scene-Consistent Motion Forecasting
Abstract:
To achieve safe and proactive self-driving, an autonomous vehicle must accurately perceive its environment, and understand the interactions among traffic participants. In this paper, we aim to learn scene-consistent motion forecasts of complex urban traffic directly from sensor data. In particular, we propose to characterize the joint distribution over future trajectories via an implicit latent variable model. We model the scene as an interaction graph and employ powerful graph neural networks to learn a distributed latent representation of the scene. Coupled with a deterministic decoder, we obtain trajectory samples that are consistent across traffic participants, achieving state-of-the-art results in motion forecasting and interaction understanding. Last but not least, we demonstrate that our motion forecasts result in safer and more comfortable ego-motion planning."



Paperid:1047
Authors:Alireza Zareian, Zhecan Wang, Haoxuan You, Shih-Fu Chang
Title: Learning Visual Commonsense for Robust Scene Graph Generation
Abstract:
Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild. Perception errors often lead to nonsensical compositions in the output scene graph, which do not follow real-world rules and patterns, and can be corrected using commonsense knowledge. We propose the first method to acquire visual commonsense such as affordance and intuitive physics automatically from data, and use that to improve the robustness of scene understanding. To this end, we extend Transformer models to incorporate the structure of scene graphs, and train our Global-Local Attention Transformer on a scene graph corpus. Once trained, our model can be applied on any scene graph generation model and correct its obvious mistakes, resulting in more semantically plausible scene graphs. Through extensive experiments, we show our model learns commonsense better than any alternative, and improves the accuracy of state-of-the-art scene graph generation methods."



Paperid:1048
Authors:Nicolás Astorga, Pablo Huijse, Pavlos Protopapas, Pablo Estévez
Title: MPCC: Matching Priors and Conditionals for Clustering
Abstract:
Clustering is a fundamental task in unsupervised learning that depends heavily on the data representation that is used. Deep generative models have appeared as a promising tool to learn informative low-dimensional data representations. We propose Matching Priors and Conditionals for Clustering (MPCC), a GAN-based model with an encoder to infer latent variables and cluster categories from data, and a flexible decoder to generate samples from a conditional latent space. With MPCC we demonstrate that a deep generative model can be competitive with or superior to discriminative methods in clustering tasks, surpassing the state of the art over a diverse set of benchmark datasets. Our experiments show that adding a learnable prior and augmenting the number of encoder updates improve the quality of the generated samples, obtaining an inception score of $9.49 \pm 0.15$ and improving the Fréchet inception distance over the state of the art by $46.9\%$ on CIFAR10."



Paperid:1049
Authors:Yiqin Zhao, Tian Guo
Title: PointAR: Efficient Lighting Estimation for Mobile Augmented Reality
Abstract:
We propose an efficient lighting estimation pipeline that is suitable to run on modern mobile devices, with resource complexities comparable to state-of-the-art mobile deep learning models. Our pipeline, PointAR, takes a single RGB-D image captured from the mobile camera and a 2D location in that image, and estimates 2nd order spherical harmonics coefficients. These estimated spherical harmonics coefficients can be directly utilized by rendering engines to support spatially variant indoor lighting in the context of augmented reality. Our key insight is to formulate lighting estimation as a learning problem directly on point clouds, which is in part inspired by the Monte Carlo integration leveraged by real-time spherical harmonics lighting. While existing approaches estimate lighting information with complex deep learning pipelines, our method focuses on reducing the computational complexity. Through both quantitative and qualitative experiments, we demonstrate that PointAR achieves lower lighting estimation errors compared to state-of-the-art methods. Further, our method requires an order of magnitude fewer resources, comparable to mobile-specific DNNs."
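For context on what second-order spherical harmonics coefficients provide a renderer: nine coefficients per color channel suffice to evaluate low-frequency lighting in any direction. The sketch below uses the standard real SH basis constants (not taken from the paper) and, for simplicity, omits the cosine-lobe convolution that a full diffuse irradiance computation would apply.

```python
import numpy as np

def sh_basis_order2(n):
    """Real spherical harmonics basis up to l=2 for unit direction(s) n, shape (..., 3).

    Returns an array of shape (..., 9), ordered (l, m) =
    (0,0), (1,-1), (1,0), (1,1), (2,-2), (2,-1), (2,0), (2,1), (2,2).
    """
    x, y, z = n[..., 0], n[..., 1], n[..., 2]
    return np.stack([
        0.282095 * np.ones_like(x),
        0.488603 * y,
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,
        1.092548 * y * z,
        0.315392 * (3.0 * z ** 2 - 1.0),
        1.092548 * x * z,
        0.546274 * (x ** 2 - y ** 2),
    ], axis=-1)

def evaluate_sh_lighting(normals, sh_coeffs):
    """Evaluate SH-encoded lighting in the direction of each surface normal.

    normals:   (N, 3) unit normals
    sh_coeffs: (9, 3) estimated coefficients for the R, G, B channels
    """
    basis = sh_basis_order2(normals)             # (N, 9)
    return np.clip(basis @ sh_coeffs, 0, None)   # (N, 3) per-normal lighting
```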



Paperid:1050
Authors:Roman Klokov, Edmond Boyer, Jakob Verbeek
Title: Discrete Point Flow Networks for Efficient Point Cloud Generation
Abstract:
Generative models have proven effective at modeling 3D shapes and their statistical variations. In this paper we investigate their application to point clouds, a 3D shape representation widely used in computer vision for which, however, only few generative models have yet been proposed. We introduce a latent variable model that builds on normalizing flows with affine coupling layers to generate 3D point clouds of an arbitrary size given a latent shape representation. To evaluate its benefits for shape modeling we apply this model for generation, autoencoding, and single-view shape reconstruction tasks. We improve over recent GAN-based models in terms of most metrics that assess generation and autoencoding. Compared to recent work based on continuous flows, our model offers a significant speedup in both training and inference times for similar or better performance. For single-view shape reconstruction we also obtain results on par with state-of-the-art voxel, point cloud, and mesh-based methods."



Paperid:1051
Authors:Zhuoning Yuan, Zhishuai Guo, Xiaotian Yu, Xiaoyu Wang, Tianbao Yang
Title: Accelerating Deep Learning with Millions of Classes
Abstract:
Deep learning has achieved remarkable success in many classification tasks because of its great power of representation learning for complex data. However, it remains challenging when extending to classification tasks with millions of classes. Previous studies have focused on solving this problem in a distributed fashion or using a sampling-based approach to reduce the computational cost caused by the softmax layer. However, these approaches still need high GPU memory in order to work with large models, and it is non-trivial to extend them to parallel settings. To address these issues, we propose an efficient training framework to handle extreme classification tasks based on Random Projection. The key idea is that we first train a slimmed model with a randomly projected softmax classifier and then recover it to the original version. We also show a theoretical guarantee that this recovered classifier can approximate the original classifier with a small error. Later, we extend our framework to parallel scenarios by adopting a communication reduction technique. In our experiments, we demonstrate that the proposed framework is able to train deep learning models with millions of classes and achieve above 10× speedup compared to existing approaches."



Paperid:1052
Authors:Xiuye Gu, Weixin Luo, Michael S. Ryoo, Yong Jae Lee
Title: Password-conditioned Anonymization and Deanonymization with Face Identity Transformers
Abstract:
Cameras are prevalent in our daily lives, and enable many useful systems built upon computer vision technologies such as smart cameras and home robots for service applications. However, there is also an increasing societal concern as the captured images/videos may contain privacy-sensitive information (e.g., face identity). We propose a novel face identity transformer which enables automated photo-realistic password-based anonymization and deanonymization of human faces appearing in visual data. Our face identity transformer is trained to (1) remove face identity information after anonymization, (2) recover the original face when given the correct password, and (3) return a wrong---but photo-realistic---face given a wrong password. With our carefully designed password scheme and multi-task learning objective, we achieve both anonymization and deanonymization using the same single network. Extensive experiments show that our method enables multimodal password conditioned anonymizations and deanonymizations, with superior privacy protection compared to existing anonymization methods."



Paperid:1053
Authors:Sizhuo Ma, Mohit Gupta
Title: Inertial Safety from Structured Light
Abstract:
We present inertial safety maps (ISM), a novel scene representation designed for fast detection of obstacles in scenarios involving camera or scene motion, such as robot navigation and human-robot interaction. ISM is a motion-centric representation that encodes both scene geometry and motion; different camera motion results in different ISMs for the same scene. We show that ISM can be estimated with a two-camera stereo setup without explicitly recovering scene depths, by measuring differential changes in disparity over time. We develop an active, single-shot structured light-based approach for robustly measuring ISM in challenging scenarios with textureless objects and complex geometries. The proposed approach is computationally light-weight, and can detect intricate obstacles (e.g., thin wire fences) by processing high-resolution images at high-speeds with limited computational resources. ISM can be readily integrated with depth and range maps as a complementary scene representation, potentially enabling high-speed navigation and robotic manipulation in extreme environments, with minimal device complexity."



Paperid:1054
Authors:Nicholas Sharp, Maks Ovsjanikov
Title: PointTriNet: Learned Triangulation of 3D Point Sets
Abstract:
This work considers a new task in geometric deep learning: generating a triangulation among a set of points in 3D space. We present PointTriNet, a differentiable and scalable approach enabling point set triangulation as a layer in 3D learning pipelines. The method iteratively applies two neural networks: a classification network predicts whether a candidate triangle should appear in the triangulation, while a proposal network suggests additional candidates. Both networks are structured as PointNets over nearby points and triangles, using a novel triangle-relative input encoding. Since these learning problems operate on local geometric data, our method is efficient and scalable, and generalizes to unseen shape categories. Our networks are trained in an unsupervised manner from a collection of shapes represented as point clouds. We demonstrate the effectiveness of this approach for classical meshing tasks, robustness to outliers, and as a component in end-to-end learning systems."



Paperid:1055
Authors:Huy V. Vo, Patrick Pérez, Jean Ponce
Title: Toward Unsupervised, Multi-Object Discovery in Large-Scale Image Collections
Abstract:
This paper addresses the problem of discovering the objects present in a collection of images without any supervision. We build on the optimization approach of Vo et al. [34] with several key novelties: (1) We propose a novel saliency-based region proposal algorithm that achieves significantly higher overlap with ground-truth objects than other competitive methods. This procedure leverages off-the-shelf CNN features trained on classification tasks without any bounding box information, but is otherwise unsupervised. (2) We exploit the inherent hierarchical structure of proposals as an effective regularizer for the approach to object discovery of [34], boosting its performance to significantly improve over the state of the art on several standard benchmarks. (3) We adopt a two-stage strategy to select promising proposals using small random sets of images before using the whole image collection to discover the objects it depicts, allowing us to tackle, for the first time (to the best of our knowledge), the discovery of multiple objects in each one of the pictures making up datasets with up to 20,000 images, an over five-fold increase compared to existing methods, and a first step toward true large-scale unsupervised image interpretation. "



Paperid:1056
Authors:Zhenbo Song, Wayne Chen, Dylan Campbell, Hongdong Li
Title: Deep Novel View Synthesis from Colored 3D Point Clouds
Abstract:
We propose a new deep neural network which takes a colored 3D point cloud of a scene, and directly synthesizes a photo-realistic image from an arbitrary viewpoint. Key contributions of this work include a deep point feature extraction module, an image synthesis module, and an image refinement module. Our PointEncoder network extracts discriminative features from the point cloud that contain both local and global contextual information about the scene. Next, the multi-level point features are aggregated to form multi-layer feature maps. These are fed into our ImageDecoder network in order to generate a synthetic RGB image. Finally, the coarse output of the ImageDecoder network is refined using our RefineNet module, supplying more fine details and suppressing unwanted visual artifacts. To generate virtual camera viewpoints in the scene, we rotate and translate the 3D point cloud in order to synthesize new images from novel perspectives. We conduct numerous experiments on public datasets to validate our method with respect to the quality of the synthesized views, and significantly outperform the state of the art."



Paperid:1057
Authors:Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, Lin Ma
Title: Consensus-Aware Visual-Semantic Embedding for Image-Text Matching
Abstract:
Image-text matching plays a central role in bridging vision and language. Most existing approaches only rely on the image-text instance pair to learn their representations, thereby exploiting their matching relationships and making the corresponding alignments. Such approaches only exploit the superficial associations contained in the instance pairwise data, with no consideration of any external commonsense knowledge, which may hinder their capabilities to reason about higher-level relationships between image and text. In this paper, we propose a Consensus-aware Visual-Semantic Embedding (CVSE) model to incorporate the consensus information, namely the commonsense knowledge shared between both modalities, into image-text matching. Specifically, the consensus information is exploited by computing the statistical co-occurrence correlations between the semantic concepts from the image captioning corpus and deploying the constructed concept correlation graph to yield the consensus-aware concept (CAC) representations. Afterwards, CVSE learns the associations and alignments between image and text based on the exploited consensus as well as the instance-level representations for both modalities. Extensive experiments conducted on two public datasets verify that the exploited consensus makes significant contributions to constructing more meaningful visual-semantic embeddings, achieving superior performance over state-of-the-art approaches on the bidirectional image and text retrieval task. Our code of this paper is available at: https://github.com/BruceW91/CVSE."



Paperid:1058
Authors:Guanting Dong, Yueyi Zhang, Zhiwei Xiong
Title: Spatial Hierarchy Aware Residual Pyramid Network for Time-of-Flight Depth Denoising
Abstract:
Time-of-Flight (ToF) sensors have been increasingly used on mobile devices for depth sensing. However, the existence of noise, such as Multi-Path Interference (MPI) and shot noise, degrades the ToF imaging quality. Previous CNN-based methods remove ToF depth noise without considering the spatial hierarchical structure of the scene, which leads to failures in obtaining high-quality depth images from complex scenes. In this paper, we propose a Spatial Hierarchy Aware Residual Pyramid Network, called SHARP-Net, to remove the depth noise by fully exploiting the geometry information of the scene on different scales. SHARP-Net first introduces a Residual Regression Module, which utilizes the depth images and amplitude images as the input, to calculate the depth residual progressively. Then, a Residual Fusion Module, summing over depth residuals from all scales, is introduced to fuse multi-scale geometry information. Finally, shot noise is further eliminated by a Kernel Prediction Network. Experimental results demonstrate that our method significantly outperforms state-of-the-art ToF depth denoising methods on both synthetic and realistic datasets. The source code is available at https://github.com/ashesknight/tof-mpi-remove ."



Paperid:1059
Authors:Songtao He, Favyen Bastani, Satvat Jagwani, Mohammad Alizadeh, Hari Balakrishnan, Sanjay Chawla, Mohamed M. Elshrif, Samuel Madden, Mohammad Amin Sadeghi
Title: Sat2Graph: Road Graph Extraction through Graph-Tensor Encoding
Abstract:
Inferring road graphs from satellite imagery is a challenging computer vision task. Prior solutions fall into two categories: (1) pixel-wise segmentation-based approaches, which predict whether each pixel is on a road, and (2) graph-based approaches, which predict the road graph iteratively. We find that these two approaches have complementary strengths while suffering from their own inherent limitations. In this paper, we propose a new method, Sat2Graph, which combines the advantages of the two prior categories into a unified framework. The key idea in Sat2Graph is a novel encoding scheme, graph-tensor encoding (GTE), which encodes the road graph into a tensor representation. GTE makes it possible to train a simple, non-recurrent, supervised model to predict a rich set of features that capture the graph structure directly from an image. We evaluate Sat2Graph using two large datasets. We find that Sat2Graph surpasses prior methods on two widely used metrics, TOPO and APLS. Furthermore, whereas prior work only infers planar road graphs, our approach is capable of inferring stacked roads (e.g., overpasses), and does so robustly."



Paperid:1060
Authors:Di Hu, Xuhong Li, Lichao Mou, Pu Jin, Dong Chen, Liping Jing, Xiaoxiang Zhu, Dejing Dou
Title: Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition
Abstract:
Aerial scene recognition is a fundamental task in remote sensing and has recently received increased interest. While visual information from overhead images, combined with powerful models and efficient algorithms, yields considerable scene recognition performance, it still suffers from variations in ground objects, lighting conditions, etc. Inspired by the multi-channel perception theory in cognitive science, in this paper, to improve performance on aerial scene recognition, we explore a novel audiovisual aerial scene recognition task using both images and sounds as input. Based on the observation that some specific sound events are more likely to be heard at a given geographic location, we propose to exploit the knowledge from sound events to improve the performance of aerial scene recognition. For this purpose, we have constructed a new dataset named AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE). With the help of this dataset, we evaluate three proposed approaches for transferring the sound event knowledge to the aerial scene recognition task in a multimodal learning framework, and show the benefit of exploiting the audio information for aerial scene recognition. The source code is publicly available for reproducibility purposes."



Paperid:1061
Authors:Jinyu Zhao, Yusuke Monno, Masatoshi Okutomi
Title: Polarimetric Multi-View Inverse Rendering
Abstract:
A polarization camera has great potential for 3D reconstruction since the angle of polarization (AoP) of reflected light is related to an object's surface normal. In this paper, we propose a novel 3D reconstruction method called Polarimetric Multi-View Inverse Rendering (Polarimetric MVIR) that effectively exploits geometric, photometric, and polarimetric cues extracted from input multi-view color polarization images. We first estimate camera poses and an initial 3D model by geometric reconstruction with a standard structure-from-motion and multi-view stereo pipeline. We then refine the initial model by optimizing photometric and polarimetric rendering errors using multi-view RGB and AoP images, where we propose a novel polarimetric rendering cost function that enables us to effectively constrain each estimated surface vertex's normal while considering four possible ambiguous azimuth angles revealed from the AoP measurement. Experimental results using both synthetic and real data demonstrate that our Polarimetric MVIR can reconstruct a detailed 3D shape without assuming a specific polarized reflection depending on the material."
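The four ambiguous azimuth angles arise from the usual pi-ambiguity of the AoP combined with the 90-degree shift between diffuse and specular reflection. A small helper that enumerates these candidates (a generic formulation, not the paper's cost function) might look like:

```python
import numpy as np

def candidate_azimuths(aop):
    """Enumerate the four surface-normal azimuth candidates implied by an AoP value.

    aop: angle of polarization in radians (defined modulo pi). The pi-ambiguity
    together with the 90-degree diffuse/specular shift yields four candidates
    modulo 2*pi.
    """
    aop = np.asarray(aop, dtype=float)
    offsets = np.array([0.0, 0.5 * np.pi, np.pi, 1.5 * np.pi])
    return (aop[..., None] + offsets) % (2.0 * np.pi)
```

A polarimetric rendering cost of the kind described above can then score each vertex normal against whichever candidate azimuth it is closest to, rather than committing to one interpretation up front.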



Paperid:1062
Authors:Jing Yu Koh, Duc Thanh Nguyen, Quang-Trung Truong, Sai-Kit Yeung, Alexander Binder
Title: SideInfNet: A Deep Neural Network for Semi-Automatic Semantic Segmentation with Side Information
Abstract:
Fully-automatic execution is the ultimate goal for many Computer Vision applications. However, this objective is not always realistic in tasks associated with high failure costs, such as medical applications. For these tasks, semi-automatic methods allowing minimal effort from users to guide computer algorithms are often preferred due to desirable accuracy and performance. Inspired by the practicality and applicability of the semi-automatic approach, this paper proposes a novel deep neural network architecture, namely SideInfNet that effectively integrates features learnt from images with side information extracted from user annotations. To evaluate our method, we applied the proposed network to three semantic segmentation tasks and conducted extensive experiments on benchmark datasets. Experimental results and comparison with prior work have verified the superiority of our model, suggesting the generality and effectiveness of the model in semi-automatic semantic segmentation. "



Paperid:1063
Authors:Aruni RoyChowdhury, Xiang Yu, Kihyuk Sohn, Erik Learned-Miller, Manmohan Chandraker
Title: Improving Face Recognition by Clustering Unlabeled Faces in the Wild
Abstract:
While deep face recognition has benefited significantly from large-scale labeled data, current research is focused on leveraging unlabeled data to further boost performance, reducing the cost of human annotation. Prior work has mostly been in controlled settings, where the labeled and unlabeled data sets have no overlapping identities by construction. This is not realistic in large-scale face recognition, where one must contend with such overlaps, the frequency of which increases with the volume of data. Ignoring identity overlap leads to significant labeling noise, as data from the same identity is split into multiple clusters. To address this, we propose a novel identity separation method based on extreme value theory. It is formulated as an out-of-distribution detection algorithm, and greatly reduces the problems caused by overlapping-identity label noise. Considering cluster assignments as pseudo-labels, we must also overcome the labeling noise from clustering errors. We propose a modulation of the cosine loss, where the modulation weights correspond to an estimate of clustering uncertainty. Extensive experiments on both controlled and real settings demonstrate our method's consistent improvements over supervised baselines, e.g., 11.6% improvement on IJB-A verification."



Paperid:1064
Authors:Pulak Purkait, Tat-Jun Chin, Ian Reid
Title: NeuRoRA: Neural Robust Rotation Averaging
Abstract:
Multiple rotation averaging is an essential task for structure from motion, mapping, and robot navigation. The task is to estimate the absolute orientations of several cameras given some of their noisy relative orientation measurements. The conventional methods for this task seek parameters of the absolute orientations that agree best with the observed noisy measurements according to a robust cost function. These robust cost functions are highly nonlinear and are designed based on certain assumptions about the noise and outlier distributions. In this work, we aim to build a neural network that learns the noise patterns from the data and predict/regress the model parameters from the noisy relative orientations. The proposed network is a combination of two networks: (1) a view-graph cleaning network, which detects outlier edges in the view-graph and rectifies noisy measurements; and (2) a fine-tuning network, which fine-tunes an initialization of absolute orientations bootstrapped from the cleaned graph, in a single step. The proposed combined network is very fast, moreover, being trained on a large number of synthetic graphs, it is more accurate than the conventional iterative optimization methods. Although the idea of replacing robust optimization methods by a graph-based network is demonstrated only for multiple rotation averaging, it could easily be extended to other graph-based geometric problems, for example, pose-graph optimization. "
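For readers less familiar with rotation averaging, the quantity that both robust optimizers and a learned network try to shrink is the discrepancy between each measured relative rotation and the one implied by the current absolute orientations. A minimal quaternion sketch, assuming the convention q_ij ~ q_j * conj(q_i) (conventions differ across papers):

```python
import numpy as np

def quat_mul(q1, q2):
    """Hamilton product of quaternions given as (w, x, y, z)."""
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_conj(q):
    """Conjugate (= inverse for unit quaternions)."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def relative_rotation_error(q_i, q_j, q_ij_meas):
    """Angular discrepancy (radians) between the measured relative rotation and
    the one implied by the absolute orientations: q_ij ~ q_j * conj(q_i)."""
    q_pred = quat_mul(q_j, quat_conj(q_i))
    d = quat_mul(q_ij_meas, quat_conj(q_pred))
    return 2.0 * np.arccos(np.clip(abs(d[0]), 0.0, 1.0))
```

A robust cost function sums a bounded penalty of these per-edge errors over the view-graph; the network described above instead learns to predict orientations that keep them small while down-weighting outlier edges.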



Paperid:1065
Authors:Pulak Purkait, Christopher Zach, Ian Reid
Title: SG-VAE: Scene Grammar Variational Autoencoder to generate new indoor scenes
Abstract:
Deep generative models have been used in recent years to learn coherent latent representations in order to synthesize high-quality images. In this work, we propose a neural network to learn a generative model for sampling consistent indoor scene layouts. Our method learns the co-occurrences, and appearance parameters such as shape and pose, for different object categories through a grammar-based auto-encoder, resulting in a compact and accurate representation for scene layouts. In contrast to existing grammar-based methods with a user-specified grammar, we construct the grammar automatically by extracting a set of production rules through reasoning about object co-occurrences in the training data. The extracted grammar is able to represent a scene by an augmented parse tree. The proposed auto-encoder encodes these parse trees to a latent code, and decodes the latent code to a parse tree, thereby ensuring the generated scene is always valid. We experimentally demonstrate that the proposed auto-encoder learns not only to generate valid scenes (i.e. the arrangements and appearances of objects), but also coherent latent representations where nearby latent samples decode to similar scene outputs. The obtained generative model is applicable to several computer vision tasks such as 3D pose and layout estimation from RGB-D data."



Paperid:1066
Authors:Woobin Im, Tae-Kyun Kim, Sung-Eui Yoon
Title: Unsupervised Learning of Optical Flow with Deep Feature Similarity
Abstract:
Deep unsupervised learning for optical flow has been proposed, where the loss measures image similarity with the warping function parameterized by estimated flow. The census transform, instead of raw pixel values, is often used for the image similarity. In this work, rather than handcrafted features, i.e., census or pixel values, we propose to use deep self-supervised features with a novel similarity measure that fuses multi-layer similarities. With the fused similarity, our network better learns flow by minimizing our proposed feature separation loss. The proposed method is a polarizing scheme, resulting in a more discriminative similarity map. In the process, the features are also updated to yield high similarity for matching pairs and low similarity for uncertain pairs, given estimated flow. We evaluate our method on the FlyingChairs, MPI Sintel, and KITTI benchmarks. In quantitative and qualitative comparisons, our method effectively improves on state-of-the-art techniques."



Paperid:1067
Authors:Xiaomei Zhang, Yingying Chen, Bingke Zhu, Jinqiao Wang, Ming Tang
Title: Blended Grammar Network for Human Parsing
Abstract:
Although human parsing has made great progress, it still faces a challenge, i.e., how to extract the whole foreground from similar or cluttered scenes effectively. In this paper, we propose a Blended Grammar Network (BGNet) to deal with this challenge. BGNet exploits the inherent hierarchical structure of a human body and the relationship of different human parts by means of grammar rules in both cascaded and parallel manners. In this way, conspicuous parts, which are easily distinguished from the background, can amend the segmentation of inconspicuous ones, improving foreground extraction. We also design a Part-aware Convolutional Recurrent Neural Network (PCRNN) to pass messages that are generated by grammar rules. To train PCRNNs effectively, we present a blended grammar loss to supervise their training. We conduct extensive experiments to evaluate BGNet on the PASCAL-Person-Part, LIP, and PPSS datasets. BGNet obtains state-of-the-art performance on these human parsing datasets."



Paperid:1068
Authors:Zehao Yu, Lei Jin, Shenghua Gao
Title: P²Net: Patch-match and Plane-regularization for Unsupervised Indoor Depth Estimation
Abstract:
This paper tackles the unsupervised depth estimation task in indoor environments. The task is extremely challenging because of the vast areas of non-texture regions in these scenes. These areas could overwhelm the optimization process in the commonly used unsupervised depth estimation framework proposed for outdoor environments. However, even when those regions are masked out, the performance is still unsatisfactory. In this paper, we argue that the poor performance stems from the non-discriminative point-based matching. To this end, we propose P$^2$Net. We first extract points with large local gradients and adopt patches centered at each point as its representation. A multiview consistency loss is then defined over patches. This operation significantly improves the robustness of network training. Furthermore, because those textureless regions in indoor scenes (e.g., wall, floor, roof, etc.) usually correspond to planar regions, we propose to leverage superpixels as a plane prior. We enforce the predicted depth to be well fitted by a plane within each superpixel. Extensive experiments on NYUv2 and ScanNet show that our P$^2$Net outperforms existing approaches by a large margin."



Paperid:1069
Authors:Van-Quang Nguyen, Masanori Suganuma, Takayuki Okatani
Title: Efficient Attention Mechanism for Visual Dialog that can Handle All the Interactions between Multiple Inputs
Abstract:
It has been a primary concern in recent studies of vision and language tasks to design an effective attention mechanism dealing with interactions between the two modalities. The Transformer has recently been extended and applied to several bi-modal tasks, yielding promising results. For visual dialog, it becomes necessary to consider interactions between three or more inputs, i.e., an image, a question, and a dialog history, or even its individual dialog components. In this paper, we present a neural architecture named Light-weight Transformer for Many Inputs (LTMI) that can efficiently deal with all the interactions between multiple such inputs in visual dialog. It has a block structure similar to the Transformer and employs the same design of attention computation, whereas it has only a small number of parameters, yet has sufficient representational power for the purpose. Assuming a standard setting of visual dialog, a layer built upon the proposed attention block has less than one-tenth the parameters of its counterpart, a natural Transformer extension. The experimental results on the VisDial datasets validate the effectiveness of the proposed approach, showing improvements of the best NDCG score on the VisDial v1.0 dataset from 57.59 to 60.92 with a single model, from 64.47 to 66.53 with ensemble models, and even to 74.88 with additional finetuning."



Paperid:1070
Authors:Xiyang Liu, Jie Yang, Wenrui Ding, Tieqiang Wang, Zhijin Wang, Junjun Xiong
Title: Adaptive Mixture Regression Network with Local Counting Map for Crowd Counting
Abstract:
The crowd counting task aims at estimating the number of people located in an image or a frame from videos. Existing methods widely adopt density maps as the training targets to optimize a point-to-point loss, while in the testing phase only the difference between the ground-truth crowd number and the global summation of the density map is evaluated, indicating an inconsistency between the training targets and the evaluation criteria. To solve this problem, we introduce a new target, named local counting map (LCM), to obtain more accurate results than density map based approaches. Moreover, we also propose an adaptive mixture regression framework with three modules in a coarse-to-fine manner to further improve the precision of the crowd estimation: a scale-aware module (SAM), a mixture regression module (MRM) and an adaptive soft interval module (ASIM). Specifically, SAM fully utilizes the context and multi-scale information from different convolutional features; MRM and ASIM perform more precise counting regression on local patches of images. Compared with current methods, the proposed method reports better performance on the typical datasets. The source code is available at https://github.com/xiyang1012/Local-Crowd-Counting ."
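For intuition on the local counting map target, here is a minimal sketch that converts a per-pixel density map into an LCM by summing densities over non-overlapping patches, so each cell stores the expected count inside its patch; the patch size, array layout, and the random stand-in density map are assumptions for illustration only.

# Hedged sketch: local counting map from a density map via block summation.
import numpy as np

def local_counting_map(density, patch=64):
    h, w = density.shape
    h_c, w_c = h // patch, w // patch
    cropped = density[:h_c * patch, :w_c * patch]
    blocks = cropped.reshape(h_c, patch, w_c, patch)
    return blocks.sum(axis=(1, 3))       # shape (h_c, w_c), one count per patch

density = np.random.rand(512, 512) * 1e-3     # stand-in for a predicted density map
lcm = local_counting_map(density, patch=64)
print(lcm.shape, lcm.sum(), density.sum())    # total count is preserved (up to cropping)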



Paperid:1071
Authors:Ziheng Cheng, Ruiying Lu, Zhengjue Wang, Hao Zhang, Bo Chen, Ziyi Meng, Xin Yuan
Title: BIRNAT: Bidirectional Recurrent Neural Networks with Adversarial Training for Video Snapshot Compressive Imaging
Abstract:
We consider the problem of video snapshot compressive imaging (SCI), where multiple high-speed frames are coded by different masks and then summed into a single measurement. This measurement and the modulation masks are fed into our Recurrent Neural Network (RNN) to reconstruct the desired high-speed frames. Our end-to-end sampling and reconstruction system is dubbed BIdirectional Recurrent Neural networks with Adversarial Training (BIRNAT). To the best of our knowledge, this is the first time that recurrent networks are applied to the SCI problem. Our proposed BIRNAT outperforms other deep learning based algorithms as well as the state-of-the-art optimization based algorithm, DeSCI, by exploiting the underlying correlation of sequential video frames. BIRNAT employs a deep convolutional neural network with Resblock and feature map self-attention (AttRes-CNN) to reconstruct the first frame, based on which a bidirectional RNN (Bi-RNN) reconstructs the following frames in a sequential manner. To improve the quality of the reconstructed video, BIRNAT is further equipped with adversarial training in addition to the mean squared error loss. Extensive results on both simulation and real data (from two SCI cameras) demonstrate the superior performance of our BIRNAT system."
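The measurement model in the first sentence can be written in a few lines; the sketch below simulates a snapshot compressive measurement as a mask-weighted sum of high-speed frames, with the frame count, resolution, and binary masks chosen arbitrarily for illustration.

# Hedged sketch of the video SCI forward model: y = sum_t C_t * x_t.
import numpy as np

rng = np.random.default_rng(0)
B, H, W = 8, 64, 64
frames = rng.random((B, H, W))                         # latent high-speed frames x_t
masks = (rng.random((B, H, W)) > 0.5).astype(float)    # coding masks C_t

measurement = (masks * frames).sum(axis=0)             # single 2D snapshot y
# A reconstruction network (BIRNAT here) receives `measurement` and `masks`
# and is trained to recover the B frames.
print(measurement.shape)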



Paperid:1072
Authors:Zequn Qin, Huanyu Wang, Xi Li
Title: Ultra Fast Structure-aware Deep Lane Detection
Abstract:
Modern methods mainly regard lane detection as a problem of pixel-wise segmentation, which struggles to address challenging scenarios and to run at high speed. Inspired by human perception, the recognition of lanes under severe occlusion and extreme lighting conditions is mainly based on contextual and global information. Motivated by this observation, we propose a novel, simple, yet effective formulation aiming at extremely fast speed and challenging scenarios. Specifically, we treat the process of lane detection as a row-based selecting problem using global features. With the help of row-based selecting, our formulation significantly reduces the computational cost. Using a large receptive field on global features, we can also handle the challenging scenarios. Moreover, based on this formulation, we propose a structural loss to explicitly model the structure of lanes. Extensive experiments on two lane detection benchmark datasets show that our method achieves state-of-the-art performance in terms of both speed and accuracy. A lightweight version can even achieve 300+ frames per second at the same resolution, which is at least 4x faster than previous state-of-the-art methods. Our code is available at https://github.com/cfzd/Ultra-Fast-Lane-Detection."
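To make the row-based selecting formulation concrete, here is a hedged sketch of how per-row classification outputs could be decoded into lane positions; the output shapes, the extra "no lane" class, and the expectation-based decoding are illustrative assumptions, not the authors' exact implementation.

# Hedged sketch: decode row-wise cell logits into lane locations.
import numpy as np

def decode_lanes(logits):
    # logits: (num_lanes, num_rows, num_cells + 1); last channel = "lane absent"
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    cell_probs, absent = probs[..., :-1], probs[..., -1]
    cells = np.arange(cell_probs.shape[-1])
    loc = (cell_probs * cells).sum(axis=-1) / cell_probs.sum(axis=-1)  # expected cell index per row
    valid = absent < 0.5                                               # lane present on this row?
    return loc, valid

logits = np.random.randn(4, 18, 101)   # 4 lanes, 18 row anchors, 100 cells + background
loc, valid = decode_lanes(logits)
print(loc.shape, valid.mean())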



Paperid:1073
Authors:Subin Jeon, Seonghyeon Nam, Seoung Wug Oh, Seon Joo Kim
Title: Cross-Identity Motion Transfer for Arbitrary Objects through Pose-Attentive Video Reassembling
Abstract:
We propose an attention-based network for transferring motion between arbitrary objects. Given a source image(s) and a driving video, our network animates the subject in the source images according to the motion in the driving video. In our attention mechanism, dense similarities between the learned keypoints in the source and the driving images are computed in order to retrieve the appearance information from the source images. Taking a different approach from the well-studied warping based models, our attention-based model has several advantages. By reassembling non-locally searched pieces from the source contents, our approach can produce more realistic outputs. Furthermore, our system can make use of multiple observations of the source appearance (e.g. front and sides of faces) to make the results more accurate. To reduce the training-testing discrepancy of self-supervised learning, a novel cross-identity training scheme is additionally introduced. With this training scheme, our network is trained to transfer motion between different subjects, as in the real testing scenario. Experimental results validate that our method produces visually pleasing results in various object domains, showing better performance compared to previous works."



Paperid:1074
Authors:Zhenwei He, Lei Zhang
Title: Domain Adaptive Object Detection via Asymmetric Tri-way Faster-RCNN
Abstract:
Conventional object detection models inevitably encounter a performance drop when domain disparity exists. Unsupervised domain adaptive object detection has recently been proposed to reduce the disparity between domains, where the source domain is label-rich while the target domain is label-agnostic. Existing models follow a parameter-shared siamese structure for adversarial domain alignment, which, however, easily leads to collapse and an out-of-control risk for the source domain and brings a negative impact to feature adaptation. The main reason is that the labeling unfairness (asymmetry) between source and target makes the parameter sharing mechanism unable to adapt. Therefore, in order to avoid the source domain collapse risk caused by parameter sharing, we propose an asymmetric tri-way Faster-RCNN (ATF) for domain adaptive object detection. Our ATF model has two distinct merits: 1) An ancillary net supervised by source labels is deployed to learn ancillary target features and simultaneously preserve the discrimination of the source domain, which enhances the structural discrimination (object classification vs. bounding box regression) of domain alignment. 2) The asymmetric structure consisting of a chief net and an independent ancillary net essentially overcomes the source risk collapse arising from parameter sharing. The adaptation safety of the proposed ATF detector is guaranteed. Extensive experiments on a number of datasets, including Cityscapes, Foggy-cityscapes, KITTI, Sim10k, Pascal VOC, Clipart and Watercolor, demonstrate the SOTA performance of our method."



Paperid:1075
Authors:Xiaobo Wang, Tianyu Fu, Shengcai Liao, Shuo Wang, Zhen Lei, Tao Mei
Title: Exclusivity-Consistency Regularized Knowledge Distillation for Face Recognition
Abstract:
Knowledge distillation is an effective tool to compress large pre-trained Convolutional Neural Networks (CNNs) or their ensembles into models applicable to mobile and embedded devices. Its success mainly comes from two aspects: the designed student network and the exploited knowledge. However, current methods usually suffer from the low capability of the mobile-level student network and the unsatisfactory knowledge for distillation. In this paper, we propose a novel position-aware exclusivity to encourage large diversity among different filters of the same layer to alleviate the low capability of the student network. Moreover, we investigate the effect of several prevailing kinds of knowledge for face recognition distillation and conclude that the knowledge of feature consistency is more flexible and preserves much more information than others. Experiments on a variety of face recognition benchmarks have revealed the superiority of our method over the state-of-the-arts."



Paperid:1076
Authors:Ke-Chi Chang, Ren Wang, Hung-Jin Lin, Yu-Lun Liu, Chia-Ping Chen, Yu-Lin Chang, Hwann-Tzong Chen
Title: Learning Camera-Aware Noise Models
Abstract:
Modeling imaging sensor noise is a fundamental problem for image processing and computer vision applications. While most previous works adopt statistical noise models, real-world noise is far more complicated and beyond what these models can describe. To tackle this issue, we propose a data-driven approach, where a generative noise model is learned from real-world noise. The proposed noise model is camera-aware, that is, different noise characteristics of different camera sensors can be learned simultaneously, and a single learned noise model can generate different noise for different camera sensors. Experimental results show that our method quantitatively and qualitatively outperforms existing statistical noise models and learning-based methods. The source code and more results are available at https://arcchang1236.github.io/CA-NoiseGAN/"



Paperid:1077
Authors:Oshri Halimi, Ido Imanuel, Or Litany, Giovanni Trappolini, Emanuele Rodolà, Leonidas Guibas, Ron Kimmel
Title: Towards Precise Completion of Deformable Shapes
Abstract:
According to Aristotle, a philosopher in Ancient Greece, {\it ``the whole is greater than the sum of its parts''}. This statement was adopted to explain human perception by the Gestalt psychology school of thought in the twentieth century. Here, we claim that by observing part of an object which was previously acquired as a whole, one can deal with both partial matching and shape completion in a holistic manner. More specifically, given the geometry of a full, articulated object in a given pose, as well as a partial scan of the same object in a different pose, we address the {\it new} problem of matching the part to the whole while simultaneously reconstructing the new pose from its partial observation. Our approach is data-driven and takes the form of a Siamese autoencoder without the requirement of a consistent vertex labeling at inference time; as such, it can be used on unorganized point clouds as well as on triangle meshes. We demonstrate the practical effectiveness of our model in the applications of single-view deformable shape completion and dense shape correspondence, both on synthetic and real-world geometric data, where we outperform prior work on these tasks by a large margin."



Paperid:1078
Authors:Jiahao Li, Changhao Zhang, Ziyao Xu, Hangning Zhou, Chi Zhang
Title: Iterative Distance-Aware Similarity Matrix Convolution with Mutual-Supervised Point Elimination for Efficient Point Cloud Registration
Abstract:
In this paper, we propose a novel learning-based pipeline for partially overlapping 3D point cloud registration. The proposed model includes an iterative distance-aware similarity matrix convolution module to incorporate information from both the feature and Euclidean space into the pairwise point matching process. These convolution layers learn to match points based on joint information of the entire geometric features and Euclidean offset for each point pair, overcoming the disadvantage of matching by simply taking the inner product of feature vectors. Furthermore, a two-stage learnable point elimination technique is presented to improve computational efficiency and reduce false positive correspondence pairs. A novel mutual-supervision loss is proposed to train the model without extra annotations of keypoints. The pipeline can be easily integrated with both traditional (e.g. FPFH) and learning-based features. Experiments on partially overlapping and noisy point cloud registration show that our method outperforms the current state-of-the-art, while being more computationally efficient."



Paperid:1079
Authors:Amir Rahimi, Amirreza Shaban, Thalaiyasingam Ajanthan, Richard Hartley, Byron Boots
Title: Pairwise Similarity Knowledge Transfer for Weakly Supervised Object Localization
Abstract:
Weakly Supervised Object Localization (WSOL) methods only require image level labels as opposed to expensive bounding box annotations required by fully supervised algorithms. We study the problem of learning localization model on target classes with weakly supervised image labels, helped by a fully annotated source dataset. Typically, a WSOL model is first trained to predict class generic objectness scores on an off-the-shelf fully supervised source dataset and then it is progressively adapted to learn the objects in the weakly supervised target dataset. In this work, we argue that learning only an objectness function is a weak form of knowledge transfer and propose to learn a classwise pairwise similarity function that directly compares two input proposals as well. The combined localization model and the estimated object annotations are jointly learned in an alternating optimization paradigm as is typically done in standard WSOL methods. In contrast to the existing work that learns pairwise similarities, our approach optimizes a unified objective with convergence guarantee and it is computationally efficient for large-scale applications. Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of pairwise similarity function. For instance, in the ILSVRC dataset, the Correct Localization (CorLoc) performance improves from 72.8% to 78.2% which is a new state-of-the-art for WSOL task in the context of knowledge transfer."



Paperid:1080
Authors:Xin Eric Wang, Vihan Jain, Eugene Ie, William Yang Wang, Zornitsa Kozareva, Sujith Ravi
Title: Environment-agnostic Multitask Learning for Natural Language Grounded Navigation
Abstract:
Recent research efforts enable study for natural language grounded navigation in photo-realistic environments, e.g., following natural language instructions or dialog. However, existing methods tend to overfit training data in seen environments and fail to generalize well in previously unseen environments. To close the gap between seen and unseen environments, we aim at learning a generalized navigation model from two novel perspectives: (1) we introduce a multitask navigation model that can be seamlessly trained on both Vision-Language Navigation (VLN) and Navigation from Dialog History (NDH) tasks, which benefits from richer natural language guidance and effectively transfers knowledge across tasks; (2) we propose to learn environment-agnostic representations for the navigation policy that are invariant among the environments seen during training, thus generalizing better on unseen environments. Extensive experiments show that environment-agnostic multitask learning significantly reduces the performance gap between seen and unseen environments, and the navigation agent trained in this manner outperforms baselines on unseen environments by 16\% (relative measure on success rate) on VLN and 120\% (goal progress) on NDH. Our submission to the CVDN leaderboard establishes a new state-of-the-art for the NDH task on the holdout test set. Code is available at https://github.com/google-research/valan ."



Paperid:1081
Authors:Binghua Li, Chao Li, Feng Duan, Ning Zheng, Qibin Zhao
Title: TPFN: Applying Outer Product along Time to Multimodal Sentiment Analysis Fusion on Incomplete Data
Abstract:
Multimodal sentiment analysis (MSA) has been widely investigated in both computer vision and natural language processing. However, studies on imperfect data, especially data with missing values, remain challenging and far from successful, even though such an issue is ubiquitous in the real world. Although previous works show promising performance by exploiting the low-rank structure of the fused features, only the first-order statistics of the temporal dynamics are considered. To this end, we propose a novel network architecture termed Time Product Fusion Network (TPFN), which takes into account the high-order statistics over both modalities and temporal dynamics. We construct the fused features by the outer product along adjacent time-steps, such that richer modal and temporal interactions are utilized. In addition, we claim that the low-rank structure can be obtained by regularizing the Frobenius norm of latent factors instead of the fused features. Experiments on the CMU-MOSI and CMU-MOSEI datasets show that TPFN can compete with state-of-the-art approaches in multimodal sentiment analysis under both random and structured missing values."
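A minimal sketch of the outer-product-along-time idea follows: features from two adjacent time steps are each padded with a constant 1 (so first-order terms survive in the product) and their outer product is flattened into a fused feature. The dimensions and the padding trick are assumptions for illustration, not the authors' exact construction.

# Hedged sketch: second-order fusion of adjacent time-step features via outer product.
import numpy as np

def time_product_fusion(z_t, z_t1):
    a = np.concatenate([z_t, [1.0]])     # append 1 so unimodal (first-order) terms remain
    b = np.concatenate([z_t1, [1.0]])
    return np.outer(a, b).ravel()        # (d+1)*(d+1) fused feature

T, d = 20, 32                            # time steps, per-step multimodal feature size
seq = np.random.randn(T, d)
fused = np.stack([time_product_fusion(seq[t], seq[t + 1]) for t in range(T - 1)])
print(fused.shape)                       # (T-1, (d+1)**2)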



Paperid:1082
Authors:Eu Wern Teh, Terrance DeVries, Graham W. Taylor
Title: ProxyNCA++: Revisiting and Revitalizing Proxy Neighborhood Component Analysis
Abstract:
We consider the problem of distance metric learning (DML), where the task is to learn an effective similarity measure between images. We revisit ProxyNCA and incorporate several enhancements. We find that low temperature scaling is a performance-critical component and explain why it works. Besides, we also discover that Global Max Pooling works better in general when compared to Global Average Pooling. Additionally, our proposed fast-moving proxies also address the small gradient issue of proxies, and this component synergizes well with low temperature scaling and Global Max Pooling. Our enhanced model, called ProxyNCA++, achieves a 22.9 percentage point average improvement of Recall@1 across four different zero-shot retrieval datasets compared to the original ProxyNCA algorithm. Furthermore, we achieve state-of-the-art results on the CUB200, Cars196, Sop, and InShop datasets, achieving Recall@1 scores of 72.2, 90.1, 81.4, and 90.9, respectively."
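To illustrate the low temperature scaling component highlighted above, the sketch below computes a proxy-based softmax cross-entropy over cosine similarities divided by a small temperature; the temperature value, normalization scheme, and shapes are assumptions, and this is an illustrative loss only, not the full ProxyNCA++ recipe.

# Hedged sketch: proxy-based loss with low temperature scaling.
import numpy as np

def proxy_loss(embeddings, proxies, labels, temperature=0.1):
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    logits = x @ p.T / temperature                    # low T sharpens the softmax
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

emb = np.random.randn(8, 128)
prx = np.random.randn(10, 128)           # one proxy per class
lbl = np.random.randint(0, 10, size=8)
print(proxy_loss(emb, prx, lbl))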



Paperid:1083
Authors:Wonkyung Lee, Junghyup Lee, Dohyung Kim, Bumsub Ham
Title: Learning with Privileged Information for Efficient Image Super-Resolution
Abstract:
Convolutional neural networks (CNNs) have allowed remarkable advances in single image super-resolution (SISR) over the last decade. Most SR methods based on CNNs have focused on achieving performance gains in terms of quality metrics, such as PSNR and SSIM, over classical approaches, and they typically require a large amount of memory and computational units. FSRCNN, consisting of a small number of convolutional layers, has shown promising results while using an extremely small number of network parameters. We introduce in this paper a novel distillation framework, consisting of teacher and student networks, that drastically boosts the performance of FSRCNN. To this end, we propose to use ground-truth high-resolution (HR) images as privileged information. The encoder in the teacher learns the degradation process, subsampling of HR images, using an imitation loss. The student and the decoder in the teacher, having the same network architecture as FSRCNN, try to reconstruct HR images. Intermediate features in the decoder, affordable for the student to learn, are transferred to the student through feature distillation. Experimental results on standard benchmarks demonstrate the effectiveness and the generalization ability of our framework, which significantly boosts the performance of FSRCNN as well as other SR methods."



Paperid:1084
Authors:Jianing Li,, Shiliang Zhang
Title: Joint Visual and Temporal Consistency for Unsupervised Domain Adaptive Person Re-Identification
Abstract:
Unsupervised domain adaptive person Re-IDentification (ReID) is challenging because of the large domain gap between source and target domains, as well as the lack of labeled data on the target domain. This paper tackles this challenge by jointly enforcing visual and temporal consistency in the combination of a local one-hot classification and a global multi-class classification. The local one-hot classification assigns images in a training batch with different person IDs, then adopts a Self-Adaptive Classification (SAC) model to classify them. The global multi-class classification is achieved by predicting labels on the entire unlabeled training set with the Memory-based Temporal-guided Cluster (MTC). MTC predicts multi-class labels by considering both visual similarity and temporal consistency to ensure the quality of label prediction. The two classification models are combined in a unified framework, which effectively leverages the unlabeled data for discriminative feature learning. Experimental results on three large-scale ReID datasets demonstrate the superiority of the proposed method in both purely unsupervised and unsupervised domain adaptive ReID tasks. For example, under the unsupervised setting, our method outperforms recent unsupervised domain adaptive methods, which leverage more labels for training."



Paperid:1085
Authors:Mingeun Kang, Kiwon Lee, Yong H. Lee, Changho Suh
Title: Autoencoder-based Graph Construction for Semi-supervised Learning
Abstract:
We consider graph-based semi-supervised learning that leverages a similarity graph across data points to better exploit the data structure exposed in unlabeled data. One challenge that arises in this problem context is that conventional matrix completion, which can serve to construct a similarity graph, entails heavy computational overhead, since it re-trains the graph independently whenever the model parameters of the classifier of interest are updated. In this paper, we propose a holistic approach that employs a parameterized neural-net-based autoencoder for matrix completion, thereby enabling simultaneous training between the models of the classifier and matrix completion. We find that this approach not only speeds up training time (around a three-fold improvement over a prior approach), but also offers higher prediction accuracy via a more accurate graph estimate. We demonstrate that our algorithm obtains state-of-the-art performance by respectable margins on benchmark datasets: achieving error rates of 0.57\% on MNIST with 100 labels, 3.48\% on SVHN with 1000 labels, and 6.87\% on CIFAR-10 with 4000 labels."



Paperid:1086
Authors:Abhijit Kundu, Xiaoqi Yin, Alireza Fathi, David Ross, Brian Brewington, Thomas Funkhouser, Caroline Pantofaru
Title: Virtual Multi-view Fusion for 3D Semantic Segmentation
Abstract:
Semantic segmentation of 3D meshes is an important problem for 3D scene understanding. In this paper we revisit the classic multiview representation of 3D meshes and study several techniques that make it effective for 3D semantic segmentation of meshes. Given a 3D mesh reconstructed from RGBD sensors, our method effectively chooses different virtual views of the 3D mesh and renders multiple 2D channels for training an effective 2D semantic segmentation model. Features from multiple per-view predictions are finally fused on the 3D mesh vertices to predict mesh semantic segmentation labels. Using the large-scale indoor 3D semantic segmentation benchmark ScanNet, we show that our virtual views enable more effective data augmentation not possible with 2D-only methods. Overall, our virtual multiview fusion method achieves significantly better 3D semantic segmentation results compared to prior multiview approaches and is competitive with state-of-the-art 3D sparse convolution approaches."



Paperid:1087
Authors:Ke Cheng, Yifan Zhang, Congqi Cao, Lei Shi, Jian Cheng, Hanqing Lu
Title: Decoupling GCN with DropGraph Module for Skeleton-Based Action Recognition
Abstract:
In skeleton-based action recognition, graph convolutional networks (GCNs) have achieved remarkable success. Nevertheless, how to efficiently model the spatial-temporal skeleton graph without introducing extra computation burden is a challenging problem for industrial deployment. In this paper, we rethink the spatial aggregation in existing GCN-based skeleton action recognition methods and discover that they are limited by coupling aggregation mechanism. Inspired by the decoupling aggregation mechanism in CNNs, we propose decoupling GCN to boost the graph modeling ability with no extra computation, no extra latency, no extra GPU memory cost, and less than 10% extra parameters. Another prevalent problem of GCNs is over-fitting. Although dropout is a widely used regularization technique, it is not effective for GCNs, due to the fact that activation units are correlated between neighbor nodes. We propose DropGraph to discard features in correlated nodes, which is particularly effective on GCNs. Moreover, we introduce an attention-guided drop mechanism to enhance the regularization effect. All our contributions introduce zero extra computation burden at deployment. We conduct experiments on three datasets (NTU-RGBD, NTU-RGBD-120, and Northwestern-UCLA) and exceed the state-of-the-art performance with less computation cost."
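As a rough illustration of dropping features in correlated nodes rather than independently, the sketch below zeroes out randomly chosen seed nodes together with their graph neighbours and rescales the remainder in the spirit of inverted dropout; the adjacency format, drop rate, and rescaling are assumptions, and the attention-guided variant from the abstract is not shown.

# Hedged sketch: drop seed nodes and their neighbours jointly on a skeleton graph.
import numpy as np

def drop_graph(features, adjacency, drop_rate=0.1, rng=None):
    # features: (num_nodes, channels); adjacency: boolean (num_nodes, num_nodes)
    rng = np.random.default_rng() if rng is None else rng
    n = features.shape[0]
    seeds = rng.random(n) < drop_rate
    dropped = seeds | (adjacency @ seeds.astype(int) > 0)   # seeds plus their neighbours
    out = features.copy()
    out[dropped] = 0.0
    keep = 1.0 - dropped.mean()
    return out / max(keep, 1e-6)          # rescale like inverted dropout

nodes, channels = 25, 64                  # e.g., 25 skeleton joints
feat = np.random.randn(nodes, channels)
adj = np.random.rand(nodes, nodes) < 0.1
adj = adj | adj.T
out = drop_graph(feat, adj, drop_rate=0.1)
print((out == 0).all(axis=1).sum(), "joints dropped")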



Paperid:1088
Authors:Yunhao Ba, Alex Gilbert, Franklin Wang, Jinfa Yang, Rui Chen, Yiqin Wang, Lei Yan, Boxin Shi, Achuta Kadambi
Title: Deep Shape from Polarization
Abstract:
This paper makes a first attempt to bring the Shape from Polarization (SfP) problem to the realm of deep learning. The previous state-of-the-art methods for SfP have been purely physics-based. We see value in these principled models, and blend these physical models as priors into a neural network architecture. This proposed approach achieves results that exceed the previous state-of-the-art on a challenging dataset we introduce. This dataset consists of polarization images taken over a range of object textures, paints, and lighting conditions. We report that our proposed method achieves the lowest test error on each tested condition in our dataset, showing the value of blending data-driven and physics-driven approaches. "



Paperid:1089
Authors:Xingyu Chen, Xuguang Lan, Fuchun Sun, Nanning Zheng
Title: A Boundary Based Out-of-Distribution Classifier for Generalized Zero-Shot Learning
Abstract:
Generalized Zero-Shot Learning (GZSL) is a challenging topic that has promising prospects in many realistic scenarios. Using a gating mechanism that discriminates the unseen samples from the seen samples can decompose the GZSL problem to a conventional Zero-Shot Learning (ZSL) problem and a supervised classification problem. However, training the gate is usually challenging due to the lack of data in the unseen domain. To resolve this problem, in this paper, we propose a boundary based Out-of-Distribution (OOD) classifier which classifies the unseen and seen domains by only using seen samples for training. First, we learn a shared latent space on a unit hyper-sphere where the latent distributions of visual features and semantic attributes are aligned class-wisely. Then we find the boundary and the center of the manifold for each class. By leveraging the class centers and boundaries, the unseen samples can be separated from the seen samples. After that, we use two experts to classify the seen and unseen samples separately. We extensively validate our approach on five popular benchmark datasets including AWA1, AWA2, CUB, FLO and SUN. The experimental results show that our approach surpasses state-of-the-art approaches by a significant margin."



Paperid:1090
Authors:Jianfei Yang, Han Zou, Yuxun Zhou, Zhaoyang Zeng, Lihua Xie ()
Title: Mind the Discriminability: Asymmetric Adversarial Domain Adaptation
Abstract:
Adversarial domain adaptation has made tremendous success by learning domain-invariant feature representations. However, conventional adversarial training pushes two domains together and brings uncertainty to feature learning, which deteriorates discriminability in the target domain. In this paper, we tackle this problem by designing a simple yet effective scheme, namely Asymmetric Adversarial Domain Adaptation (AADA). We notice that source features preserve great feature discriminability due to full supervision, and therefore a novel asymmetric training scheme is designed to keep the source features fixed and encourage the target features to approach the source features, which best preserves the feature discriminability learned from source labeled data. This is achieved by an autoencoder-based domain discriminator that only embeds the source domain, while the feature extractor learns to deceive the autoencoder by embedding the target domain. Theoretical justifications corroborate that our method minimizes the domain discrepancy, and spectral analysis is employed to quantify the improved feature discriminability. Extensive experiments on several benchmarks validate that our method outperforms existing adversarial domain adaptation methods significantly and demonstrates robustness with respect to hyper-parameter sensitivity."



Paperid:1091
Authors:Zhizhong Han, Guanhui Qiao, Yu-Shen Liu, Matthias Zwicker
Title: SeqXY2SeqZ: Structure Learning for 3D Shapes by Sequentially Predicting 1D Occupancy Segments From 2D Coordinates
Abstract:
Structure learning for 3D shapes is vital for 3D computer vision. State-of-the-art methods show promising results by representing shapes using implicit functions in 3D that are learned using discriminative neural networks. However, learning implicit functions requires dense and irregular sampling in 3D space, which also makes the sampling methods affect the accuracy of shape reconstruction during test. To avoid dense and irregular sampling in 3D, we propose to represent shapes using 2D functions, where the output of the function at each 2D location is a sequence of line segments inside the shape. Our approach leverages the power of functional representations, but without the disadvantage of 3D sampling. Specifically, we use a voxel tubelization to represent a voxel grid as a set of tubes along any one of the X, Y, or Z axes. Each tube can be indexed by its 2D coordinates on the plane spanned by the other two axes. We further simplify each tube into a sequence of occupancy segments. Each occupancy segment consists of successive voxels occupied by the shape, which leads to a simple representation of its 1D start and end location. Given the 2D coordinates of the tube and a shape feature as condition, this representation enables us to learn 3D shape structures by sequentially predicting the start and end locations of each occupancy segment in the tube. We implement this approach using a Seq2Seq model with attention, called SeqXY2SeqZ, which learns the mapping from a sequence of 2D coordinates along two arbitrary axes to a sequence of 1D locations along the third axis. SeqXY2SeqZ not only benefits from the regularity of voxel grids in training and testing, but also achieves high memory efficiency. Our experiments show that SeqXY2SeqZ outperforms the state-of-the-art methods under the widely used benchmarks."
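The voxel tubelization described above is easy to picture in code: each (x, y) tube of a binary voxel grid becomes a list of (start, end) indices of consecutive occupied voxels along the remaining axis. The sketch below assumes a small boolean grid and the Z axis as the tube direction; it illustrates only the representation, not the SeqXY2SeqZ model itself.

# Hedged sketch: turn a binary voxel grid into 1D occupancy segments per (x, y) tube.
import numpy as np

def tubelize(voxels):
    # voxels: boolean array of shape (X, Y, Z)
    segments = {}
    X, Y, Z = voxels.shape
    for x in range(X):
        for y in range(Y):
            tube = voxels[x, y]
            # locate positions where occupancy switches on/off along Z
            diff = np.diff(tube.astype(int), prepend=0, append=0)
            starts = np.flatnonzero(diff == 1)
            ends = np.flatnonzero(diff == -1) - 1
            if len(starts):
                segments[(x, y)] = list(zip(starts.tolist(), ends.tolist()))
    return segments

grid = np.zeros((4, 4, 16), dtype=bool)
grid[1, 2, 3:7] = True
grid[1, 2, 10:12] = True
print(tubelize(grid)[(1, 2)])   # [(3, 6), (10, 11)]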



Paperid:1092
Authors:ShiJie Sun, Naveed Akhtar, XiangYu Song, HuanSheng Song, Ajmal Mian , Mubarak Shah
Title: Simultaneous Detection and Tracking with Motion Modelling for Multiple Object Tracking
Abstract:
Deep learning based Multiple Object Tracking (MOT) currently relies on off-the-shelf detectors for tracking-by-detection. This results in deep models that are detector biased and evaluations that are detector influenced. To resolve this issue, we introduce the Deep Motion Modeling Network (DMM-Net), which can estimate multiple objects' motion parameters to perform joint detection and association in an end-to-end manner. DMM-Net models object features over multiple frames and simultaneously infers object classes, visibility, and their motion parameters. These outputs are readily used to update the tracklets for efficient MOT. DMM-Net achieves a PR-MOTA score of 12.80 at 120+ fps on the popular UA-DETRAC challenge, delivering better performance at orders of magnitude higher speed. We also contribute a synthetic large-scale public dataset, Omni-MOT, for vehicle tracking that provides precise ground-truth annotations to eliminate the detector influence in MOT evaluation. This 14M+ frames dataset is extendable with our public script. We demonstrate the suitability of Omni-MOT for deep learning with DMM-Net, and also make the source code of our network public."



Paperid:1093
Authors:Feihu Zhang, Jin Fang, Benjamin Wah, Philip Torr
Title: Deep FusionNet for Point Cloud Semantic Segmentation
Abstract:
Many point cloud segmentation methods rely on transferring irregular points into a voxel-based regular representation. Although voxel-based convolutions are useful for feature aggregation, they produce ambiguous or wrong predictions if a voxel contains points from different classes. Other approaches (such as PointNets and point-wise convolutions) can take irregular points for feature learning. But their high memory and computational costs (such as for neighborhood search and ball-querying) limit their ability and accuracy for large-scale point cloud processing. To address these issues, we propose a deep fusion network architecture (FusionNet) with a unique voxel-based ``mini-PointNet'' point cloud representation and a new feature aggregation module (fusion module) for large-scale 3D semantic segmentation. Our FusionNet can learn more accurate point-wise predictions when compared to voxel-based convolutional networks. It can realize more effective feature aggregations with lower memory and computational complexity for large-scale point cloud segmentation when compared to the popular point-wise convolutions. Our experimental results show that FusionNet can take more than one million points on one GPU for training to achieve state-of-the-art accuracy on large-scale Semantic KITTI benchmark."



Paperid:1094
Authors:Bichuan Guo, Jiangtao Wen, Yuxing Han
Title: Deep Material Recognition in Light-Fields via Disentanglement of Spatial and Angular Information
Abstract:
Light-field cameras capture sub-views from multiple perspectives simultaneously, with possible reflectance variations that can be used to augment material recognition in remote sensing, autonomous driving, etc. Existing approaches for light-field based material recognition suffer from the entanglement between the angular and spatial domains, leading to inefficient training which in turn limits their performance. In this paper, we propose an approach that decouples angular and spatial information by establishing correspondences in the angular domain, then employs regularization to enforce a rotational invariance. As opposed to relying on the Lambertian surface assumption, we align the angular domain by estimating sub-pixel displacements using the Fourier transform. The network takes sparse inputs, i.e. sub-views along particular directions, to gain structural information about the angular domain. A novel regularization technique further improves generalization by weight sharing and max-pooling among different directions. The proposed approach outperforms any previously reported method on multiple datasets. The accuracy gain over 2D images is improved by a factor of 1.5. Extensive ablation studies are conducted to demonstrate the significance of each component."



Paperid:1095
Authors:Shuo Wang, Yuexiang Li, Kai Ma, Ruhui Ma, Haibing Guan, Yefeng Zheng
Title: Dual Adversarial Network for Deep Active Learning
Abstract:
Active learning, reducing the cost and workload of annotations, attracts increasing attention from the community. Current active learning approaches commonly adopt uncertainty-based acquisition functions for data selection due to their effectiveness. However, data selection based on uncertainty suffers from the overlapping problem, i.e., the top-$K$ samples ranked by uncertainty are similar. In this paper, we investigate the overlapping problem of recent uncertainty-based approaches and propose to alleviate the issue by taking representativeness into consideration. In particular, we propose a dual adversarial network, namely DAAL, for this purpose. Different from previous hybrid active learning methods requiring multi-stage data selection, i.e., step-by-step evaluating the uncertainty and representativeness using different acquisition functions, our DAAL learns to select the most uncertain and representative data points in one stage. Extensive experiments conducted on three publicly available datasets, i.e., CIFAR10/100 and Cityscapes, demonstrate the effectiveness of our method---a new state-of-the-art accuracy is achieved."



Paperid:1096
Authors:Ka Leong Cheng, Zhaoyang Yang, Qifeng Chen, Yu-Wing Tai
Title: Fully Convolutional Networks for Continuous Sign Language Recognition
Abstract:
Continuous sign language recognition (SLR) is a challenging task that requires learning on both spatial and temporal dimensions of signing frame sequences. Most recent work accomplishes this by using CNN and RNN hybrid networks. However, training these networks is generally non-trivial, and most of them fail in learning unseen sequence patterns, causing an unsatisfactory performance for online recognition. In this paper, we propose a fully convolutional network (FCN) for online SLR to concurrently learn spatial and temporal features from weakly annotated video sequences with only sentence-level annotations given. A gloss feature enhancement (GFE) module is introduced in the proposed network to enforce better sequence alignment learning. The proposed network is end-to-end trainable without any pre-training. We conduct experiments on two large scale SLR datasets. Experiments show that our method for continuous SLR is effective and performs well in online recognition."



Paperid:1097
Authors:Matteo Poggi, Filippo Aleotti, Fabio Tosi, Giulio Zaccaroni, Stefano Mattoccia
Title: Self-adapting confidence estimation for stereo
Abstract:
Estimating the confidence of disparity maps inferred by a stereo algorithm has become a very relevant task over the years, due to the increasing number of applications leveraging such a cue. Although self-supervised learning has recently spread across many computer vision tasks, it has been barely considered in the field of confidence estimation. In this paper, we propose a flexible and lightweight solution enabling self-adapting confidence estimation agnostic to the stereo algorithm or network. Our approach relies on the minimum information available in any stereo setup (i.e., the input stereo pair and the output disparity map) to learn an effective confidence measure. This strategy allows not only a seamless integration with any stereo system, including consumer and industrial devices equipped with undisclosed stereo perception methods, but also, thanks to its self-adapting capability, out-of-the-box deployment in the field. Exhaustive experimental results with different standard datasets support our claims, showing how our solution is the first ever to enable online learning of accurate confidence estimation for any stereo system, without any requirement for the end-user."



Paperid:1098
Authors:Quewei Li, Jie Guo, Yang Fei, Qinyu Tang, Wenxiu Sun, Jin Zeng, Yanwen Guo
Title: Deep Surface Normal Estimation on the 2-Sphere with Confidence Guided Semantic Attention
Abstract:
We propose a deep convolutional neural network (CNN) to estimate surface normal from a single color image accompanied with a low-quality depth channel. Unlike most previous works, we predict the normal on the 2-sphere rather than the 3D Euclidean space, which produces naturally normalized values and makes the training stable. Although the depth information is beneficial for normal estimation, the raw data contain missing values and noises. To alleviate this problem, we employ a confidence guided semantic attention (CGSA) module to progressively improve the quality of depth channel during training. The continuously refined depth features are fused with the normal features at multiple scales with the mutual feature fusion (MFF) modules to fully exploit the correlations between normals and depth, resulting in high quality normals and depth with fine details. Extensive experiments on multiple benchmark datasets prove the superiority of the proposed method."



Paperid:1099
Authors:Hui Zhang, Quanming Yao, Mingkun Yang, Yongchao Xu, Xiang Bai
Title: AutoSTR: Efficient Backbone Search for Scene Text Recognition
Abstract:
Scene text recognition (STR) is challenging due to the diversity of text instances and the complexity of scenes. However, no STR methods can adapt backbones to different diversities and complexities. In this work, inspired by the success of neural architecture search (NAS), we propose automated STR (AutoSTR), which can address the above issue by searching data-dependent backbones. Specifically, we show both choices on operations and the downsampling path are very important in the search space design of NAS. Besides, since no existing NAS algorithms can handle the spatial constraint on the path, we propose a two-step search algorithm, which decouples operations and downsampling path, for an efficient search in the given space. Experiments demonstrate that, by searching data-dependent backbones, AutoSTR can outperform the state-of-the-art approaches on standard benchmarks with much fewer FLOPS and model parameters."



Paperid:1100
Authors:Sungwon Han, Sungwon Park, Sungkyu Park, Sundong Kim, Meeyoung Cha
Title: Mitigating Embedding and Class Assignment Mismatch in Unsupervised Image Classification
Abstract:
Unsupervised image classification is a challenging computer vision task. Deep learning-based algorithms have achieved superb results, where the latest approach adopts unified losses from embedding and class assignment processes. Since these processes inherently have different goals, jointly optimizing them may lead to a suboptimal solution. To address this limitation, we propose a novel two-stage algorithm in which an embedding module for pretraining precedes a refining module that concurrently performs embedding and class assignment. Our model outperforms SOTA when tested on multiple datasets, achieving a substantially higher accuracy of 81.0% on the CIFAR-10 dataset (i.e., an increase of 19.3 percentage points), 35.3% on CIFAR-100-20 (9.6 pp), and 66.5% on STL-10 (6.9 pp) in unsupervised tasks."



Paperid:1101
Authors:Weitao Wan, Jiansheng Chen, Ming-Hsuan Yang
Title: Adversarial Training with Bi-directional Likelihood Regularization for Visual Classification
Abstract:
Neural networks are vulnerable to adversarial attacks. Practically, adversarial training is by far the most effective approach for enhancing the robustness of neural networks against adversarial examples. The current adversarial training approach aims to maximize the posterior probability for adversarially perturbed training data. However, such a training strategy ignores the fact that the clean data and adversarial examples should have intrinsically different feature distributions despite that they are assigned with the same class label under adversarial training. We propose that this problem can be solved by explicitly modeling the deep feature distribution, for example as a Gaussian Mixture, and then properly introducing the likelihood regularization into the loss function. Specifically, by maximizing the likelihood of features of clean data and minimizing that of adversarial examples simultaneously, the neural network learns a more reasonable feature distribution in which the intrinsic difference between clean data and adversarial examples can be explicitly preserved. We call such a new robust training strategy the adversarial training with bi-directional likelihood regularization (ATBLR) method. Extensive experiments on various datasets demonstrate that the ATBLR method facilitates robust classification of both clean data and adversarial examples, and performs favorably against previous state-of-the-art methods for robust visual classification."
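A rough sketch of the bi-directional likelihood regularization idea follows, under the simplifying assumption of fixed class-conditional diagonal Gaussians in feature space: the regularizer raises the likelihood of clean features under their class Gaussian while lowering that of adversarial features. The Gaussian parameters, weighting, and shapes are illustrative assumptions, not the authors' exact formulation.

# Hedged sketch: bi-directional likelihood regularizer on deep features.
import numpy as np

def gaussian_loglik(feats, mean, var):
    # diagonal Gaussian log-likelihood, summed over feature dimensions
    return -0.5 * (((feats - mean) ** 2) / var + np.log(2 * np.pi * var)).sum(axis=1)

def bidirectional_likelihood_reg(clean_feats, adv_feats, labels, means, var, weight=0.1):
    ll_clean = gaussian_loglik(clean_feats, means[labels], var)
    ll_adv = gaussian_loglik(adv_feats, means[labels], var)
    # maximize clean likelihood, minimize adversarial likelihood
    return weight * (-ll_clean.mean() + ll_adv.mean())

d, c, n = 16, 10, 8
means = np.random.randn(c, d)            # stand-in class means
var = np.ones(d)
clean = np.random.randn(n, d)
adv = clean + 0.3 * np.random.randn(n, d)
labels = np.random.randint(0, c, size=n)
print(bidirectional_likelihood_reg(clean, adv, labels, means, var))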



Paperid:1102
Authors:Ryuichiro Hataya, Zdenek Jan, Kazuki Yoshizoe, Hideki Nakayama
Title: Faster AutoAugment: Learning Augmentation Strategies Using Backpropagation
Abstract:
Data augmentation methods are indispensable heuristics to boost the performance of deep neural networks, especially in image recognition tasks. Recently, several studies have shown that augmentation strategies found by search algorithms outperform hand-made strategies. Such methods employ black-box search algorithms over image transformations with continuous or discrete parameters and require a long time to obtain better strategies. In this paper, we propose a differentiable policy search pipeline for data augmentation, which is much faster than previous methods. We introduce approximate gradients for several transformation operations with discrete parameters as well as a differentiable mechanism for selecting operations. As the objective of training, we minimize the distance between the distributions of augmented and original data, which can be differentiated. We show that our method, Faster AutoAugment, achieves significantly faster searching than prior methods without a performance drop."



Paperid:1103
Authors:Lin Huang, Jianchao Tan, Ji Liu, Junsong Yuan
Title: Hand-Transformer: Non-Autoregressive Structured Modeling for 3D Hand Pose Estimation
Abstract:
3D hand pose estimation is still far from a well-solved problem mainly due to the highly nonlinear dynamics of hand pose and the difficulties of modeling its inherent structural dependencies. To address this issue, we connect this structured output learning problem with the structured modeling framework in sequence transduction field. Standard transduction models like Transformer adopt an autoregressive connection to capture dependencies from previously generated tokens and further correlate this information with the input sequence in order to prioritize the set of relevant input tokens for current token generation. To borrow wisdom from this structured learning framework while avoiding the sequential modeling for hand pose, taking a 3D point set as input, we propose to leverage the Transformer architecture with a novel non-autoregressive structured decoding mechanism. Specifically, instead of using previously generated results, our decoder utilizes a reference hand pose to provide equivalent dependencies among hand joints for each output joint generation. By imposing the reference structural dependencies, we can correlate the information with the input 3D points through a multi-head attention mechanism, aiming to discover informative points from different perspectives, towards each hand joint localization. We demonstrate our model's effectiveness over multiple challenging hand pose datasets, comparing with several state-of-the-art methods."



Paperid:1104
Authors:Zhenzhi Wang, Ziteng Gao, Limin Wang, Zhifeng Li, Gangshan Wu
Title: Boundary-Aware Cascade Networks for Temporal Action Segmentation
Abstract:
Identifying human action segments in an untrimmed video is still challenging due to boundary ambiguity and over-segmentation issues. To address these problems, we present a new boundary-aware cascade network by introducing two novel components. First, we devise a new cascading paradigm, called Stage Cascade, to enable our model to have adaptive receptive fields and more confident predictions for ambiguous frames. Second, we design a general and principled smoothing operation, termed local barrier pooling, to aggregate local predictions by leveraging semantic boundary information. Moreover, these two components can be jointly fine-tuned in an end-to-end manner. We perform experiments on three challenging datasets: 50Salads, GTEA and the Breakfast dataset, demonstrating that our framework significantly outperforms the current state-of-the-art methods. The code is available at https://github.com/MCG-NJU/BCN."



Paperid:1105
Authors:Xu Yan, Weibing Zhao, Kun Yuan, Ruimao Zhang, Zhen Li, Shuguang Cui
Title: Towards Content-Independent Multi-Reference Super-Resolution: Adaptive Pattern Matching and Feature Aggregation
Abstract:
Recovering realistic textures from a largely down-sampled low resolution (LR) image with complicated patterns is a challenging problem in image super-resolution. This work investigates a novel multi-reference based super-resolution problem by proposing a Content Independent Multi-Reference Super-Resolution (CIMR-SR) model, which is able to adaptively match the visual pattern between references and target image in the low resolution and enhance the feature representation of the target image in the higher resolution. CIMR-SR significantly improves the flexibility of the recently proposed reference-based super-resolution (RefSR), which needs to select the specific high-resolution reference (e.g., content similarity, camera view and relative scale) for each target image. In practice, a universal reference pool (RP) is built up for recovering all LR targets by searching the local matched patterns. By exploiting feature-based patch searching and attentive reference feature aggregation, the proposed CIMR-SR generates realistic images with much better perceptual quality and richer fine-details. Extensive experiments demonstrate the proposed CIMR-SR outperforms state-of-the-art methods in both qualitative and quantitative reconstructions."



Paperid:1106
Authors:Yael Konforti, Alon Shpigler, Boaz Lerner, Aharon Bar-Hillel
Title: Inference Graphs for CNN Interpretation
Abstract:
Convolutional neural networks (CNNs) have achieved superior accuracy in many visual related tasks. However, the inference process through intermediate layers is opaque, making it difficult to interpret such networks or develop trust in their operation. We propose to model the network hidden layers activity using probabilistic models. The activity patterns in layers of interest are modeled as Gaussian mixture models, and transition probabilities between clusters in consecutive modeled layers are estimated. Based on maximum-likelihood considerations, a subset of the nodes and paths relevant for network prediction are chosen, connected, and visualized as an inference graph. We show that such graphs are useful for understanding the general inference process of a class, as well as explaining decisions the network makes regarding specific images. In addition, the models provide an interesting observation regarding the highly local nature of column activities in top CNN layers."
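To give a flavour of the modeling pipeline, the sketch below fits a Gaussian mixture to the activations of two consecutive layers and estimates transition probabilities between their clusters by counting co-assignments over the same images; the cluster counts, random stand-in activations, and use of scikit-learn are assumptions for illustration only.

# Hedged sketch: per-layer GMMs and cluster-to-cluster transition probabilities.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
acts_a = rng.normal(size=(500, 32))      # stand-in activations of layer A (N x C)
acts_b = rng.normal(size=(500, 64))      # activations of layer B for the same images

gmm_a = GaussianMixture(n_components=5, random_state=0).fit(acts_a)
gmm_b = GaussianMixture(n_components=5, random_state=0).fit(acts_b)
za, zb = gmm_a.predict(acts_a), gmm_b.predict(acts_b)

transition = np.zeros((5, 5))
np.add.at(transition, (za, zb), 1.0)                                   # co-occurrence counts
transition /= np.maximum(transition.sum(axis=1, keepdims=True), 1e-9)  # rows: P(cluster_b | cluster_a)
print(transition.round(2))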



Paperid:1107
Authors:Liangcheng Li, Feiyu Gao, Jiajun Bu, Yongpan Wang, Zhi Yu, Qi Zheng
Title: An End-to-End OCR Text Re-organization Sequence Learning for Rich-text Detail Image Comprehension
Abstract:
Nowadays, rich descriptions on detail images help users learn more about commodities. With the help of OCR technology, the description text can be detected and recognized as auxiliary information to remove comprehension barriers for visually impaired users. However, for lack of a proper logical structure among these OCR text blocks, it is challenging to comprehend the detail images accurately. To tackle this problem, we propose a novel end-to-end OCR text reorganizing model. Specifically, we create a Graph Neural Network with an attention map to encode the text blocks with visual layout features, after which an attention-based sequence decoder inspired by the Pointer Network, together with Sinkhorn global optimization, reorders the OCR text into a proper sequence. Experimental results illustrate that our model outperforms the other baselines, and a real experiment on blind users' experience shows that our model improves their comprehension."



Paperid:1108
Authors:Yang Bai, Yuyuan Zeng, Yong Jiang, Yisen Wang, Shu-Tao Xia, Weiwei Guo
Title: Improving Query Efficiency of Black-box Adversarial Attack
Abstract:
Deep neural networks (DNNs) have demonstrated excellent performance on various tasks; however, they are at risk from adversarial examples that can be easily generated when the target model is accessible to an attacker (white-box setting). As plenty of machine learning models have been deployed via online services that only provide query outputs from inaccessible models (e.g., Google Cloud Vision API), black-box adversarial attacks (with an inaccessible target model) are of more critical security concern in practice than white-box ones. However, existing query-based black-box adversarial attacks often require excessive model queries to maintain a high attack success rate. Therefore, in order to improve query efficiency, we explore the distribution of adversarial examples around benign inputs with the help of image structure information characterized by a Neural Process, and propose a Neural Process based black-box adversarial attack (NP-Attack) in this paper. Extensive experiments show that NP-Attack greatly decreases the query counts while achieving the highest attack success rate under the black-box setting."



Paperid:1109
Authors:Hsien-Tzu Cheng, Chun-Fu Yeh, Po-Chen Kuo, Andy Wei, Keng-Chi Liu, Mong-Chi Ko, Kuan-Hua Chao, Yu-Ching Peng, Tyng-Luh Liu
Title: Self-similarity Student for Partial Label Histopathology Image Segmentation
Abstract:
Delineation of cancerous regions in gigapixel whole slide images (WSIs) is a crucial diagnostic procedure in digital pathology. This process is time-consuming because of the large search space in the gigapixel WSIs, causing chances of omission and misinterpretation at indistinct tumor lesions. To tackle this, the development of an automated cancerous region segmentation method is imperative. We frame this issue as a modeling problem with partial-label WSIs, where some cancerous regions may be misclassified as benign and vice versa, producing patches with noisy labels. To learn from these patches, we propose Self-similarity Student, combining the teacher-student paradigm with similarity learning. Specifically, for each patch, we first sample its similar and dissimilar patches according to spatial distance. A teacher-student model is then introduced, featuring exponential moving averages of both the student model weights and the teacher prediction ensemble. While our student model takes patches, the teacher model takes all their corresponding similar and dissimilar patches to learn representations robust to noisy label patches. Following this similarity learning, our similarity ensemble merges similar patches’ ensembled predictions as the pseudo-label of a given patch to counteract its noisy label. On the CAMELYON16 dataset, our method substantially outperforms state-of-the-art noise-aware learning methods by 5% and the supervised-trained baseline by 10% under various degrees of noise. Moreover, our method is superior to the baseline on our TVGH TURP dataset with a 2% improvement, demonstrating the generalizability to more clinical histopathology segmentation tasks."
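For readers unfamiliar with the two exponential-moving-average (EMA) updates mentioned above, here is a minimal sketch of how they are commonly implemented; the momentum values and the prediction-bank layout are assumptions, not the paper's exact settings.

```python
# Sketch of (1) EMA update of teacher weights from the student and
# (2) EMA update of a running bank of teacher predictions per patch.
import torch

@torch.no_grad()
def ema_update_teacher(teacher, student, momentum=0.99):
    """Move each teacher parameter towards the corresponding student parameter."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

@torch.no_grad()
def ema_update_predictions(pred_bank, patch_ids, new_probs, momentum=0.9):
    """pred_bank: (num_patches, num_classes) running teacher predictions."""
    pred_bank[patch_ids] = momentum * pred_bank[patch_ids] + (1.0 - momentum) * new_probs
    return pred_bank
```

The smoothed prediction bank can then be averaged over a patch's similar neighbours to form its pseudo-label.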



Paperid:1110
Authors:Arslan Ali, Matteo Testa, Tiziano Bianchi, Enrico Magli
Title: BioMetricNet: deep unconstrained face verification through learning of metrics regularized onto Gaussian distributions
Abstract:
We present BioMetricNet: a novel framework for deep unconstrained face verification which learns a regularized metric to compare facial features. Differently from popular methods such as FaceNet, the proposed approach does not impose any specific metric on facial features; instead, it shapes the decision space by learning a latent representation in which matching and non-matching pairs are mapped onto clearly separated and well-behaved target distributions. In particular, the network jointly learns the best feature representation, and the best metric that follows the target distributions, to be used to discriminate face images. In this paper we present this general framework, the first of its kind for facial verification, and tailor it to Gaussian distributions. This choice enables the use of a simple linear decision boundary that can be tuned to achieve the desired trade-off between false alarm and genuine acceptance rate, and leads to a loss function that can be written in closed form. Extensive analysis and experimentation on publicly available datasets such as Labeled Faces in the Wild (LFW), YouTube Faces (YTF), Celebrities in Frontal-Profile in the Wild (CFP), and challenging datasets like cross-age LFW (CALFW), cross-pose LFW (CPLFW), and the In-the-wild Age Dataset (AgeDB) show a significant performance improvement and confirm the effectiveness and superiority of BioMetricNet over existing state-of-the-art methods."



Paperid:1111
Authors:Zhetong Liang, Shi Guo, Hong Gu, Huaqi Zhang, Lei Zhang
Title: A Decoupled Learning Scheme for Real-world Burst Denoising from Raw Images
Abstract:
The recently developed burst denoising approach, which reduces noise by using multiple frames captured in a short time, has demonstrated much better denoising performance than its single-frame counterparts. However, existing learning based burst denoising methods are limited by two factors. On one hand, most of the models are trained on video sequences with synthetic noise. When applied to real-world raw image sequences, visual artifacts often appear due to the different noise statistics. On the other hand, there is no real-world burst denoising benchmark of dynamic scenes, because generating clean ground truth is very difficult in the presence of object motions. In this paper, a novel multi-frame CNN model is carefully designed, which decouples the learning of motion from the learning of noise statistics. Consequently, an alternating learning algorithm is developed to learn how to align adjacent frames from a synthetic noisy video dataset, and to learn to adapt to the raw noise statistics from real-world noisy datasets of static scenes. Finally, the trained model can be applied to real-world dynamic sequences for burst denoising. Extensive experiments on both synthetic video datasets and real-world dynamic sequences demonstrate the leading burst denoising performance of our proposed method."



Paperid:1112
Authors:Yunjae Jung, Donghyeon Cho, Sanghyun Woo, In So Kweon
Title: Global-and-Local Relative Position Embedding for Unsupervised Video Summarization
Abstract:
In order to summarize a video properly, it is important to grasp the sequential structure of the video as well as the long-term dependencies between frames; this is especially necessary for unsupervised learning. One possible solution is to utilize a well-known technique in the field of natural language processing for long-term dependency and sequential property: self-attention with relative position embedding (RPE). However, compared to natural language processing, video summarization requires capturing a much longer global context. In this paper, we therefore present a novel input decomposition strategy, which samples the input both globally and locally. This provides an effective temporal window for RPE to operate in and improves overall computational efficiency significantly. By combining Global-and-Local input decomposition and RPE together, we come up with GL-RPE. Our approach allows the network to capture both local and global interdependencies between video frames effectively. Since GL-RPE can be easily integrated into existing methods, we apply it to two different unsupervised backbones. We provide extensive ablation studies and visual analysis to verify the effectiveness of the proposals. We demonstrate our approach achieves new state-of-the-art performance using the recently proposed rank order-based metrics: Kendall's $\tau$ and Spearman's $\rho$. Furthermore, although our method is unsupervised, it performs on par with the fully-supervised method."



Paperid:1113
Authors:Jaesung Rim, Haeyun Lee, Jucheol Won, Sunghyun Cho
Title: Real-World Blur Dataset for Learning and Benchmarking Deblurring Algorithms
Abstract:
Numerous learning-based approaches to single image deblurring for camera and object motion blurs have recently been proposed. To generalize such approaches to real-world blurs, large datasets of real blurred images and their ground truth sharp images are essential. However, no such datasets exist yet, so all existing approaches resort to synthetic ones, which leads to failure when deblurring real-world images. In this work, we present a large-scale dataset of real-world blurred images and ground truth sharp images for learning and benchmarking single image deblurring methods. To collect our dataset, we build an image acquisition system to simultaneously capture geometrically aligned pairs of blurred and sharp images, and develop a postprocessing method to produce high-quality ground truth images. We analyze the effect of our postprocessing method and the performance of existing deblurring methods. Our analysis shows that our dataset significantly improves deblurring quality for real-world blurred images."



Paperid:1114
Authors:Qing Guo, Xiaofei Xie, Felix Juefei-Xu, Lei Ma, Zhongguo Li, Wanli Xue, Wei Feng, Yang Liu
Title: SPARK: Spatial-aware Online Incremental Attack Against Visual Tracking
Abstract:
Adversarial attacks on deep neural networks have been intensively studied on image, audio, natural language, patch, and pixel classification tasks. Nevertheless, as a typical yet important real-world application, adversarial attacks on online video object tracking, which traces an object's moving trajectory instead of its category, are rarely explored. In this paper, we identify a new task for the adversarial attack on visual tracking: online generating imperceptible perturbations that mislead trackers along an incorrect (Untargeted Attack, UA) or a specified trajectory (Targeted Attack, TA). To this end, we first propose a spatial-aware basic attack by adapting existing attack methods, i.e., FGSM, BIM, and C&W, and comprehensively analyze the attacking performance. We identify that online object tracking poses two new challenges: 1) it is difficult to generate imperceptible perturbations that can transfer across frames, and 2) real-time trackers require the attack to satisfy a certain level of efficiency. To address these challenges, we further propose the spatial-aware online incremental attack (a.k.a. SPARK) that performs spatial-temporal sparse incremental perturbations online and makes the adversarial attack less perceptible. In addition, as an optimization-based method, SPARK quickly converges to very small losses within several iterations by considering historical incremental perturbations, making it much more efficient than basic attacks. The in-depth evaluation of state-of-the-art trackers (i.e., SiamRPN++ with AlexNet, MobileNetv2, and ResNet-50, and SiamDW) on OTB100, VOT2018, UAV123, and LaSOT demonstrates the effectiveness and transferability of SPARK in misleading the trackers under both UA and TA with minor perturbations."



Paperid:1115
Authors:Zhujun Xu, Emir Hrustic, Damien Vivet
Title: CenterNet Heatmap Propagation for Real-time Video Object Detection
Abstract:
Existing methods for video object detection mainly depend on two-stage image object detectors. The fact that two-stage detectors are generally slow makes them difficult to apply in real-time scenarios. Moreover, directly adapting existing methods to a one-stage detector is inefficient or infeasible. In this work, we introduce a method based on a one-stage detector called CenterNet. We propagate previous reliable long-term detections in the form of heatmaps to boost the results of the upcoming image. Our method achieves online real-time performance on the ImageNet VID dataset with 76.7% mAP at 37 FPS and offline performance of 78.4% mAP at 34 FPS."



Paperid:1116
Authors:Youwei Pang, Lihe Zhang, Xiaoqi Zhao, Huchuan Lu
Title: Hierarchical Dynamic Filtering Network for RGB-D Salient Object Detection
Abstract:
The central question in RGB-D salient object detection (SOD) is how to better integrate and utilize cross-modal fusion information. In this paper, we explore these issues from a new perspective. We integrate the features of different modalities through densely connected structures and use their mixed features to generate dynamic filters with receptive fields of different sizes. Finally, we implement a more flexible and efficient multi-scale cross-modal feature processing scheme, i.e., the dynamic dilated pyramid module. In order to make the predictions have sharper edges and consistent saliency regions, we design a hybrid enhanced loss function to further optimize the results. This loss function is also validated to be effective in the single-modal RGB SOD task. In terms of six metrics, the proposed method outperforms twelve existing methods on eight challenging benchmark datasets. A large number of experiments verify the effectiveness of the proposed module and loss function. Our code, model and results are available at https://github.com/lartpang/HDFNet."



Paperid:1117
Authors:Tony Ng, Vassileios Balntas, Yurun Tian, Krystian Mikolajczyk
Title: SOLAR: Second-Order Loss and Attention for Image Retrieval
Abstract:
Recent works in deep learning have shown that second-order information is beneficial in many computer-vision tasks. Second-order information can be enforced both in the spatial context and the abstract feature dimensions. In this work, we explore two second-order components. One is focused on second-order spatial information to increase the performance of image descriptors, both local and global. It is used to re-weight feature maps, and thus emphasise salient image locations that are subsequently used for description. The second component is concerned with a second-order similarity (SOS) loss, which we extend to global descriptors for image retrieval, and which is used to enhance the triplet loss with hard-negative mining. We validate our approach on two different tasks and datasets for image retrieval and image matching. The results show that our two second-order components complement each other, bringing significant performance improvements in both tasks and leading to state-of-the-art results across the public benchmarks. Code available at: http://github.com/tonyngjichun/SOLAR"
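To make the second-order similarity idea concrete, here is a hedged sketch of an SOS-style regulariser for global descriptors: it encourages the anchor and positive to have matching distances to the sampled negatives. The batch construction, descriptor dimensions, and exact weighting are assumptions, not the paper's implementation.

```python
# Sketch of a second-order similarity regulariser over L2-normalised descriptors.
import torch
import torch.nn.functional as F

def sos_regulariser(anchor, positive, negatives):
    """anchor, positive: (B, D); negatives: (B, N, D); all L2-normalised."""
    d_an = torch.cdist(anchor.unsqueeze(1), negatives).squeeze(1)    # (B, N) anchor-negative distances
    d_pn = torch.cdist(positive.unsqueeze(1), negatives).squeeze(1)  # (B, N) positive-negative distances
    return ((d_an - d_pn) ** 2).sum(dim=1).sqrt().mean()

a = F.normalize(torch.randn(4, 128), dim=1)
p = F.normalize(torch.randn(4, 128), dim=1)
n = F.normalize(torch.randn(4, 5, 128), dim=2)
loss = sos_regulariser(a, p, n)   # typically added to a first-order triplet loss
```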



Paperid:1118
Authors:Guolei Sun, Salman Khan, Wen Li, Hisham Cholakkal, Fahad Shahbaz Khan, Luc Van Gool
Title: Fixing Localization Errors to Improve Image Classification
Abstract:
Deep neural networks are generally considered black-box models that offer less interpretability for their decision process. To address this limitation, Class Activation Map (CAM) provides an attractive solution that visualizes class-specific discriminative regions in an input image. The remarkable ability of CAMs to locate class discriminating regions has been exploited in weakly-supervised segmentation and localization tasks. In this work, we explore a new direction towards the possible use of CAM in the deep network learning process. We note that such visualizations lend insights into the workings of deep CNNs and could be leveraged to introduce additional constraints during the learning stage. Specifically, the CAMs for negative classes (negative CAMs) often have false activations even though those classes are absent from an image. We therefore propose a loss function, called the Homogeneous Negative CAM loss, that seeks to minimize peaks within the negative CAMs. This way, in an effort to fix localization errors, our loss provides an extra supervisory signal that helps the model to better discriminate between similar classes. Our designed loss function is easy to implement and can be readily integrated into existing DNNs. We evaluate it on a number of classification tasks including large-scale recognition, multi-label classification and fine-grained recognition. Our loss provides better performance compared to other loss functions across the studied tasks. Additionally, we show that the proposed loss function provides higher robustness against adversarial attacks and noisy labels."
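As an illustration only (the authors' exact formulation may differ), one way to penalise peaks in negative CAMs is to push the maximum activation of each absent class's map towards that map's mean, as sketched below; the peak-minus-mean form and the masking scheme are assumptions.

```python
# Sketch: suppress peaks in class activation maps of absent (negative) classes.
import torch

def negative_cam_loss(cams, labels):
    """cams: (B, C, H, W) class activation maps; labels: (B, C) multi-hot ground truth."""
    b, c, h, w = cams.shape
    flat = cams.view(b, c, h * w)
    peaks = flat.max(dim=2).values          # strongest activation per class map
    means = flat.mean(dim=2)                # average activation per class map
    neg_mask = (labels == 0).float()        # only absent classes contribute
    return ((peaks - means) * neg_mask).sum() / neg_mask.sum().clamp(min=1.0)
```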



Paperid:1119
Authors:Lisa Mais, Peter Hirsch and Dagmar Kainmueller
Title: PatchPerPix for Instance Segmentation
Abstract:
In this paper we present a novel method for proposal free instance segmentation that can handle sophisticated object shapes that span large parts of an image and form dense object clusters with crossovers. Our method is based on predicting dense local shape descriptors, which we assemble to form instances. All instances are assembled simultaneously in one go. To our knowledge, our method is the first non-iterative method that yields instances that are composed of learnt shape patches. We evaluate our method on a diverse range of data domains, where it defines the new state of the art on four benchmarks, namely the ISBI 2012 EM segmentation benchmark, the BBBC010 C. elegans dataset, and 2d as well as 3d fluorescence microscopy datasets of cell nuclei. We show furthermore that our method also applies to 3d light microscopy data of Drosophila neurons, which exhibit extreme cases of complex shape clusters."



Paperid:1120
Authors:Soroush Seifi, Tinne Tuytelaars
Title: Attend and Segment: Attention Guided Active Semantic Segmentation
Abstract:
In a dynamic environment, an agent with a limited field of view/resource cannot fully observe the scene before attempting to parse it. The deployment of common semantic segmentation architectures is not feasible in such settings. In this paper we propose a method to gradually segment a scene given a sequence of partial observations. The main idea is to refine an agent's understanding of the environment by attending to the areas it is most uncertain about. Our method includes a self-supervised attention mechanism and a specialized architecture to maintain and exploit spatial memory maps for filling-in the unseen areas in the environment. The agent can select and attend to an area while relying on the cues coming from the visited areas to hallucinate the other parts. We reach a mean pixel-wise accuracy of 78.1%, 80.9% and 76.5% on the CityScapes, CamVid, and KITTI datasets by processing only 18% of the image pixels (10 retina-like glimpses). We perform an ablation study on the number of glimpses, input image size and effectiveness of retina-like glimpses. We compare our method to several baselines and show that the optimal results are achieved by having access to a very low resolution view of the scene at the first timestep."



Paperid:1121
Authors:Xucheng Ye, Pengcheng Dai, Junyu Luo, Xin Guo, Yingjie Qi, Jianlei Yang, Yiran Chen
Title: Accelerating CNN Training by Pruning Activation Gradients
Abstract:
Sparsification is an efficient approach to accelerate CNN inference, but it is challenging to take advantage of sparsity in the training procedure because the involved gradients change dynamically. A key observation is that most of the activation gradients in back-propagation are very close to zero and have only a tiny impact on weight updating. Hence, we consider randomly pruning these very small gradients to accelerate CNN training according to the statistical distribution of activation gradients. Meanwhile, we theoretically analyze the impact of the pruning algorithm on convergence. The proposed approach is evaluated on AlexNet and ResNet-{18, 34, 50, 101, 152} with the CIFAR-{10, 100} and ImageNet datasets. Experimental results show that our training approach achieves up to $5.92\times$ speedups at the back-propagation stage with negligible accuracy loss."
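A common way to prune small gradients randomly while keeping the estimate unbiased in expectation is stochastic pruning against a threshold, sketched below under an assumed, fixed threshold tau (the paper derives the threshold from the gradient distribution; that step is omitted here).

```python
# Sketch of stochastic pruning of small activation gradients: values below tau are
# either zeroed or pushed up to tau with probability |g|/tau, so E[output] = g.
import torch

def stochastic_prune(grad, tau):
    small = grad.abs() < tau
    keep_prob = grad.abs() / tau                      # survival probability for small gradients
    survive = torch.rand_like(grad) < keep_prob
    pruned = torch.where(survive, tau * grad.sign(), torch.zeros_like(grad))
    return torch.where(small, pruned, grad)           # large gradients pass through unchanged

g = torch.randn(1024) * 1e-3
sparse_g = stochastic_prune(g, tau=1e-3)              # mostly zeros, unbiased on average
```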



Paperid:1122
Authors:Han-Ul Kim, Young Jun Koh, Chang-Su Kim
Title: Global and Local Enhancement Networks for Paired and Unpaired Image Enhancement
Abstract:
A novel approach for paired and unpaired image enhancement is proposed in this work. First, we develop a global enhancement network (GEN) and a local enhancement network (LEN), which can faithfully enhance images. The proposed GEN performs channel-wise intensity transforms that can be trained more easily than pixel-wise prediction. The proposed LEN refines GEN results based on spatial filtering. Second, we propose different training schemes for paired learning and unpaired learning to train GEN and LEN. In particular, we propose a two-stage training scheme based on generative adversarial networks for unpaired learning. Experimental results demonstrate that the proposed algorithm outperforms state-of-the-art methods in paired and unpaired image enhancement. Notably, the proposed unpaired image enhancement algorithm provides better results than recent state-of-the-art paired image enhancement algorithms. The source codes and trained models are available at https://github.com/hukim1124/GleNet."



Paperid:1123
Authors:Kang Kim, Hee Seok Lee
Title: Probabilistic Anchor Assignment with IoU Prediction for Object Detection
Abstract:
In object detection, determining which anchors to assign as positive or negative samples, known as anchor assignment, has been revealed as a core procedure that can significantly affect the model's performance. In this paper we propose a novel anchor assignment strategy that adaptively separates anchors into positive and negative samples for a ground truth bounding box according to the model's learning status, such that the separation can be reasoned about in a probabilistic manner. To do so we first calculate the scores of anchors conditioned on the model's learning status and fit these scores to a probability distribution. The model is then trained with anchors separated into positive and negative samples according to their probabilities. Moreover, we investigate the gap between the objectives of training and testing and propose to predict the Intersection-over-Union of detected boxes as a measure of localization quality to reduce the discrepancy. The combined score of classification and localization qualities, serving as a box selection metric in non-maximum suppression, aligns well with the proposed anchor assignment strategy and leads to significant performance improvements. The proposed methods only add a single convolutional layer to the RetinaNet baseline and do not require multiple anchors per location, so they are efficient. Experimental results verify the effectiveness of the proposed methods. In particular, our models set new records for single-stage detectors on the MS COCO test-dev dataset for various backbones."
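A rough sketch of the probabilistic separation step is given below: per ground-truth box, candidate anchor scores are fitted with a two-component 1-D Gaussian mixture and anchors assigned to the higher-mean component are treated as positives. The use of scikit-learn, the score definition, and the two-component choice are illustrative assumptions.

```python
# Sketch: split candidate anchors into positives/negatives by fitting a 1-D GMM
# to their scores and keeping the component with the higher mean as positive.
import numpy as np
from sklearn.mixture import GaussianMixture

def split_anchors(scores):
    """scores: (num_candidate_anchors,) combined cls+loc scores for one GT box."""
    gmm = GaussianMixture(n_components=2, random_state=0).fit(scores[:, None])
    comp = gmm.predict(scores[:, None])
    pos_comp = int(np.argmax(gmm.means_.ravel()))
    return comp == pos_comp            # boolean mask of positive anchors

pos_mask = split_anchors(np.random.rand(50))
```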



Paperid:1124
Authors:Yating Wang, Quan Wang, Feng Xu
Title: Eyeglasses 3D shape reconstruction from a single face image
Abstract:
A complete 3D face reconstruction requires explicitly modeling the eyeglasses on the face, which is less investigated in the literature. In this paper, we present an automatic system that recovers the 3D shape of eyeglasses from a single face image with an arbitrary head pose. To achieve this goal, our system first trains a neural network to jointly perform glasses landmark detection and segmentation, which carry the sparse and dense glasses information, respectively, for 3D glasses pose estimation and shape recovery. To resolve the ambiguity in 2D-to-3D reconstruction, our system fully explores prior knowledge, including the relative motion constraint between face and glasses and the planar and symmetric shape priors of glasses. Qualitative and quantitative experiments show that our system reconstructs promising 3D shapes of eyeglasses for various poses."



Paperid:1125
Authors:Ruibing Hou, Hong Chang, Bingpeng Ma, Shiguang Shan, Xilin Chen
Title: Temporal Complementary Learning for Video Person Re-Identification
Abstract:
This paper proposes a Temporal Complementary Learning Network that extracts complementary features of consecutive video frames for video person re-identification. Firstly, we introduce a Temporal Saliency Erasing (TSE) module including a saliency erasing operation and a series of ordered learners. Specifically, for a specific frame of a video, the saliency erasing operation drives the corresponding learner to mine new and complementary parts by erasing the parts activated by previous frames. In this way, diverse visual features can be discovered for consecutive frames and finally form an integral characteristic of the target identity. Furthermore, a Temporal Saliency Boosting (TSB) module is designed to propagate the salient information among video frames to enhance the salient features. It is complementary to TSE by effectively alleviating the information loss caused by the erasing operation of TSE. Extensive experiments show our method performs favorably against state-of-the-art methods. The source code is available at https://github.com/blue-blue272/VideoReID-TCLNet."



Paperid:1126
Authors:Nermin Samet, Samet Hicsonmez, Emre Akbas
Title: HoughNet: Integrating near and long-range evidence for bottom-up object detection
Abstract:
This paper presents HoughNet, a one-stage, anchor-free, voting-based, bottom-up object detection method. Inspired by the Generalized Hough Transform, HoughNet determines the presence of an object at a certain location by the sum of the votes cast on that location. Votes are collected from both near and long-distance locations based on a log-polar vote field. Thanks to this voting mechanism, HoughNet is able to integrate both near and long-range, class-conditional evidence for visual recognition, thereby generalizing and enhancing current object detection methodology, which typically relies on only local evidence. On the COCO dataset, HoughNet’s best model achieves $46.4$ $AP$ (and $65.1$ $AP_{50}$), performing on par with the state-of-the-art in bottom-up object detection and outperforming most major one-stage and two-stage methods. We further validate the effectiveness of our proposal in another task, namely "labels to photo" image generation, by integrating the voting module of HoughNet into two different GAN models and showing that the accuracy is significantly improved in both cases. Code is available at https://github.com/nerminsamet/houghnet."



Paperid:1127
Authors:Xueya Zhang, Tong Zhang, Xiaobin Hong, Zhen Cui, Jian Yang
Title: Graph Wasserstein Correlation Analysis for Movie Retrieval
Abstract:
Movie graphs play an important role in bridging the heterogeneous modalities of videos and texts in human-centric retrieval. In this work, we propose Graph Wasserstein Correlation Analysis (GWCA) to deal with the core issue therein, i.e., comparison across heterogeneous graphs. Spectral graph filtering is introduced to encode graph signals, which are then embedded as probability distributions in a Wasserstein space, called graph Wasserstein metric learning. Such a seamless integration of graph signal filtering with metric learning results in a surprising consistency between the two learning processes, in which the goal of metric learning is precisely to optimize the signal filters, and vice versa. Further, we derive the solution of the graph comparison model as a classic generalized eigenvalue decomposition problem, which has an exact closed-form solution. Finally, GWCA together with movie/text graph generation is unified into a movie retrieval framework to evaluate our proposed method. Extensive experiments on the MovieGraph dataset demonstrate the effectiveness of our GWCA as well as the entire framework."



Paperid:1128
Authors:Jianchao Wu, Zhanghui Kuang, Limin Wang, Wayne Zhang, Gangshan Wu
Title: Context-Aware RCNN: A Baseline for Action Detection in Videos
Abstract:
Video action detection approaches usually conduct actor-centric action recognition over RoI-pooled features, following the standard pipeline of Faster-RCNN. In this work, we first empirically find that the recognition accuracy is highly correlated with the bounding box size of an actor, and thus higher resolution of actors contributes to better performance. However, video models require dense sampling in time to achieve accurate recognition. To fit in GPU memory, the frames fed to the backbone network must be kept at low resolution, resulting in a coarse feature map in the RoI-Pooling layer. Thus, we revisit RCNN for actor-centric action recognition by cropping and resizing image patches around actors before feature extraction with an I3D deep network. Moreover, we find that slightly expanding actor bounding boxes and fusing the context features can further boost the performance. Consequently, we develop a surprisingly effective baseline (Context-Aware RCNN) that achieves new state-of-the-art results on two challenging action detection benchmarks, AVA and JHMDB. Our observations challenge the conventional wisdom of the RoI-Pooling based pipeline and encourage researchers to rethink the importance of resolution in actor-centric action recognition. Our approach can serve as a strong baseline for video action detection and is expected to inspire new ideas for this field. The code is available at https://github.com/MCG-NJU/CRCNN-Action."



Paperid:1129
Authors:Ning Li, Yongqiang Zhao, Quan Pan, Seong G. Kong, Jonathan Cheung-Wai Chan
Title: Full-Time Monocular Road Detection Using Zero-Distribution Prior of Angle of Polarization
Abstract:
This paper presents a road detection technique based on long-wave infrared (LWIR) polarization imaging for autonomous navigation regardless of illumination conditions, day and night. Division of Focal Plane (DoFP) imaging technology enables acquisition of infrared polarization images in real time using a monocular camera. The zero-distribution prior embodies the zero-distribution of the Angle of Polarization (AoP) of a road scene image, which provides a significant contrast between the road and the background. This paper combines the zero-distribution of AoP, the difference of Degree of linear Polarization (DoP), and the edge information to segment the road region in the scene. We developed a LWIR DoFP Dataset of Road Scene (LDDRS) consisting of 2,113 annotated images. Experimental results on the LDDRS dataset demonstrate the merits of the proposed road detection method based on the zero-distribution prior. The LDDRS dataset is available at https://github.com/polwork/LDDRS"



Paperid:1130
Authors:Haoxian Zhang, Yang Zhao, Ronggang Wang
Title: A Flexible Recurrent Residual Pyramid Network for Video Frame Interpolation
Abstract:
Video frame interpolation (VFI) aims at synthesizing new video frames in between existing frames to generate smoother, higher frame rate videos. Current methods usually use fixed pre-trained networks to generate interpolated frames for different resolutions and scenes. However, fixed pre-trained networks are difficult to tailor to a variety of cases. Inspired by classical pyramid energy minimization optical flow algorithms, this paper proposes a recurrent residual pyramid network (RRPN) for video frame interpolation. In the proposed network, different pyramid levels share the same weights and base network, named the recurrent residual layer (RRL). In RRL, residual displacements between warped images are detected to gradually refine optical flows rather than directly predict the flows or frames. Owing to the flexible recurrent residual pyramid architecture, we can customize the number of pyramid levels and make trade-offs between computation and quality based on the application scenario. Moreover, occlusion masks are also generated in this recurrent residual way to handle occlusions better. Finally, a refinement network is added to enhance the details of the final output with contextual and edge information. Experimental results demonstrate that RRPN is more flexible and efficient than current VFI networks while having fewer parameters. In particular, RRPN, which avoids over-reliance on datasets and network structures, shows superior performance for large motion cases."



Paperid:1131
Authors:Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Ming-Hsuan Yang, Ling Shao
Title: Learning Enriched Features for Real Image Restoration and Enhancement
Abstract:
With the goal of recovering high-quality image content from its degraded version, image restoration enjoys numerous applications, such as in surveillance, computational photography and medical imaging. Recently, convolutional neural networks (CNNs) have achieved dramatic improvements over conventional approaches for image restoration tasks. Existing CNN-based methods typically operate either on full-resolution or on progressively low-resolution representations. In the former case, spatially precise but contextually less robust results are achieved, while in the latter case, semantically reliable but spatially less accurate outputs are generated. In this paper, we present an architecture with the collective goals of maintaining spatially-precise high-resolution representations through the entire network and receiving strong contextual information from the low-resolution representations. The core of our approach is a multi-scale residual block containing several key elements: (a) parallel multi-resolution convolution streams for extracting multi-scale features, (b) information exchange across the multi-resolution streams, (c) spatial and channel attention mechanisms for capturing contextual information, and (d) attention based multi-scale feature aggregation. In a nutshell, our approach learns an enriched set of features that combines contextual information from multiple scales, while simultaneously preserving the high-resolution spatial details. Extensive experiments on five real image benchmark datasets demonstrate that our method, named MIRNet, achieves state-of-the-art results for image denoising, super-resolution, and image enhancement. The source code and pre-trained models are available at https://github.com/swz30/MIRNet."



Paperid:1132
Authors:Wenxiao Zhang, Qingan Yan, Chunxia Xiao
Title: Detail Preserved Point Cloud Completion via Separated Feature Aggregation
Abstract:
Point cloud shape completion is a challenging problem in 3D vision and robotics. Existing learning-based frameworks leverage encoder-decoder architectures to recover the complete shape from a compactly encoded global feature vector. Though the global feature can approximately represent the overall latent shape, it discards some details of the original partial shape during the completion process. In this work, instead of using a global feature to recover the whole complete surface, we explore multi-level features by hierarchical feature learning and represent the existing part and the missing part, respectively. We propose two different feature aggregation strategies, named global & local feature aggregation (GLFA) and residual feature aggregation (RFA), to express the features of the two parts and reconstruct coordinates from the combined features. In addition, we also design a refinement component to prevent the generated points from non-uniform distribution and outliers. Extensive experiments have been conducted on the ShapeNet and KITTI datasets. Qualitative and quantitative evaluations demonstrate that our proposed network is more capable of both preserving the original details and recovering the missing parts."



Paperid:1133
Authors:Miao Hao, Yitao Liu, Xiangyu Zhang, Jian Sun
Title: LabelEnc: A New Intermediate Supervision Method for Object Detection
Abstract:
In this paper we propose a new intermediate supervision method, named LabelEnc, to boost the training of object detection systems. The key idea is to introduce a novel label encoding function, mapping the ground-truth labels into a latent embedding, which acts as an auxiliary intermediate supervision to the detection backbone during training. Our approach mainly involves a two-step training procedure. First, we optimize the label encoding function via an AutoEncoder defined in the label space, approximating the "desired" intermediate representations for the target object detector. Second, taking advantage of the learned label encoding function, we introduce a new auxiliary loss attached to the detection backbones, thus benefiting the performance of the derived detector. Experiments show our method improves a variety of detection systems by around 2% on the COCO dataset, whether one-stage or two-stage frameworks. Moreover, the auxiliary structures only exist during training, i.e., they are completely cost-free at inference time."
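As a minimal sketch of the second step, an auxiliary loss can pull backbone features towards the latent embedding produced by the (frozen) label encoding function and be added to the usual detection loss; the L2 form, the matching feature shapes, and the weighting factor below are assumptions rather than the paper's exact design.

```python
# Sketch: auxiliary intermediate supervision from a frozen label encoder.
import torch
import torch.nn.functional as F

def labelenc_aux_loss(backbone_feat, label_embedding):
    """backbone_feat, label_embedding: (B, C, H, W); the label encoder is kept frozen."""
    return F.mse_loss(backbone_feat, label_embedding.detach())

feat = torch.randn(2, 256, 32, 32, requires_grad=True)   # stand-in backbone feature
emb = torch.randn(2, 256, 32, 32)                          # stand-in label embedding
det_loss = torch.tensor(1.0)                               # placeholder detection loss
total_loss = det_loss + 1.0 * labelenc_aux_loss(feat, emb) # weighting factor assumed
```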



Paperid:1134
Authors:Clara Fernandez-Labrador, Ajad Chhatkuli, Danda Pani Paudel, Jose J. Guerrero, Cédric Demonceaux, Luc Van Gool
Title: Unsupervised Learning of Category-Specific Symmetric 3D Keypoints from Point Sets
Abstract:
Automatic discovery of category-specific 3D keypoints from a collection of objects of a category is a challenging problem. The difficulty increases when objects are represented by 3D point clouds, with variations in shape and semantic parts and unknown coordinate frames. We define keypoints to be category-specific if they meaningfully represent objects’ shape and their correspondences can be simply established order-wise across all objects. This paper aims at learning such 3D keypoints, in an unsupervised manner, using a collection of misaligned 3D point clouds of objects from an unknown category. In order to do so, we model shapes defined by the keypoints, within a category, using symmetric linear basis shapes without assuming the plane of symmetry to be known. The use of the symmetry prior leads us to learn stable keypoints suitable for higher misalignments. To the best of our knowledge, this is the first work on learning such keypoints directly from 3D point clouds for a general category. Using objects from four benchmark datasets, we demonstrate the quality of our learned keypoints by quantitative and qualitative evaluations. Our experiments also show that the keypoints discovered by our method are geometrically and semantically consistent."



Paperid:1135
Authors:Huixia Li, Chenqian Yan, Shaohui Lin, Xiawu Zheng, Baochang Zhang, Fan Yang, Rongrong Ji
Title: PAMS: Quantized Super-Resolution via Parameterized Max Scale
Abstract:
Deep convolutional neural networks (DCNNs) have shown dominant performance in the task of super-resolution (SR). However, their heavy memory cost and computation overhead significantly restrict their practical deployment on resource-limited devices, which mainly arises from the floating-point storage and operations between weights and activations. Although previous endeavors mainly resort to fixed-point operations, quantizing both weights and activations with fixed coding lengths may cause a significant performance drop, especially at low bits. Specifically, most state-of-the-art SR models without batch normalization have a large dynamic quantization range, which also serves as another cause of performance drop. To address these two issues, we propose a new quantization scheme termed PArameterized Max Scale (PAMS), which applies a trainable truncated parameter to adaptively explore the upper bound of the quantization range. Finally, a structured knowledge transfer (SKT) loss is introduced to fine-tune the quantized network. Extensive experiments demonstrate that the proposed PAMS scheme can well compress and accelerate the existing SR models such as EDSR and RDN. Notably, 8-bit PAMS-EDSR improves PSNR on the Set5 benchmark from 32.095dB to 32.124dB with a $2.42\times$ compression ratio, which achieves a new state-of-the-art."
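The core idea, a trainable upper bound on the quantization range, can be sketched as below with a straight-through estimator; the symmetric clamping range, bit-width handling, and initial value of the scale are illustrative assumptions and not the exact PAMS formulation.

```python
# Sketch: uniform quantisation with a trainable max scale (alpha) and a
# straight-through estimator so gradients pass through the rounding step.
import torch

class TrainableMaxScaleQuant(torch.nn.Module):
    def __init__(self, bits=8, init_alpha=6.0):
        super().__init__()
        self.levels = 2 ** bits - 1
        self.alpha = torch.nn.Parameter(torch.tensor(init_alpha))   # learnable upper bound

    def forward(self, x):
        x_c = torch.clamp(x, -self.alpha, self.alpha)                # truncate to [-alpha, alpha]
        step = 2 * self.alpha / self.levels
        x_q = torch.round(x_c / step) * step                         # uniform quantisation
        # Straight-through estimator: quantised values forward, identity gradient backward.
        return x_c + (x_q - x_c).detach()

q = TrainableMaxScaleQuant(bits=8)
y = q(torch.randn(4, 16))
```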



Paperid:1136
Authors:Xinge Zhu, Yuexin Ma, Tai Wang, Yan Xu, Jianping Shi, Dahua Lin
Title: SSN: Shape Signature Networks for Multi-class Object Detection from Point Clouds
Abstract:
Multi-class 3D object detection aims to classify and localize objects of multiple categories from point clouds. Due to the nature of point clouds, i.e. unstructured, sparse and noisy, some features benefitting multi-class discrimination are underexploited, such as shape information. In this paper, we propose a novel 3D shape signature to explore the shape information from point clouds. By incorporating operations of symmetry, convex hull and Chebyshev fitting, the proposed shape signature is not only compact and effective but also robust to noise, and it serves as a soft constraint to improve the feature capability of multi-class discrimination. Based on the proposed shape signature, we develop the shape signature networks (SSN) for 3D object detection, which consist of a pyramid feature encoding part, shape-aware grouping heads and an explicit shape encoding objective. Experiments show that the proposed method performs remarkably better than existing methods on two large-scale datasets. Furthermore, our shape signature can act as a plug-and-play component, and an ablation study shows its effectiveness and good scalability."



Paperid:1137
Authors:Liang Chen, Faming Fang, Jiawei Zhang, Jun Liu, Guixu Zhang
Title: OID: Outlier Identifying and Discarding in Blind Image Deblurring
Abstract:
Blind deblurring methods are sensitive to outliers, such as saturated pixels and non-Gaussian noise. Even a small number of outliers can dramatically degrade the quality of the estimated blur kernel, because the outliers do not conform to the linear formation of the blurring process. Prior arts develop sophisticated edge-selecting steps or noise filtering pre-processing steps to deal with outliers (i.e., indirect approaches). However, these indirect approaches may fail when massive outliers are present, since informative details may be polluted by outliers or erased during the pre-processing steps. To address these problems, this paper develops a simple yet effective Outlier Identifying and Discarding (OID) method, which alleviates the limitations of existing Maximum A Posteriori (MAP)-based deblurring models when significant outliers are present. Unlike previous indirect outlier processing methods, OID tackles outliers directly by explicitly identifying and discarding them when updating both the latent image and the blur kernel during the deblurring process, where the outliers are detected using sparse and entropy-based modules. OID is easy to implement and extendable to non-blind restoration. Extensive experiments demonstrate the superiority of OID against recent works both quantitatively and qualitatively."



Paperid:1138
Authors:Mateusz Michalkiewicz, Sarah Parisot, Stavros Tsogkas, Mahsa Baktashmotlagh, Anders Eriksson, Eugene Belilovsky
Title: Few-Shot Single-View 3-D Object Reconstruction with Compositional Priors
Abstract:
The impressive performance of deep convolutional neural networks in single-view 3D reconstruction suggests that these models perform non-trivial reasoning about the 3D structure of the output space. However, recent work has challenged this belief, showing that complex encoder-decoder architectures perform similarly to nearest-neighbor baselines or simple linear decoder models that exploit large amounts of per-category data in standard benchmarks. On the other hand, settings where 3D shape must be inferred for new categories with few examples are more natural and require models that generalize about shapes. In this work we demonstrate experimentally that naive baselines do not apply when the goal is to learn to reconstruct novel objects using very few examples, and that in a few-shot learning setting, the network must learn concepts that can be applied to new categories, avoiding rote memorization. To address deficiencies in existing approaches to this problem, we propose three approaches that efficiently integrate a class prior into a 3D reconstruction model, allowing us to account for intra-class variability and imposing an implicit compositional structure that the model should learn. Experiments on the popular ShapeNet database demonstrate that our method significantly outperforms existing baselines on this task in the few-shot setting."



Paperid:1139
Authors:Liang Chen, Faming Fang, Shen Lei, Fang Li, Guixu Zhang
Title: Enhanced Sparse Model for Blind Deblurring
Abstract:
Existing works have made promising efforts to deal with the blind deblurring task. However, most of the recent works assume the additive noise involved in the blurring process to be simply distributed (i.e. Gaussian or Laplacian), while the real-world case proves to be much more complicated. In this paper, we develop a new term to better fit the complex natural noise. Specifically, we use a weighted combination of a dense function (i.e. l2) and a newly designed enhanced sparse model termed le, which is developed from two sparse models (i.e. l1 and l0), to fulfill the task. Moreover, we further suggest using le to regularize image gradients. Compared to the widely adopted l0 sparse term, le can penalize more insignificant image details (Fig. 1). Based on the half-quadratic splitting method, we provide an effective scheme to optimize the overall formulation. Comprehensive evaluations on public datasets and real-world images demonstrate the superiority of the proposed method against state-of-the-art methods in terms of both speed and accuracy."



Paperid:1140
Authors:Jungin Park, Jiyoung Lee, Ig-Jae Kim, Kwanghoon Sohn
Title: SumGraph: Video Summarization via Recursive Graph Modeling
Abstract:
The goal of video summarization is to select keyframes that are visually diverse and can represent the whole story of an input video. State-of-the-art approaches for video summarization have mostly regarded the task as a frame-wise keyframe selection problem by aggregating all frames with equal weight. However, to find informative parts of the video, it is necessary to consider how all the frames of the video are related to each other. To this end, we cast video summarization as a graph modeling problem. We propose recursive graph modeling networks for video summarization, termed SumGraph, to represent a relation graph, where frames are regarded as nodes and nodes are connected by semantic relationships among frames. Our networks accomplish this through a recursive approach that refines an initially estimated graph to correctly classify each node as a keyframe by reasoning over the graph representation via graph convolutional networks. To leverage SumGraph in a more practical environment, we also present a way to adapt our graph modeling in an unsupervised fashion. With SumGraph, we achieved state-of-the-art performance on several benchmarks for video summarization in both supervised and unsupervised manners."



Paperid:1141
Authors:Kunran Xu, Lai Rui, Yishi Li, Lin Gu
Title: Feature Normalized Knowledge Distillation for Image Classification
Abstract:
Knowledge Distillation (KD) transfers the knowledge from a cumbersome teacher model to a lightweight student network. Since a single image may reasonably relate to several categories, the one-hot label inevitably introduces encoding noise. From this perspective, we systematically analyze the distillation mechanism and demonstrate that the L2-norm of the feature in the penultimate layer would be too large under the influence of label noise, and that the temperature T in KD can be regarded as a correction factor for the L2-norm to suppress the impact of noise. Noticing that different samples suffer from varying intensities of label noise, we further propose a simple yet effective feature normalized knowledge distillation which introduces a sample-specific correction factor to replace the unified temperature T, better reducing the impact of noise. Extensive experiments show that the proposed method surpasses standard KD as well as self-distillation significantly on the Cifar-100, CUB-200-2011 and Stanford Cars datasets. The code is available at https://github.com/aztc/FNKD"
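A minimal sketch of the idea, assuming the per-sample L2 norm of the penultimate feature is used as the correction factor in place of a single global temperature; the exact scaling and loss weighting in the paper may differ.

```python
# Sketch: KD loss with a sample-specific "temperature" derived from the
# L2 norm of the teacher's penultimate-layer feature.
import torch
import torch.nn.functional as F

def feature_normalized_kd(student_logits, teacher_logits, teacher_feat, base_t=1.0):
    """teacher_feat: (B, D) penultimate-layer features of the teacher."""
    t = base_t * teacher_feat.norm(p=2, dim=1, keepdim=True)   # per-sample correction factor
    p_teacher = F.softmax(teacher_logits / t, dim=1)
    log_p_student = F.log_softmax(student_logits / t, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")
```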



Paperid:1142
Authors:Kevin Musgrave, Serge Belongie, Ser-Nam Lim
Title: A Metric Learning Reality Check
Abstract:
Deep metric learning papers from the past four years have consistently claimed great advances in accuracy, often more than doubling the performance of decade-old methods. In this paper, we take a closer look at the field to see if this is actually true. We find flaws in the experimental setup of these papers, and propose a new way to evaluate metric learning algorithms. Finally, we present experimental results that show that the improvements over time have been marginal at best."



Paperid:1143
Authors:Kunyuan Du, Ya Zhang, Haibing Guan, Qi Tian, Shenggan Cheng, James Lin
Title: FTL: A universal framework for training low-bit DNNs via Feature Transfer
Abstract:
Low-bit Deep Neural Networks (low-bit DNNs) have recently received significant attention for their high efficiency. However, low-bit DNNs are often difficult to optimize due to the saddle points in their loss surfaces. Here we introduce a novel feature-based knowledge transfer framework, which utilizes a 32-bit DNN to guide the training of a low-bit DNN via feature maps. This is challenging because the feature maps from the two branches lie in continuous and discrete spaces, respectively, and such a mismatch has not been handled properly by existing feature transfer frameworks. In this paper, we propose to directly transfer the information-rich continuous-space features to the low-bit branch. To alleviate the negative impact brought by the feature quantizer during the transfer process, we make the two branches interact via a centered cosine distance rather than the widely used p-norms. Extensive experiments are conducted on Cifar10/100 and ImageNet. Compared with low-bit models trained directly, the proposed framework brings 0.5% to 3.4% accuracy gains to three different quantization schemes. Besides, the proposed framework can also be combined with other techniques, e.g. logits transfer, for further enhancement."
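Below is a hedged sketch of a feature-transfer loss based on a centered cosine distance between the full-precision and low-bit branches (per-sample mean subtraction before the cosine); how the feature maps are pooled or flattened and where the loss is applied in the network are assumptions.

```python
# Sketch: centered cosine distance between full-precision and low-bit features.
import torch
import torch.nn.functional as F

def centered_cosine_transfer(feat_fp, feat_lb):
    """feat_fp, feat_lb: (B, D) flattened feature maps from the two branches."""
    fp = feat_fp - feat_fp.mean(dim=1, keepdim=True)   # remove per-sample mean
    lb = feat_lb - feat_lb.mean(dim=1, keepdim=True)
    return (1.0 - F.cosine_similarity(fp, lb, dim=1)).mean()

loss = centered_cosine_transfer(torch.randn(8, 512), torch.randn(8, 512))
```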



Paperid:1144
Authors:Hao Tang, Song Bai, Li Zhang, Philip H.S. Torr, Nicu Sebe
Title: XingGAN for Person Image Generation
Abstract:
We propose a novel Generative Adversarial Network (XingGAN or CrossingGAN) for person image generation tasks, i.e., translating the pose of a given person to a desired one. The proposed Xing generator consists of two generation branches that model the person's appearance and shape information, respectively. Moreover, we propose two novel blocks to effectively transfer and update the person's shape and appearance embeddings in a crossing way to mutually improve each other, which has not been considered by any other existing GAN-based image generation work. Extensive experiments on two challenging datasets, i.e., Market-1501 and DeepFashion, demonstrate that the proposed XingGAN advances the state-of-the-art performance both in terms of objective quantitative scores and subjective visual realness. The source code and trained models are available at https://github.com/Ha0Tang/XingGAN."



Paperid:1145
Authors:Chuang Niu, Jun Zhang, Ge Wang, Jimin Liang
Title: GATCluster: Self-Supervised Gaussian-Attention Network for Image Clustering
Abstract:
We propose a self-supervised Gaussian ATtention network for image Clustering (GATCluster). Rather than extracting intermediate features first and then performing traditional clustering algorithms, GATCluster directly outputs semantic cluster labels without further post-processing. We give a Label Feature Theorem to guarantee that the learned features are one-hot encoded vectors and the trivial solutions are avoided. Based on this theorem, we design four self-learning tasks with the constraints of transformation invariance, separability maximization, entropy analysis, and attention mapping. Specifically, the transformation invariance and separability maximization tasks learn the relations between samples. The entropy analysis task aims to avoid trivial solutions. To capture the object-oriented semantics, we design a self-supervised attention mechanism that includes a Gaussian attention module and a soft-attention loss. Moreover, we design a two-step learning algorithm that is memory-efficient for clustering large-size images. Extensive experiments demonstrate the superiority of our proposed method in comparison with the state-of-the-art image clustering benchmarks."



Paperid:1146
Authors:Yi Wang, Ying-Cong Chen, Xin Tao, Jiaya Jia
Title: VCNet: A Robust Approach to Blind Image Inpainting
Abstract:
Blind inpainting is a task to automatically complete visual contents without specifying masks for missing areas in an image. Previous works assume missing region patterns are known, limiting their application scope. In this paper, we relax the assumption by defining a new blind inpainting setting, making training a blind inpainting neural system robust against various unknown missing region patterns. Specifically, we propose a two-stage visual consistency network (VCN), meant to estimate where to fill (via masks) and generate what to fill. In this procedure, the unavoidable potential mask prediction errors lead to severe artifacts in the subsequent repairing. To address this, our VCN predicts semantically inconsistent regions first, making mask prediction more tractable. Then it repairs these estimated missing regions using a new spatial normalization, enabling VCN to be robust to the mask prediction errors. In this way, semantically convincing and visually compelling content is generated. Extensive experiments are conducted, showing our method is effective and robust in blind image inpainting. Our VCN also allows for a wide spectrum of applications."



Paperid:1147
Authors:Jianbo Liu, Junjun He, Yu Qiao, Jimmy S. Ren, Hongsheng Li
Title: Learning to Predict Context-adaptive Convolution for Semantic Segmentation
Abstract:
Long-range contextual information is essential for achieving high-performance semantic segmentation. Previous feature re-weighting methods demonstrate that using global context for re-weighting feature channels can effectively improve the accuracy of semantic segmentation. However, the globally-sharing feature re-weighting vector might not be optimal for regions of different classes in the input image. In this paper, we propose a Context-adaptive Convolution Network (CaC-Net) to predict a spatially-varying feature weighting vector for each spatial location of the semantic feature maps. In CaC-Net, a set of context-adaptive convolution kernels are predicted from the global contextual information in a parameter-efficient manner. When used for convolution with the semantic feature maps, the predicted convolutional kernels can generate the spatially-varying feature weighting factors capturing both global and local contextual information. Comprehensive experimental results show that our CaC-Net achieves superior segmentation performance on three public datasets, PASCAL Context, PASCAL VOC 2012 and ADE20K."
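To illustrate the mechanism of predicting convolution kernels from global context and applying them to the semantic feature maps, here is a hedged sketch using a grouped-convolution trick for per-image kernels; the 1x1 kernel size, the global-average-pooled context, and the single linear prediction head are illustrative assumptions, not the CaC-Net architecture.

```python
# Sketch: predict per-image convolution kernels from pooled global context and
# apply them with a grouped convolution over the batch dimension.
import torch
import torch.nn.functional as F

class ContextAdaptiveConv(torch.nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.channels = channels
        self.kernel_head = torch.nn.Linear(channels, channels * channels)

    def forward(self, feat):                      # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        ctx = feat.mean(dim=(2, 3))               # global average pooled context, (B, C)
        kernels = self.kernel_head(ctx).view(b * c, c, 1, 1)
        # Treat the batch as groups so each image gets its own predicted kernels.
        out = F.conv2d(feat.view(1, b * c, h, w), kernels, groups=b)
        return out.view(b, c, h, w)

y = ContextAdaptiveConv(16)(torch.randn(2, 16, 8, 8))
```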



Paperid:1148
Authors:Jianbo Liu, Junjun He, Jiawei Zhang, Jimmy S. Ren, Hongsheng Li
Title: EfficientFCN: Holistically-guided Decoding for Semantic Segmentation
Abstract:
Both performance and efficiency are important to semantic segmentation. State-of-the-art semantic segmentation algorithms are mostly based on dilated Fully Convolutional Networks (dilatedFCN), which adopt dilated convolutions in the backbone networks to extract high-resolution feature maps for high-performance segmentation. However, because many convolution operations are conducted on high-resolution feature maps, such dilatedFCN-based methods result in large computational complexity and memory consumption. To balance performance and efficiency, there also exist encoder-decoder structures that gradually recover the spatial information by combining multi-level feature maps from the encoder. However, the performance of existing encoder-decoder methods is far from comparable with that of the dilatedFCN-based methods. In this paper, we propose the EfficientFCN, whose backbone is a common ImageNet pretrained network without any dilated convolution. A holistically-guided decoder is introduced to obtain the high-resolution semantic-rich feature maps via the multi-scale features from the encoder. The decoding task is converted into novel codebook generation and codeword assembly tasks, which take advantage of the high-level and low-level features from the encoder. Such a framework achieves comparable or even better performance than state-of-the-art methods with only 1/3 of the computational cost. Extensive experiments on PASCAL Context, PASCAL VOC, and ADE20K validate the effectiveness of the proposed EfficientFCN."



Paperid:1149
Authors:Henry Howard-Jenkins, Yiwen Li, Victor Adrian Prisacariu
Title: GroSS: Group-Size Series Decomposition for Grouped Architecture Search
Abstract:
We present a novel approach which is able to explore the configuration of grouped convolutions within neural networks. Group-size Series (GroSS) decomposition is a mathematical formulation of tensor factorisation into a series of approximations of increasing rank terms. GroSS allows for dynamic and differentiable selection of factorisation rank, which is analogous to a grouped convolution. Therefore, to the best of our knowledge, GroSS is the first method to enable simultaneous training of differing numbers of groups within a single layer, as well as all possible combinations between layers. In doing so, GroSS is able to train an entire grouped convolution architecture search-space concurrently. We demonstrate this through architecture searches with performance objectives on multiple datasets and networks. GroSS enables more effective and efficient search for grouped convolutional architectures."



Paperid:1150
Authors:Siyuan Liang, Xingxing Wei, Siyuan Yao, Xiaochun Cao
Title: Efficient Adversarial Attacks for Visual Object Tracking
Abstract:
Visual object tracking is an important task that requires the tracker to find objects quickly and accurately. The existing state-of-the-art object trackers, i.e., Siamese-based trackers, use DNNs to attain high accuracy. However, the robustness of visual tracking models is seldom explored. In this paper, we analyze the weaknesses of object trackers based on the Siamese network and then extend adversarial examples to visual object tracking. We present an end-to-end network, FAN (Fast Attack Network), that uses a novel drift loss combined with an embedded feature loss to attack Siamese network based trackers. On a single GPU, FAN trains efficiently and has strong attack performance. FAN can generate an adversarial example in 10 ms, and achieves effective targeted attacks (at least a 40% drop rate on OTB) and untargeted attacks (at least a 70% drop rate on OTB)."



Paperid:1151
Authors:Xin Peng, Yifu Wang, Ling Gao, Laurent Kneip
Title: Globally-Optimal Event Camera Motion Estimation
Abstract:
Event cameras are bio-inspired sensors that perform well in HDR conditions and have high temporal resolution. However, different from traditional frame-based cameras, event cameras measure asynchronous pixel-level brightness changes and return them in a highly discretised format, hence new algorithms are needed. The present paper looks at fronto-parallel motion estimation of an event camera. The flow of the events is modeled by a general homographic warping in a space-time volume, and the objective is formulated as a maximisation of contrast within the image of unwarped events. However, in stark contrast to prior art, we derive a globally optimal solution to this generally non-convex problem, and thus remove the dependency on a good initial guess. Our algorithm relies on branch-and-bound optimisation for which we derive novel, recursive upper and lower bounds for six different contrast estimation functions. The practical validity of our approach is supported by a highly successful application to AGV motion estimation with a downward facing event camera, a challenging scenario in which the sensor experiences fronto-parallel motion in front of noisy, fast moving textures."
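A toy numpy illustration of the contrast objective described above for a single fronto-parallel velocity hypothesis: events are warped to a reference time and the variance of the resulting image of unwarped events is measured. The synthetic events, the velocity-only warp and the choice of variance as the contrast function are assumptions; the paper's branch-and-bound bounds are not reproduced here.

# Toy illustration (not the authors' solver): evaluate the contrast objective for one
# candidate fronto-parallel velocity by warping events to a reference time and
# measuring the variance of the image of unwarped events.
import numpy as np

def contrast(events, vx, vy, t_ref=0.0, H=64, W=64):
    """events: array of (x, y, t) rows; (vx, vy): candidate pixel velocity."""
    x = events[:, 0] - vx * (events[:, 2] - t_ref)
    y = events[:, 1] - vy * (events[:, 2] - t_ref)
    img, _, _ = np.histogram2d(y, x, bins=(H, W), range=[[0, H], [0, W]])
    return img.var()            # one of several possible contrast functions

rng = np.random.default_rng(0)
t = rng.uniform(0, 1, 5000)
events = np.stack([10 + 20 * t + rng.normal(0, 0.5, t.size),   # x drifts with vx = 20
                   32 + rng.normal(0, 0.5, t.size),            # y roughly constant
                   t], axis=1)
# the correct velocity concentrates events and therefore scores a higher contrast
print(contrast(events, 0.0, 0.0), contrast(events, 20.0, 0.0))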



Paperid:1152
Authors:Petrissa Zell, Bodo Rosenhahn, Bastian Wandt
Title: Weakly-supervised Learning of Human Dynamics
Abstract:
This paper proposes a weakly-supervised learning framework for dynamics estimation from human motion. Although there are many solutions to capture pure human motion readily available, their data is not sufficient to analyze quality and efficiency of movements. Instead, the forces and moments driving human motion (the dynamics) need to be considered. Since recording dynamics is a laborious task that requires expensive sensors and complex, time-consuming optimization, dynamics data sets are small compared to human motion data sets and are rarely made public. The proposed approach takes advantage of easily obtainable motion data which enables weakly-supervised learning on small dynamics sets and weakly-supervised domain transfer. Our method includes novel neural network (NN) layers for forward and inverse dynamics during end-to-end training. On this basis, a cyclic loss between pure motion data can be minimized, i.e. no ground truth forces and moments are required during training. The proposed method achieves state-of-the-art results in terms of ground reaction force, ground reaction moment and joint torque regression and is able to maintain good performance on substantially reduced sets."



Paperid:1153
Authors:Royson Lee, Łukasz Dudziak, Mohamed Abdelfattah, Stylianos I. Venieris, Hyeji Kim, Hongkai Wen, Nicholas D. Lane
Title: Journey Towards Tiny Perceptual Super-Resolution
Abstract:
Recent works in single-image perceptual super-resolution (SR) have demonstrated unprecedented performance in generating realistic textures by means of deep convolutional networks. However, these convolutional models are large and expensive, preventing them from being deployed to devices that require them. In this work, we propose a neural architecture search (NAS) approach that integrates NAS and generative adversarial networks (GANs) with recent advances in perceptual SR and pushes the efficiency of small perceptual SR models to facilitate on-device execution. Specifically, we search over the architectures of both the generator and the discriminator sequentially, highlighting the unique challenges and key observations of searching for an SR-optimised discriminator and comparing them with existing discriminator architectures in the literature. Our tiny perceptual SR (TPSR) models outperform SRGAN and EnhanceNet on both full-reference perceptual metric (LPIPS) and distortion metric (PSNR) while being up to 26.4$\times$ more memory efficient and 33.6$\times$ more compute efficient respectively."



Paperid:1154
Authors:Lucy Chai, David Bau, Ser-Nam Lim, Phillip Isola
Title: What makes fake images detectable? Understanding properties that generalize
Abstract:
The quality of image generation and manipulation is reaching impressive levels, making it exceedingly difficult for a human to distinguish between what is real and what is fake. However, deep networks can still pick up on the subtle artifacts in these doctored images. We seek to understand what properties of these fake images make them detectable and identify what generalizes across different model architectures, datasets, and variations in training. We use a patch-based classifier with limited receptive fields to focus on low-level artifacts rather than global semantics, and use patch-wise predictions to localize the manipulated regions. We further show a technique to exaggerate these detectable properties and demonstrate that even when the image generator is adversarially finetuned against the fakeness classifier, it is still imperfect and makes detectable mistakes in similar regions of the image."



Paperid:1155
Authors:Pau Rodríguez, Issam Laradji, Alexandre Drouin, Alexandre Lacoste
Title: Embedding Propagation: Smoother Manifold for Few-Shot Classification
Abstract:
Few-shot classification is challenging because the data distribution of the training set can be widely different from that of the test set, as their classes are disjoint. This distribution shift often results in poor generalization. Manifold smoothing has been shown to address the distribution shift problem by extending the decision boundaries and reducing the noise of the class representations. Moreover, manifold smoothness is a key factor for semi-supervised learning and transductive learning algorithms. In this work, we propose to use embedding propagation as an unsupervised non-parametric regularizer for manifold smoothing in few-shot classification. Embedding propagation leverages interpolations between the extracted features of a neural network based on a similarity graph. We empirically show that embedding propagation yields a smoother embedding manifold. We also show that applying embedding propagation to a transductive classifier achieves new state-of-the-art results in \textit{mini}ImageNet, \textit{tiered}ImageNet, ImageNet-FS, and CUB. Furthermore, we show that embedding propagation consistently improves the accuracy of the models in multiple semi-supervised learning scenarios by up to 16\% points. The proposed embedding propagation operation can be easily integrated as a non-parametric layer into a neural network. We provide the training code and usage examples at https://github.com/ElementAI/embedding-propagation."
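A minimal numpy sketch of propagating embeddings over a similarity graph in the spirit of label propagation; the RBF similarity, the symmetric normalization, the closed-form propagation and the value of alpha are assumptions rather than the paper's exact operator.

# Minimal sketch: smooth a set of embeddings by propagating them over a similarity graph.
import numpy as np

def embedding_propagation(z, sigma=1.0, alpha=0.5):
    """z: (N, D) feature matrix -> (N, D) smoothed features."""
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)    # pairwise squared distances
    A = np.exp(-d2 / (2 * sigma ** 2))                     # RBF similarity graph
    np.fill_diagonal(A, 0.0)
    Dinv = 1.0 / np.sqrt(A.sum(1) + 1e-8)
    S = Dinv[:, None] * A * Dinv[None, :]                  # symmetrically normalized adjacency
    P = np.linalg.inv(np.eye(len(z)) - alpha * S)          # closed-form propagation matrix
    return P @ z

z = np.random.randn(25, 64)
print(embedding_propagation(z).shape)   # (25, 64)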



Paperid:1156
Authors:Xu Chen, Zijian Dong, Jie Song, Andreas Geiger, Otmar Hilliges
Title: Category Level Object Pose Estimation via Neural Analysis-by-Synthesis
Abstract:
Many object pose estimation algorithms rely on the analysis-by-synthesis framework which requires explicit representations of individual object instances. In this paper we combine a gradient-based fitting procedure with a parametric neural image synthesis module that is capable of implicitly representing the appearance, shape and pose of entire object categories, thus rendering explicit CAD models per object instance unnecessary. The image synthesis network is designed to efficiently span the pose configuration space so that model capacity can be used to capture the shape and local appearance (i.e., texture) variations jointly. At inference time the synthesized images are compared to the target via an appearance based loss and the error signal is backpropagated through the network to the input parameters, i.e. poses and appearances. Keeping the network parameters fixed, this allows for iterative optimization of the object pose, shape and appearance in a joint manner, and we experimentally show that the method can recover the orientation of objects with high accuracy from 2D images alone. When provided with depth measurements to overcome scale ambiguities, the method can successfully recover the full 6DoF pose."
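A schematic PyTorch sketch of the gradient-based fitting loop: the synthesis network stays frozen while the appearance loss is backpropagated to the pose and appearance inputs. The generator, loss and optimizer settings below are placeholders, not the paper's architecture.

# Sketch of gradient-based analysis-by-synthesis with a frozen, stand-in generator:
# only the pose and appearance inputs are optimized.
import torch

generator = torch.nn.Sequential(            # stand-in for a pretrained synthesis network
    torch.nn.Linear(6 + 16, 128), torch.nn.ReLU(), torch.nn.Linear(128, 64 * 64)
)
for p in generator.parameters():
    p.requires_grad_(False)                  # network weights stay fixed

target = torch.rand(1, 64 * 64)              # observed (flattened) target image
pose = torch.zeros(1, 6, requires_grad=True)         # 6-DoF pose parameters
appearance = torch.zeros(1, 16, requires_grad=True)  # latent appearance code
opt = torch.optim.Adam([pose, appearance], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    rendered = generator(torch.cat([pose, appearance], dim=1))
    loss = torch.nn.functional.mse_loss(rendered, target)   # appearance-based loss
    loss.backward()                                          # gradients flow to the inputs
    opt.step()
print(loss.item())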



Paperid:1157
Authors:Wonkwang Lee, Donggyun Kim, Seunghoon Hong, Honglak Lee
Title: High-Fidelity Synthesis with Disentangled Representation
Abstract:
Learning disentangled representation of data without supervision is an important step towards improving the interpretability of generative models. Despite recent advances in disentangled representation learning, existing approaches often suffer from the trade-off between representation learning and generation performance (i.e. improving generation quality sacrifices disentanglement performance). We propose an Information-Distillation Generative Adversarial Network (ID-GAN), a simple yet generic framework that can easily incorporate the existing state-of-the-art models for both disentanglement learning and high-fidelity synthesis. Our method learns disentangled representation using VAE-based models, and distills the learned representation together with additional nuisance variable to the separate GAN-based generator for high-fidelity synthesis. To ensure that both generative models are aligned to render the same generative factors, we additionally constrain the GAN generator to maximize the mutual information between the learned latent code and the output. Despite the simplicity, we show that the proposed method is surprisingly effective, achieving comparable image generation quality to the state-of-the-arts using the disentangled representation. We also show that the proposed decomposition leads to efficient and stable model design, and demonstrate it on high-resolution image synthesis task (1024x1024 pixels) for the first time using the disentangled representations."



Paperid:1158
Authors:Timothy Duff, Kathlén Kohn, Anton Leykin, Tomas Pajdla
Title: PL₁P - Point-line Minimal Problems under Partial Visibility in Three Views
Abstract:
We present a complete classification of minimal problems for generic arrangements of points and lines in space observed partially by three calibrated perspective cameras when each line is incident to at most one point. This is a large class of interesting minimal problems that allows for missing observations in images due to occlusions and missed detections. There is an infinite number of such minimal problems; however, we show that they can be reduced to 140616 equivalence classes by removing superfluous features and relabeling the cameras. We also introduce camera-minimal problems, which are practical for designing minimal solvers, and show how to pick a simplest camera-minimal problem for each minimal problem. This simplification results in 74575 equivalence classes. Only 76 of these were known; the rest are new. In order to identify problems that have potential for practical solving of image matching and 3D reconstruction, we present several smaller natural subfamilies of camera-minimal problems as well as compute solution counts for all camera-minimal problems which have less than 300 solutions for generic data."



Paperid:1159
Authors:Ke Han, Yan Huang, Zerui Chen, Liang Wang, Tieniu Tan
Title: Prediction and Recovery for Adaptive Low-Resolution Person Re-Identification
Abstract:
Low-resolution person re-identification (LR re-id) is a challenging task with low-resolution probes and high-resolution gallery images. To address the resolution mismatch, existing methods typically recover missing details for low-resolution probes by super-resolution. However, they usually pre-specify fixed scale factors for all images, and ignore the fact that choosing a preferable scale factor for certain image content probably greatly benefits the identification. In this paper, we propose a novel Prediction, Recovery and Identification (PRI) model for LR re-id, which adaptively recovers missing details by predicting a preferable scale factor based on the image content. To deal with the lack of ground-truth optimal scale factors, our model contains a self-supervised scale factor metric that automatically generates dynamic soft labels. The generated labels indicate probabilities that each scale factor is optimal, which are used as guidance to enhance the content-aware scale factor prediction. Consequently, our model can more accurately predict and recover the content-aware details, and achieve state-of-the-art performances on four LR re-id datasets."



Paperid:1160
Authors:Roei Herzig, Amir Bar, Huijuan Xu, Gal Chechik, Trevor Darrell, Amir Globerson
Title: Learning Canonical Representations for Scene Graph to Image Generation
Abstract:
Generating realistic images of complex visual scenes becomes challenging when one wishes to control the structure of the generated images. Previous approaches showed that scenes with few entities can be controlled using scene graphs, but this approach struggles as the complexity of the graph (the number of objects and edges) increases. In this work, we show that one limitation of current methods is their inability to capture semantic equivalence in graphs. We present a novel model that addresses these issues by learning canonical graph representations from the data, resulting in improved image generation for complex visual scenes. Our model demonstrates improved empirical performance on large scene graphs, robustness to noise in the input scene graph, and generalization on semantically equivalent graphs. Finally, we show improved performance of the model on three different benchmarks: Visual Genome, COCO, and CLEVR."



Paperid:1161
Authors:Maximilian Augustin, Alexander Meinke, Matthias Hein
Title: Adversarial Robustness on In- and Out-Distribution Improves Explainability
Abstract:
Neural networks have led to major improvements in image classification but suffer from being non-robust to adversarial changes, unreliable uncertainty estimates on out-distribution samples and their inscrutable black-box decisions. In this work we propose RATIO, a training procedure for Robustness via Adversarial Training on In- and Out-distribution, which leads to robust models with reliable and robust confidence estimates on the out-distribution. RATIO has similar generative properties to adversarial training so that visual counterfactuals produce class specific features. While adversarial training comes at the price of lower clean accuracy, RATIO achieves state-of-the-art $l_2$-adversarial robustness on CIFAR10 and maintains better clean accuracy."



Paperid:1162
Authors:Sunnie S. Y. Kim, Nicholas Kolkin, Jason Salavon, Gregory Shakhnarovich
Title: Deformable Style Transfer
Abstract:
Both geometry and texture are fundamental aspects of visual style. Existing style transfer methods, however, primarily focus on texture, almost entirely ignoring geometry. We propose deformable style transfer (DST), an optimization-based approach that jointly stylizes the texture and geometry of a content image to better match a style image. Unlike previous geometry-aware stylization methods, our approach is neither restricted to a particular domain (such as human faces), nor does it require training sets of matching style/content pairs. We demonstrate our method on a diverse set of content and style images including portraits, animals, objects, scenes, and paintings. Code has been made publicly available at https://github.com/sunniesuhyoung/DST."



Paperid:1163
Authors:Senthil Purushwalkam, Tian Ye, Saurabh Gupta, Abhinav Gupta
Title: Aligning Videos in Space and Time
Abstract:
In this paper, we focus on the task of extracting visual correspondences across videos. Given a query video clip from an action class, we aim to align it with training videos in space and time. Obtaining training data for such a fine-grained alignment task is challenging and often ambiguous. Hence, we propose a novel alignment procedure that learns such correspondence in space and time via cross video cycle-consistency. During training, given a pair of videos, we compute cycles that connect patches in a given frame in the first video by matching through frames in the second video. Cycles that connect overlapping patches together are encouraged to score higher than cycles that connect non-overlapping patches. Our experiments on the Penn Action and Pouring datasets demonstrate that the proposed method can successfully learn to correspond semantically similar patches across videos, and learns representations that are sensitive to object and action states."



Paperid:1164
Authors:Yuan Xue, Zihan Zhou, Xiaolei Huang
Title: Neural Wireframe Renderer: Learning Wireframe to Image Translations
Abstract:
In architecture and computer-aided design, wireframes (i.e., line-based models) are widely used as basic 3D models for design evaluation and fast design iterations. However, unlike a full design file, a wireframe model lacks critical information, such as detailed shape, texture, and materials, needed by a conventional renderer to produce 2D renderings of the objects or scenes. In this paper, we bridge the information gap by generating photo-realistic rendering of indoor scenes from wireframe models in an image translation framework. While existing image synthesis methods can generate visually pleasing images for common objects such as faces and birds, these methods do not explicitly model and preserve essential structural constraints in a wireframe model, such as junctions, parallel lines, and planar surfaces. To this end, we propose a novel model based on a structure-appearance joint representation learned from both images and wireframes. In our model, structural constraints are explicitly enforced by learning a joint representation in a shared encoder network that must support the generation of both images and wireframes. Experiments on a wireframe-scene dataset show that our wireframe-to-image translation model significantly outperforms the state-of-the-art methods in both visual quality and structural integrity of generated images."



Paperid:1165
Authors:Xiao Zhang, Rui Zhao, Yu Qiao, Hongsheng Li
Title: RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax
Abstract:
Deep neural networks have achieved remarkable success in learning feature representations for visual classification. However, deep features learned with the softmax cross-entropy loss generally show excessive intra-class variation. We argue that, because traditional softmax losses aim to optimize only the relative differences between intra-class and inter-class distances (logits), they cannot obtain representative class prototypes (class weights/centers) to regularize intra-class distances, even after training has converged. Previous efforts mitigate this problem by introducing auxiliary regularization losses. But these modified losses mainly focus on optimizing intra-class compactness, while neglecting to maintain reasonable relations between different class prototypes. This leads to weaker models and eventually limits their performance. To address this problem, this paper introduces novel Radial Basis Function (RBF) distances to replace the commonly used inner products in the softmax loss function, such that the loss can adaptively regularize the intra-class and inter-class distances by reshaping their relative differences, and thus create more representative class prototypes to improve optimization. The proposed RBF-Softmax loss function not only effectively reduces intra-class distances, stabilizes the training behavior, and preserves ideal relations between prototypes, but also significantly improves the testing performance. Experiments on visual recognition benchmarks including MNIST, CIFAR-10/100, and ImageNet demonstrate that the proposed RBF-Softmax achieves better results than cross-entropy and other state-of-the-art classification losses. The code is at https://github.com/2han9x1a0release/RBF-Softmax."
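A minimal sketch of the core idea: logits are computed from RBF distances between features and class prototypes instead of inner products, then passed to the usual cross-entropy. The particular kernel form, gamma and scale below are assumptions made for illustration.

# Sketch: RBF-distance logits in place of inner-product logits.
import torch
import torch.nn.functional as F

def rbf_logits(features, prototypes, gamma=0.1, scale=10.0):
    """features: (B, D), prototypes: (C, D) -> logits (B, C)."""
    d2 = torch.cdist(features, prototypes) ** 2      # squared Euclidean distances
    return scale * torch.exp(-gamma * d2)            # close to a prototype -> large logit

features = torch.randn(8, 128)
prototypes = torch.randn(10, 128, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = F.cross_entropy(rbf_logits(features, prototypes), labels)
print(loss.item())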



Paperid:1166
Authors:Kelvin Wong, Qiang Zhang, Ming Liang, Bin Yang, Renjie Liao, Abbas Sadat, Raquel Urtasun
Title: Testing the Safety of Self-driving Vehicles by Simulating Perception and Prediction
Abstract:
We present a novel method for testing the safety of self-driving vehicles in simulation. We propose an alternative to sensor simulation, as sensor simulation is expensive and has large domain gaps. Instead, we directly simulate the outputs of the self-driving vehicle’s perception and prediction system, enabling realistic motion planning testing. Specifically, we use paired data in the form of ground truth labels and real perception and prediction outputs to train a model that predicts what the online system will produce. Importantly, the inputs to our system consist of high definition maps, bounding boxes, and trajectories, which can be easily sketched by a test engineer in a matter of minutes. This makes our approach a much more scalable solution. Quantitative results on two large-scale datasets demonstrate that we can realistically test motion planning using our simulations."



Paperid:1167
Authors:Christian Reimers, Jakob Runge, Joachim Denzler
Title: Determining the Relevance of Features for Deep Neural Networks
Abstract:
Deep neural networks are tremendously successful in many applications, but end-to-end trained networks often result in hard to understand black-box classifiers or predictors. In this work, we present a novel method to identify whether a specific feature is relevant to a classifier’s decision or not. This relevance is determined at the level of the learned mapping, instead of for a single example. The approach does neither need retraining of the network nor information on intermediate results or gradients. The key idea of our approach builds upon concepts from causal inference. We interpret machine learning in a structural causal model and use Reichenbach’s common cause principle to infer whether a feature is relevant. We demonstrate empirically that the method is able to successfully evaluate the relevance of given features on three real-life data sets, namely MS COCO, CUB200 and HAM10000."



Paperid:1168
Authors:Liyi Chen, Weiwei Wu, Chenchen Fu, Xiao Han, Yuntao Zhang
Title: Weakly Supervised Semantic Segmentation with Boundary Exploration
Abstract:
Weakly supervised semantic segmentation with image-level labels has attracted a lot of attention recently because these labels are already available in most datasets. To obtain semantic segmentation under weak supervision, this paper presents a simple yet effective approach based on the idea of explicitly exploring object boundaries from training images to keep segmentation consistent with object boundaries. Specifically, we synthesize boundary annotations by exploiting coarse localization maps obtained from a CNN classifier, and use these annotations to train the proposed network, called BENet, which further excavates object boundaries to provide constraints for segmentation. Finally, the generated pseudo annotations of the training images are used to supervise an off-the-shelf segmentation network. We evaluate the proposed method on the PASCAL VOC 2012 benchmark and the final results achieve 65.7% and 66.6% mIoU scores on the val and test sets respectively, which outperforms previous methods trained under image-level supervision."



Paperid:1169
Authors:Wallace Lira, Johannes Merz, Daniel Ritchie, Daniel Cohen-Or, Hao Zhang
Title: GANHopper: Multi-Hop GAN for Unsupervised Image-to-Image Translation
Abstract:
We introduce GANHopper, an unsupervised image-to-image translation network that transforms images gradually between two domains, through multiple hops. Instead of executing translation directly, we steer the translation by requiring the network to produce in-between images that resemble weighted hybrids between images from the input domains. Our network is trained on unpaired images from the two domains only, without any in-between images. All hops are produced using a single generator along each direction. In addition to the standard cycle-consistency and adversarial losses, we introduce a new hybrid discriminator, which is trained to classify the intermediate images produced by the generator as weighted hybrids, with weights based on a predetermined hop count. We also add a smoothness term to constrain the magnitude of each hop, further regularizing the translation. Compared to previous methods, GANHopper excels at image translations involving domain-specific image features and geometric variations while also preserving non-domain-specific features such as general color schemes."



Paperid:1170
Authors:Philippe Weinzaepfel, Romain Brégier, Hadrien Combaluzier, Vincent Leroy, Grégory Rogez
Title: DOPE: Distillation Of Part Experts for whole-body 3D pose estimation in the wild
Abstract:
We introduce DOPE, the first method to detect and estimate whole-body 3D human poses, including bodies, hands and faces, in the wild. Achieving this level of details is key for a number of applications that require understanding the interactions of the people with each other or with the environment. The main challenge is the lack of in-the-wild data with labeled whole-body 3D poses. In previous work, training data has been annotated or generated for simpler tasks focusing on bodies, hands or faces separately. In this work, we propose to take advantage of these datasets to train independent experts for each part, namely a body, a hand and a face expert, and distill their knowledge into a single deep network designed for whole-body 2D-3D pose detection. In practice, given a training image with partial or no annotation, each part expert detects its subset of keypoints in 2D and 3D and the resulting estimations are combined to obtain whole-body pseudo ground-truth poses. A distillation loss encourages the whole-body predictions to mimic the experts' outputs. Our results show that this approach significantly outperforms the same whole-body model trained without distillation while staying close to the performance of the experts. Importantly, DOPE is computationally less demanding than the ensemble of experts and can achieve real-time performance. Test code and models are available at https://europe.naverlabs.com/research/computer-vision/dope."



Paperid:1171
Authors:Nikolas Adaloglou, Nicholas Vretos, Petros Daras
Title: Multi-view adaptive graph convolutions for graph classification
Abstract:
In this paper, a novel multi-view methodology for graph-based neural networks is proposed. A systematic and methodological adaptation of the key concepts of classical deep learning methods such as convolution, pooling and multi-view architectures is developed for the context of non-Euclidean manifolds. The aim of the proposed work is to present a novel multi-view graph convolution layer, as well as a new view pooling layer making use of: a) a new hybrid Laplacian that is adjusted based on feature distance metric learning, b) multiple trainable representations of a feature matrix of a graph, using trainable distance matrices, adapting the notion of views to graphs and c) a multi-view graph aggregation scheme called graph view pooling, in order to synthesise information from the multiple generated "views". The aforementioned layers are used in an end-to-end graph neural network architecture for graph classification and show competitive results to other state-of-the-art methods."



Paperid:1172
Authors:Ke Mei, Chuang Zhu, Jiaqi Zou, Shanghang Zhang
Title: Instance Adaptive Self-Training for Unsupervised Domain Adaptation
Abstract:
The divergence between labeled training data and unlabeled testing data is a significant challenge for recent deep learning models. Unsupervised domain adaptation (UDA) attempts to solve such a problem. Recent works show that self-training is a powerful approach to UDA. However, existing methods have difficulty in balancing scalability and performance. In this paper, we propose an instance adaptive self-training framework for UDA on the task of semantic segmentation. To effectively improve the quality of pseudo-labels, we develop a novel pseudo-label generation strategy with an instance adaptive selector. Besides, we propose the region-guided regularization to smooth the pseudo-label region and sharpen the non-pseudo-label region. Our method is so concise and efficient that it can easily be generalized to other unsupervised domain adaptation methods. Experiments on `GTA5 to Cityscapes' and `SYNTHIA to Cityscapes' demonstrate the superior performance of our approach compared with the state-of-the-art methods."
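A hedged sketch of what an instance adaptive pseudo-label selector could look like: each target image gets its own per-class confidence thresholds instead of one global cut-off shared by all images. The specific thresholding rule below is an assumption, not the paper's selector.

# Sketch (thresholding rule is an assumption): keep a pixel's pseudo-label only if its
# confidence clears a per-image, per-class threshold rather than a single global one.
import numpy as np

def instance_adaptive_pseudo_labels(probs, base_thresh=0.9, keep_ratio=0.5):
    """probs: (C, H, W) softmax output for one target image -> (H, W) pseudo-labels."""
    conf = probs.max(0)                    # per-pixel confidence
    labels = probs.argmax(0)
    pseudo = np.full(labels.shape, 255)    # 255 = ignore label
    for c in np.unique(labels):
        c_conf = conf[labels == c]
        # whichever is lower: the global threshold, or the quantile keeping keep_ratio pixels
        thr = min(base_thresh, np.quantile(c_conf, 1.0 - keep_ratio))
        pseudo[(labels == c) & (conf >= thr)] = c
    return pseudo

probs = np.random.dirichlet(np.ones(19), size=(64, 64)).transpose(2, 0, 1)
print((instance_adaptive_pseudo_labels(probs) != 255).mean())   # fraction of kept pixels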



Paperid:1173
Authors:Juseung Yun, Byungjoo Kim, Junmo Kim
Title: Weight Decay Scheduling and Knowledge Distillation for Active Learning
Abstract:
Although convolutional neural networks perform extremely well for numerous computer vision tasks, a considerably large amount of labeled data is required to ensure a good outcome. Data labeling is labor-intensive, and in some cases, the labeling budget may be limited. Active learning is a technique that can reduce the labeling required. With this technique, the neural network selects on its own the unlabeled data most helpful for learning, and then requests the human annotator for the labels. Most existing active learning methods have focused on acquisition functions for an effective selection of the informative samples. However, in this paper, we focus on the data-incremental nature of active learning, and propose a method for properly tuning the weight decay as the amount of data increases. We also demonstrate that the performance can be improved by knowledge distillation using a low-performance teacher model trained from the previous acquisition step. In addition, we present a novel perspective of the weight decay, which provides a regularization effect by limiting the number of effective parameters and channels in the convolutional filter. We validate our methods on the MNIST, CIFAR-10, and CIFAR-100 datasets using convolutional neural networks of various sizes."



Paperid:1174
Authors:Hai Victor Habi, Roy H. Jennings, Arnon Netzer
Title: HMQ: Hardware Friendly Mixed Precision Quantization Block for CNNs
Abstract:
Recent work in network quantization produced state-of-the-art results using mixed precision quantization. An imperative requirement for many efficient edge device hardware implementations is that their quantizers are uniform and with power-of-two thresholds. In this work, we introduce the Hardware Friendly Mixed Precision Quantization Block (HMQ) in order to meet this requirement. The HMQ is a mixed precision quantization block that repurposes the Gumbel-Softmax estimator into a smooth estimator of a pair of quantization parameters, namely, bit-width and threshold. HMQs use this to search over a finite space of quantization schemes. Empirically, we apply HMQs to quantize classification models trained on CIFAR10 and ImageNet. For ImageNet, we quantize four different architectures and show that, in spite of the added restrictions to our quantization scheme, we achieve competitive and, in some cases, state-of-the-art results."
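A toy sketch of the ingredients named in the abstract: a uniform, symmetric quantizer whose threshold is a power of two, and a soft (softmax-weighted) mixture over candidate (bit-width, threshold) pairs standing in for the Gumbel-Softmax search. The candidate grids, the temperature, and the omission of a straight-through estimator for the rounding are simplifications made for this example.

# Toy sketch of a hardware-friendly search space: uniform quantizers with power-of-two
# thresholds, softly selected by learnable scores over (bit-width, threshold) candidates.
import torch

def uniform_quantize(x, bits, threshold):
    levels = 2 ** (bits - 1) - 1
    step = threshold / levels
    # note: round() blocks gradients; a straight-through estimator would be used in practice
    return torch.clamp(torch.round(x / step), -levels, levels) * step

def hmq_forward(x, logits, bit_choices=(2, 4, 8), thr_exponents=(-1, 0, 1), tau=1.0):
    """logits: learnable selection scores, one per (bit-width, threshold) candidate."""
    probs = torch.softmax(logits / tau, dim=0)
    out = torch.zeros_like(x)
    i = 0
    for b in bit_choices:
        for e in thr_exponents:
            out = out + probs[i] * uniform_quantize(x, b, 2.0 ** e)  # power-of-two threshold
            i += 1
    return out

x = torch.randn(1000)
logits = torch.zeros(9, requires_grad=True)
print(hmq_forward(x, logits).shape)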



Paperid:1175
Authors:Christopher Zach, Huu Le
Title: Truncated Inference for Latent Variable Optimization Problems: Application to Robust Estimation and Learning
Abstract:
Optimization problems with an auxiliary latent variable structure in addition to the main model parameters occur frequently in computer vision and machine learning. The additional latent variables make the underlying optimization task expensive, either in terms of memory (by maintaining the latent variables), or in terms of runtime (repeated exact inference of latent variables). We aim to remove the need to maintain the latent variables and propose two formally justified methods, that dynamically adapt the required accuracy of latent variable inference. These methods have applications in large scale robust estimation and in learning the parameters of an energy-based model from labeled data."



Paperid:1176
Authors:Weizeng Lu, Xi Jia, Weicheng Xie, Linlin Shen, Yicong Zhou, Jinming Duan
Title: Geometry Constrained Weakly Supervised Object Localization
Abstract:
We propose a geometry constrained network, termed GC-Net, for weakly supervised object localization (WSOL). GC-Net consists of three modules: a detector, a generator and a classifier. The detector predicts the object location defined by a set of coefficients describing a geometric shape (i.e. ellipse or rectangle), which is geometrically constrained by the mask produced by the generator. The classifier takes the resulting masked images as input and performs two complementary classification tasks for the object and background. To make the mask more compact and more complete, we propose a novel multi-task loss function that takes into account the area of the geometric shape, the categorical cross entropy and the negative entropy. In contrast to previous approaches, GC-Net is trained end-to-end and predicts the object location without any post-processing (e.g. thresholding) that may require additional tuning. Extensive experiments on the CUB-200-2011 and ILSVRC2012 datasets show that GC-Net outperforms state-of-the-art methods by a large margin. Our source code is available at https://github.com/lwzeng/GC-Net."
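An illustrative sketch of how predicted ellipse coefficients can be turned into a soft, differentiable mask with a sigmoid, so that gradients can reach the detector through the generated mask; the parameterization and the sharpness constant k below are assumptions, not the paper's exact generator.

# Sketch: soft ellipse mask from (center, axes, rotation) coefficients.
import torch

def ellipse_mask(cx, cy, a, b, theta, H=56, W=56, k=20.0):
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    x, y = xs - cx, ys - cy
    xr = x * torch.cos(theta) + y * torch.sin(theta)     # rotate into the ellipse frame
    yr = -x * torch.sin(theta) + y * torch.cos(theta)
    d = (xr / a) ** 2 + (yr / b) ** 2                    # < 1 inside, > 1 outside
    return torch.sigmoid(k * (1.0 - d))                  # soft, differentiable interior mask

mask = ellipse_mask(torch.tensor(0.5), torch.tensor(0.5),
                    torch.tensor(0.3), torch.tensor(0.2), torch.tensor(0.4))
print(mask.shape, mask.max().item())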



Paperid:1177
Authors:Kshitij Dwivedi, Jiahui Huang, Radoslaw Martin Cichy, Gemma Roig
Title: Duality Diagram Similarity: a generic framework for initialization selection in task transfer learning
Abstract:
In this paper, we tackle an open research question in transfer learning, which is selecting a model initialization to achieve high performance on a new task, given several pre-trained models. We propose a new highly efficient and accurate approach based on duality diagram similarity (DDS) between deep neural networks (DNNs). DDS is a generic framework to represent and compare data of different feature dimensions. We validate our approach on the Taskonomy dataset by measuring the correspondence between actual transfer learning performance rankings on 17 taskonomy tasks and predicted rankings. Computing DDS based ranking for $17\times17$ transfers requires less than 2 minutes and shows a high correlation ($0.86$) with actual transfer learning rankings, outperforming state-of-the-art methods by a large margin ($10\%$) on the Taskonomy benchmark. We also demonstrate the robustness of our model selection approach to a new task, namely Pascal VOC semantic segmentation. Additionally, we show that our method can be applied to select the best layer locations within a DNN for transfer learning on 2D, 3D and semantic tasks on NYUv2 and Pascal VOC datasets."
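As one familiar instance of a similarity between two feature matrices of different dimensionality, the sketch below computes linear CKA with feature centering. It is shown only to make the idea of comparing DNN representations on the same images concrete; it is not claimed to be the paper's exact DDS formulation.

# Sketch: linear CKA between two feature matrices extracted on the same N images.
import numpy as np

def linear_cka(X, Y):
    """X: (N, D1), Y: (N, D2) features of two DNNs -> scalar similarity in [0, 1]."""
    X = X - X.mean(0, keepdims=True)          # center features
    Y = Y - Y.mean(0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

X = np.random.randn(100, 256)
print(linear_cka(X, X), linear_cka(X, np.random.randn(100, 64)))  # ~1.0 vs. small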



Paperid:1178
Authors:Yaniv Benny, Lior Wolf
Title: OneGAN: Simultaneous Unsupervised Learning of Conditional Image Generation, Foreground Segmentation, and Fine-Grained Clustering
Abstract:
We present a method for simultaneously learning, in an unsupervised manner, (i) a conditional image generator, (ii) foreground extraction and segmentation, (iii) clustering into a two-level class hierarchy, and (iv) object removal and background completion, all done without any use of annotation. The method combines a generative adversarial network and a variational autoencoder, with multiple encoders, generators and discriminators, and benefits from solving all tasks at once. The input to the training scheme is a varied collection of unlabeled images from the same domain, as well as a set of background images without a foreground object. In addition, the image generator can mix the background from one image with a foreground that is conditioned either on that of a second image or on the index of a desired cluster. The method obtains state-of-the-art results when compared to the current state of the art in each of the tasks."



Paperid:1179
Authors:Nikolay Malkin, Anthony Ortiz, Nebojsa Jojic
Title: Mining self-similarity: Label super-resolution with epitomic representations
Abstract:
We show that simple patch-based models, such as epitomes (Jojic et al., 2003), can have superior performance to the current state of the art in semantic segmentation and label super-resolution, which uses deep convolutional neural networks. We derive a new training algorithm for epitomes which allows, for the first time, learning from very large data sets and derive a label super-resolution algorithm as a statistical inference algorithm over epitomic representations. We illustrate our methods on land cover mapping and medical image analysis tasks."



Paperid:1180
Authors:Dongsheng An, Yang Guo, Min Zhang, Xin Qi, Na Lei, Xianfang Gu
Title: AE-OT-GAN: Training GANs from data specific latent distribution
Abstract:
Though generative adversarial networks (GANs) are prominent models for generating realistic and crisp images, they are unstable to train and suffer from mode collapse/mixture. The problems of GANs come from approximating the intrinsically discontinuous distribution transform map with continuous DNNs. The recently proposed AE-OT model addresses the discontinuity problem by explicitly computing the discontinuous optimal transport map in the latent space of the autoencoder. Though it has no mode collapse/mixture, the images generated by AE-OT are blurry. In this paper, we propose the AE-OT-GAN model to combine the advantages of both models: generating high quality images while overcoming the mode collapse/mixture problems. Specifically, we first embed the low dimensional image manifold into the latent space by training an autoencoder (AE). Then the extended semi-discrete optimal transport (SDOT) map from the uniform distribution to the empirical latent distribution is used to generate new latent codes. Finally, our GAN model is trained to generate high quality images from the latent distribution induced by the extended SDOT map. The distribution transform map from this dataset-specific latent distribution to the data distribution is continuous, and thus can be well approximated by continuous DNNs. Additionally, the pairing between latent codes and real images imposes a further restriction on the generator and stabilizes the training process. Experiments on the simple MNIST dataset and complex datasets like CIFAR10 and CelebA show the advantages of the proposed method."



Paperid:1181
Authors:Thomas Kehrenberg, Myles Bartlett, Oliver Thomas, Novi Quadrianto
Title: Null-sampling for Interpretable and Fair Representations
Abstract:
We propose to learn invariant representations, in the data domain, to achieve interpretability in algorithmic fairness. Invariance implies a selectivity for high level, relevant correlations w.r.t. class label annotations, and a robustness to irrelevant correlations with protected characteristics such as race or gender. We introduce a non-trivial setup in which the training set exhibits a strong bias such that class label annotations are irrelevant and spurious correlations cannot be distinguished. To address this problem, we introduce an adversarially trained model with a null-sampling procedure to produce invariant representations in the data domain. To enable disentanglement, a partially-labelled representative set is used. By placing the representations into the data domain, the changes made by the model are easily examinable by human auditors. We show the effectiveness of our method on both image and tabular datasets: Coloured MNIST, the CelebA and the Adult dataset."



Paperid:1182
Authors:Lam Huynh, Phong Nguyen-Ha, Jiri Matas, Esa Rahtu, Janne Heikkilä
Title: Guiding Monocular Depth Estimation Using Depth-Attention Volume
Abstract:
Recovering the scene depth from a single image is an ill-posed problem that requires additional priors, often referred to as monocular depth cues, to disambiguate different 3D interpretations. In recent works, those priors have been learned in an end-to-end manner from large datasets by using deep neural networks. In this paper, we propose guiding depth estimation to favor planar structures that are ubiquitous especially in indoor environments. This is achieved by incorporating a non-local coplanarity constraint to the network with a novel attention mechanism called depth-attention volume (DAV). Experiments on two popular indoor datasets, namely NYU-Depth-v2 and ScanNet, show that our method achieves state-of-the-art depth estimation results while using only a fraction of the number of parameters needed by the competing methods."



Paperid:1183
Authors:Adam W. Harley, Shrinidhi Kowshika Lakshmikanth, Paul Schydlo, Katerina Fragkiadaki
Title: Tracking Emerges by Looking Around Static Scenes, with Neural 3D Mapping
Abstract:
We hypothesize that an agent that can look around in static scenes can learn rich visual representations applicable to 3D object tracking in complex dynamic scenes. We are motivated in this pursuit by the fact that the physical world itself is mostly static, and multiview correspondence labels are relatively cheap to collect in static scenes, e.g., by triangulation. We propose to leverage multiview data of static points in arbitrary scenes (static or dynamic), to learn a neural 3D mapping module which produces features that are correspondable across time. The neural 3D mapper consumes RGB-D data as input, and produces a 3D voxel grid of deep features as output. We train the voxel features to be correspondable across viewpoints, using a contrastive loss, and correspondability across time emerges automatically. At test time, given an RGB-D video with approximate camera poses, and given the 3D box of an object to track, we track the target object by generating a map of each timestep and locating the object's features within each map. In contrast to models that represent video streams in 2D or 2.5D, our model's 3D scene representation is disentangled from projection artifacts, is stable under camera motion, and is robust to partial occlusions. We test the proposed architectures in challenging simulated and real data, and show that our unsupervised 3D object trackers outperform prior unsupervised 2D and 2.5D trackers, and approach the accuracy of supervised trackers. This work demonstrates that 3D object trackers can emerge without tracking labels, through multiview self-supervision on static data."



Paperid:1184
Authors:Yuanyi Zhong, Jianfeng Wang, Jian Peng, Lei Zhang
Title: Boosting Weakly Supervised Object Detection with Progressive Knowledge Transfer
Abstract:
In this paper, we propose an effective knowledge transfer framework to boost the weakly supervised object detection accuracy with the help of an external fully-annotated source dataset, whose categories may not overlap with the target domain. This setting is of great practical value due to the existence of many off-the-shelf detection datasets. To more effectively utilize the source dataset, we propose to iteratively transfer the knowledge from the source domain by a one-class universal detector and learn the target-domain detector. The box-level pseudo ground truths mined by the target-domain detector in each iteration effectively improve the one-class universal detector. Therefore, the knowledge in the source dataset is more thoroughly exploited and leveraged. Extensive experiments are conducted with Pascal VOC 2007 as the target weakly-annotated dataset and COCO/ImageNet as the source fully-annotated dataset. With the proposed solution, we achieved an mAP of $59.7\%$ detection performance on the VOC test set and an mAP of $60.2\%$ after retraining a fully supervised Faster RCNN with the mined pseudo ground truths. This is significantly better than any previously known results in related literature and sets a new state-of-the-art of weakly supervised object detection under the knowledge transfer setting."



Paperid:1185
Authors:Ayan Das, Yongxin Yang, Timothy Hospedales, Tao Xiang, Yi-Zhe Song
Title: BézierSketch: A generative model for scalable vector sketches
Abstract:
The study of neural generative models of human sketches is a fascinating contemporary modeling problem due to the links between sketch image generation and the human drawing process. The landmark SketchRNN provided a breakthrough by sequentially generating sketches as a sequence of waypoints. However, this leads to low-resolution image generation and failure to model long sketches. In this paper we present BézierSketch, a novel generative model for fully vector sketches that are automatically scalable and high-resolution. To this end, we first introduce a novel inverse graphics approach to stroke embedding that trains an encoder to embed each stroke to its best fit Bézier curve. This enables us to treat sketches as short sequences of parameterized strokes and thus train a recurrent sketch generator with greater capacity for longer sketches, while producing scalable high-resolution results. We report qualitative and quantitative results on the Quick, Draw! benchmark."
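For concreteness, the classical least-squares fit of a cubic Bézier curve to a stroke under a chord-length parameterization is sketched below; the paper instead learns this embedding with an encoder, so this closed-form fit is only an illustration of the stroke-to-control-points idea.

# Sketch: fit a cubic Bezier curve to stroke samples by linear least squares.
import numpy as np

def fit_cubic_bezier(points):
    """points: (N, 2) stroke samples -> (4, 2) Bezier control points."""
    d = np.r_[0, np.cumsum(np.linalg.norm(np.diff(points, axis=0), axis=1))]
    t = d / d[-1]                                        # chord-length parameter in [0, 1]
    B = np.stack([(1 - t) ** 3, 3 * t * (1 - t) ** 2,
                  3 * t ** 2 * (1 - t), t ** 3], axis=1) # Bernstein basis, shape (N, 4)
    ctrl, *_ = np.linalg.lstsq(B, points, rcond=None)    # solve B @ ctrl ~= points
    return ctrl

t = np.linspace(0, 1, 50)
stroke = np.stack([t, 0.2 * np.sin(2 * np.pi * t)], axis=1)
print(fit_cubic_bezier(stroke))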



Paperid:1186
Authors:Zeqi Li, Ruowei Jiang, Parham Aarabi
Title: Semantic Relation Preserving Knowledge Distillation for Image-to-Image Translation
Abstract:
Generative adversarial networks (GANs) have shown significant potential in modeling high dimensional distributions of image data, especially on image-to-image translation tasks. However, due to the complexity of these tasks, state-of-the-art models often contain a tremendous amount of parameters, which results in large model size and long inference time. In this work, we propose a novel method to address this problem by applying knowledge distillation together with distillation of a semantic relation preserving matrix. This matrix, derived from the teacher's feature encoding, helps the student model learn better semantic relations. In contrast to existing compression methods designed for classification tasks, our proposed method adapts well to the image-to-image translation task on GANs. Experiments conducted on 5 different datasets and 3 different pairs of teacher and student models provide strong evidence that our methods achieve impressive results both qualitatively and quantitatively."
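A sketch of the general idea: build a pairwise semantic-relation matrix from a feature encoding and penalize the difference between the teacher's and the student's matrices alongside the usual distillation losses. The cosine similarity over spatial locations and the L1 penalty are assumptions made for this example.

# Sketch: distill a pairwise semantic-relation matrix from teacher to student features.
import torch
import torch.nn.functional as F

def relation_matrix(feat):
    """feat: (B, C, H, W) -> (B, HW, HW) pairwise cosine similarities of spatial locations."""
    f = F.normalize(feat.flatten(2).transpose(1, 2), dim=-1)   # (B, HW, C), unit-norm rows
    return f @ f.transpose(1, 2)

def relation_distillation_loss(student_feat, teacher_feat):
    return F.l1_loss(relation_matrix(student_feat), relation_matrix(teacher_feat))

t_feat = torch.randn(2, 256, 16, 16)
s_feat = torch.randn(2, 64, 16, 16)     # the student may have fewer channels
print(relation_distillation_loss(s_feat, t_feat).item())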



Paperid:1187
Authors:Brady Zhou, Nimit Kalra, Philipp Krähenbühl
Title: Domain Adaptation Through Task Distillation
Abstract:
Deep networks devour millions of precisely annotated images to build their complex and powerful representations. Unfortunately, tasks like autonomous driving have virtually no real-world training data. Repeatedly crashing a car into a tree is simply too expensive. The commonly prescribed solution is simple: learn a representation in simulation and transfer it to the real world. However, this transfer is challenging since simulated and real-world visual experiences vary dramatically. Our core observation is that for certain tasks, such as image recognition, datasets are plentiful. They exist in any interesting domain, simulated or real, and are easy to label and extend. We use these recognition datasets to link up a source and target domain to transfer models between them in a task distillation framework. Our method can successfully transfer navigation policies between drastically different simulators: ViZDoom, SuperTuxKart, and CARLA. Furthermore, it shows promising results on standard domain adaptation benchmarks."



Paperid:1188
Authors:Chenglin Yang, Adam Kortylewski, Cihang Xie, Yinzhi Cao, Alan Yuille
Title: PatchAttack: A Black-box Texture-based Attack with Reinforcement Learning
Abstract:
Patch-based attacks introduce a perceptible but localized change to the input that induces misclassification. A limitation of current patch-based black-box attacks is that they perform poorly for targeted attacks, and even for the less challenging non-targeted scenarios, they require a large number of queries. Our proposed PatchAttack is query efficient and can break models for both targeted and non-targeted attacks. PatchAttack induces misclassifications by superimposing small textured patches on the input image. We parametrize the appearance of these patches by a dictionary of class-specific textures. This texture dictionary is learned by clustering Gram matrices of feature activations from a VGG backbone. PatchAttack optimizes the position and texture parameters of each patch using reinforcement learning. Our experiments show that PatchAttack achieves >99% success rate on ImageNet for a wide range of architectures, while only manipulating 3% of the image for non-targeted attacks and 10% on average for targeted attacks. Furthermore, we show that PatchAttack circumvents state-of-the-art adversarial defense methods successfully."



Paperid:1189
Authors:Yu Liu, Sarah Parisot, Gregory Slabaugh, Xu Jia, Ales Leonardis, Tinne Tuytelaars
Title: More Classifiers, Less Forgetting: A Generic Multi-classifier Paradigm for Incremental Learning
Abstract:
Overcoming catastrophic forgetting in neural networks is a long-standing and core research objective for incremental learning. Notable studies have shown regularization strategies enable the network to remember previously acquired knowledge devoid of heavy forgetting. Since those regularization strategies are mostly associated with classifier outputs, we propose a MUlti-Classifier (MUC) incremental learning paradigm that integrates an ensemble of auxiliary classifiers to estimate more effective regularization constraints. Additionally, we extend two common methods, focusing on parameter and activation regularization, from the conventional single-classifier paradigm to MUC. Our classifier ensemble promotes regularizing network parameters or activations when moving to learn the next task. Under the setting of task-agnostic evaluation, our experimental results on CIFAR-100 and Tiny ImageNet incremental benchmarks show that our method outperforms other baselines. Specifically, MUC obtains 3%-5% accuracy boost and 4%-5% decline of forgetting ratio, compared with MAS and LwF. Our code is available at https://github.com/Liuy8/MUC."



Paperid:1190
Authors:Bram Wallace, Bharath Hariharan
Title: Extending and Analyzing Self-Supervised Learning Across Domains
Abstract:
Self-supervised representation learning has achieved impressive results in recent years, with experiments conducted primarily on ImageNet or other similarly large internet imagery datasets. There has been little to no work with these methods on other smaller domains, such as satellite, textural, or biological imagery. We experiment with several popular methods on an unprecedented variety of domains. We discover, among other findings, that Rotation is the most semantically meaningful task, while much of the performance of Jigsaw is attributable to the nature of its induced distribution rather than semantic understanding. Additionally, there are several areas, such as fine-grained classification, where all tasks underperform. We quantitatively and qualitatively diagnose the reasons for these failures and successes via novel experiments studying pretext generalization, random labelings, and implicit dimensionality. Code and models are available at https://github.com/BramSW/Extending_SSRL_Across_Domains/."



Paperid:1191
Authors:Sayan Rakshit, Dipesh Tamboli, Pragati Shuddhodhan Meshram, Biplab Banerjee, Gemma Roig, Subhasis Chaudhuri
Title: Multi-Source Open-Set Deep Adversarial Domain Adaptation
Abstract:
We introduce a novel learning paradigm based on multi-source open-set unsupervised domain adaptation (MS-OSDA). Recently, the notion of single-source open-set domain adaptation (OSDA) has drawn much attention which considers the presence of previously unseen open-set (unknown) classes in the target-domain in addition to the source-domain closed-set (known) classes. It is reasonable to assume that labeled samples may be distributed over multiple source-domains while the target-domain is equipped with both the closed-set and open-set data. However, the existing single-source OSDA techniques cannot be directly extended to such a multi-source scenario considering the inhomogeneities present among the different source-domains. As a remedy, we propose a novel adversarial learning-driven approach to deal with the MS-OSDA setup. Precisely, we model a shared feature space for all the domains while encouraging fine-grained alignment among the known-class samples. Besides, an adversarial learning strategy is followed to model the discriminator between the target-domain known and unknown classes. We validate our method on the Office-31, Office-Home, Office-CalTech, and Digits datasets and find our model to consistently outperform the relevant literature."



Paperid:1192
Authors:Wen-Hsuan Chu, Kris M. Kitani
Title: Neural Batch Sampling with Reinforcement Learning for Semi-Supervised Anomaly Detection
Abstract:
We are interested in the detection and segmentation of anomalies in images where the anomalies are typically small (i.e., a small tear in woven fabric, broken pin of an IC chip). From a statistical learning point of view, anomalies have low occurrence probability and are not from the main modes of a data distribution. Learning a generative model of anomalous data from a natural distribution of data can be difficult because the data distribution is heavily skewed towards a large amount of non-anomalous data. When training a generative model on such imbalanced data using an iterative learning algorithm like stochastic gradient descent (SGD), we observe an expected yet interesting trend in the loss values (a measure of the learned model's performance) after each gradient update across data samples. Naturally, as the model sees more non-anomalous data during training, the loss values over a non-anomalous data sample decrease, while the loss values on an anomalous data sample fluctuate. In this work, our key hypothesis is that this change in loss values during training can be used as a feature to identify anomalous data. In particular, we propose a novel semi-supervised learning algorithm for anomaly detection and segmentation using an anomaly classifier that uses as input the \textit{loss profile} of a data sample processed through an autoencoder. The loss profile is defined as a sequence of reconstruction loss values produced during iterative training. To amplify the difference in loss profiles between anomalous and non-anomalous data, we also introduce a Reinforcement Learning based meta-algorithm, which we call the neural batch sampler, to strategically sample training batches during autoencoder training. Experimental results on multiple datasets with a high diversity of textures and objects, often with multiple modes of defects within them, demonstrate the capabilities and effectiveness of our method when compared with existing state-of-the-art baselines."
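A minimal sketch of collecting loss profiles: after every training epoch of an autoencoder, record each sample's reconstruction loss, and use the resulting per-sample sequence as the input feature of an anomaly classifier. The tiny autoencoder, the epoch count and the full-batch updates are placeholders; the RL-based batch sampler is not reproduced here.

# Sketch: record per-sample reconstruction losses over training to form loss profiles.
import torch

ae = torch.nn.Sequential(torch.nn.Linear(32, 8), torch.nn.ReLU(), torch.nn.Linear(8, 32))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
data = torch.randn(100, 32)                       # mostly non-anomalous samples

profiles = []                                     # one loss value per sample per epoch
for epoch in range(20):
    with torch.no_grad():
        per_sample = ((ae(data) - data) ** 2).mean(1)   # reconstruction loss per sample
    profiles.append(per_sample)
    opt.zero_grad()
    ((ae(data) - data) ** 2).mean().backward()
    opt.step()

loss_profiles = torch.stack(profiles, dim=1)      # (num_samples, num_epochs)
print(loss_profiles.shape)                        # feed these sequences to a classifier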



Paperid:1193
Authors:Baoxiong Jia, Yixin Chen, Siyuan Huang, Yixin Zhu, Song-Chun Zhu
Title: LEMMA: A Multi-view Dataset for LEarning Multi-agent Multi-task Activities
Abstract:
The ability to understand and interpret human actions is a long-standing challenge and a critical indicator of perception in artificial intelligence. However, a few imperative components of daily human activities are largely missed in prior literature, including the goal-directed actions, concurrent multi-tasks, and collaborations among multi-agents. We introduce the LEMMA dataset to provide a single home to address these missing dimensions with carefully designed settings, wherein the numbers of tasks and agents vary to highlight different learning objectives. We densely annotate the atomic-actions with human-object interactions to provide ground-truth of the compositionality, scheduling, and assignment of daily activities. We further devise challenging compositional action recognition and action/task anticipation benchmarks with baseline models to measure the capability for compositional action understanding and temporal reasoning. We hope this effort inspires the vision community to look into goal-directed human activities and further study the task scheduling and assignment in real-world scenarios."



Paperid:1194
Authors:Matthew Purri, Kristin Dana
Title: Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces From Images
Abstract:
The connection between visual input and tactile sensing is critical for object manipulation tasks such as grasping, pushing, and maneuvering. In this work, we introduce the challenging task of estimating a set of tactile physical properties from visual information. We aim to build a model that learns the complex mapping between visual information and tactile physical properties. We construct a first of its kind image-tactile dataset with over 400 multiview image sequences and the corresponding tactile properties. A total of 15 tactile physical properties across categories including friction, compliance, adhesion, texture, and thermal conductance are measured and then estimated by our models. We develop a cross-modal framework comprised of an adversarial objective and a novel visuo-tactile cluster classification loss. Additionally, we develop a neural architecture search framework capable of learning to select optimal combinations of viewing angles for estimating a given physical property."



Paperid:1195
Authors:José Pedro Iglesias, Carl Olsson, Marcus Valtonen Örnhag
Title: Accurate Optimization of Weighted Nuclear Norm for Non-Rigid Structure from Motion
Abstract:
Fitting a matrix of a given rank to data in a least squares sense can be done very effectively using 2nd order methods such as Levenberg-Marquardt by explicitly optimizing over a bilinear parameterization of the matrix. In contrast, when applying more general singular value penalties, such as weighted nuclear norm priors, direct optimization over the elements of the matrix is typically used. Due to non-differentiability of the resulting objective function, first order sub-gradient or splitting methods are predominantly used. While these offer rapid iterations, it is well known that they become inefficient near the minimum due to zig-zagging, and in practice one is therefore often forced to settle for an approximate solution. In this paper we show that more accurate results can in many cases be achieved with 2nd order methods. Our main result shows how to construct bilinear formulations, for a general class of regularizers including weighted nuclear norm penalties, that are provably equivalent to the original problems. With these formulations the regularizing function becomes twice differentiable and 2nd order methods can be applied. We show experimentally, on a number of structure from motion problems, that our approach outperforms state-of-the-art methods."
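For context, the classical bilinear bound that underlies this line of work, stated here only for the unweighted nuclear norm (the paper's contribution is extending such equivalences to weighted and more general singular value penalties), reads
\[
\|X\|_* \;=\; \sum_i \sigma_i(X) \;=\; \min_{B,C:\;X=BC^{\top}} \tfrac{1}{2}\bigl(\|B\|_F^2 + \|C\|_F^2\bigr),
\]
so an objective of the form $\|\mathcal{A}(X)-b\|_2^2 + \mu\|X\|_*$ can be optimized over the smooth, twice-differentiable surrogate $\|\mathcal{A}(BC^{\top})-b\|_2^2 + \tfrac{\mu}{2}\bigl(\|B\|_F^2 + \|C\|_F^2\bigr)$ with second-order solvers such as Levenberg-Marquardt.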



Paperid:1196
Authors:Yuan-Ting Hu, Heng Wang, Nicolas Ballas, Kristen Grauman, Alexander G. Schwing
Title: Proposal-based Video Completion
Abstract:
Video inpainting is an important technique for a wide variety of applications from video content editing to video restoration. Early approaches follow image inpainting paradigms, but are challenged by complex camera motion and non-rigid deformations. To address these challenges flow-guided propagation techniques have been proposed. However, computation of flow is non-trivial for unobserved regions and propagation across a whole video sequence is computationally demanding. In contrast, in this paper, we propose a video inpainting algorithm based on proposals: we use 3D convolutions to obtain an initial inpainting estimate which is subsequently refined by fusing a generated set of proposals. Different from existing approaches for video inpainting, and inspired by well-explored mechanisms for object detection, we argue that proposals provide a rich source of information that permits to combine similarly looking patches that may be spatially and temporally far from the region to be inpainted. We validate the effectiveness of our method on the challenging YouTube VOS and DAVIS datasets using different settings and demonstrate results outperforming state-of-the-art on standard metrics."



Paperid:1197
Authors:Haifeng Xia, Zhengming Ding
Title: HGNet: Hybrid Generative Network for Zero-shot Domain Adaptation
Abstract:
Domain Adaptation as an important tool aims to explore a generalized model trained on well-annotated source knowledge to address learning issue on target domain with insufficient or even no annotation. Current approaches typically incorporate data from source and target domains for training stage to deal with domain shift. However, most domain adaptation tasks generally suffer from the problem that measuring the domain shift tends to be impossible when target data is inaccessible. In this paper, we propose a novel algorithm, Hybrid Generative Network (HGNet) for Zero-shot Domain Adaptation, which embeds an adaptive feature separation (AFS) module into generative architecture. Specifically, AFS module can adaptively distinguish classification-relevant features from classification-irrelevant ones to learn domain-invariant and discriminative representations when task-relevant target instances are invisible. To learn high-quality feature representation, we also develop hybrid generative strategy to ensure the uniqueness of feature separation and completeness of semantic information. Extensive experimental results on several benchmarks illustrate that our method achieves more promising results than state-of-the-art approaches."



Paperid:1198
Authors:Kaihao Zhang, Wenhan Luo, Wenqi Ren, Jingwen Wang, Fang Zhao, Lin Ma, Hongdong Li
Title: Beyond Monocular Deraining: Stereo Image Deraining via Semantic Understanding
Abstract:
Rain is a common natural phenomenon. Taking images in the rain, however, often results in degraded image quality and thus compromises the performance of many computer vision systems. Most existing de-rain algorithms use only one single input image and aim to recover a clean image. Few works have exploited stereo images. Moreover, even for single-image-based monocular deraining, many current methods fail to complete the task satisfactorily because they mostly rely on per-pixel loss functions and ignore semantic information. In this paper, we present a Paired Rain Removal Network (PRRNet), which exploits both stereo images and semantic information. Specifically, we develop a Semantic-Aware Deraining Module (SADM) which solves both tasks of semantic segmentation and deraining of scenes, and a Semantic-Fusion Network (SFNet) and a View-Fusion Network (VFNet) which fuse semantic information and multi-view information, respectively. We also propose a new stereo-based rainy dataset for benchmarking. Experiments on both monocular and the newly proposed stereo rainy datasets demonstrate that the proposed method achieves state-of-the-art performance."



Paperid:1199
Authors:Hassan Dbouk, Hetul Sanghvi, Mahesh Mehendale, Naresh Shanbhag
Title: DBQ: A Differentiable Branch Quantizer for Lightweight Deep Neural Networks
Abstract:
Deep neural networks have achieved state-of-the-art performance on various computer vision tasks. However, their deployment on resource-constrained devices has been hindered due to their high computational and storage complexity. While various complexity reduction techniques, such as lightweight network architecture design and parameter quantization, have been successful in reducing the cost of implementing these networks, these methods have often been considered orthogonal. In reality, existing quantization techniques fail to replicate their success on lightweight architectures such as MobileNet. To this end, we present a novel fully differentiable non-uniform quantizer that can be seamlessly mapped onto efficient ternary-based dot product engines. We conduct comprehensive experiments on CIFAR-10, ImageNet, and Visual Wake Words datasets. The proposed quantizer (DBQ) successfully tackles the daunting task of aggressively quantizing lightweight networks such as MobileNetV1, MobileNetV2, and ShuffleNetV2. DBQ achieves state-of-the-art results with minimal training overhead and provides the best (Pareto-optimal) accuracy-complexity trade-off."
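As a rough illustration of what a differentiable ternary mapping can look like (this is a generic soft-threshold construction, not DBQ's branch quantizer; the sigmoid relaxation, learnable scale and threshold are assumptions):

import torch

def soft_ternary(w, alpha, delta, temperature=10.0):
    # Smoothly maps weights towards {-alpha, 0, +alpha}: gate_pos ~ 1 where w >> delta,
    # gate_neg ~ 1 where w << -delta, so gradients still flow to w, alpha and delta.
    gate_pos = torch.sigmoid(temperature * (w - delta))
    gate_neg = torch.sigmoid(temperature * (-w - delta))
    return alpha * (gate_pos - gate_neg)

w = torch.randn(256, requires_grad=True)
alpha = torch.tensor(0.1, requires_grad=True)   # learnable scale
delta = torch.tensor(0.05, requires_grad=True)  # learnable threshold

q = soft_ternary(w, alpha, delta)
loss = (q ** 2).sum()          # stand-in for a task loss
loss.backward()                # all quantizer parameters receive gradients
print(w.grad.shape, alpha.grad, delta.grad)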



Paperid:1200
Authors:Zhixiang Chi, Rasoul Mohammadi Nasiri, Zheng Liu, Juwei Lu, Jin Tang , Konstantinos N Plataniotis
Title: All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced Motion Modeling
Abstract:
Recent advances in high refresh rate displays, as well as the increased interest in high-rate slow motion and frame up-conversion, fuel the demand for efficient and cost-effective multi-frame video interpolation solutions. In that regard, inserting multiple frames between consecutive video frames is of paramount importance for the consumer electronics industry. State-of-the-art methods are iterative solutions interpolating one frame at a time. They introduce temporal inconsistencies and clearly noticeable visual artifacts. Departing from the state-of-the-art, this work introduces a true multi-frame interpolator. It utilizes a pyramidal style network in the temporal domain to complete the multi-frame interpolation task in one shot. A novel flow estimation procedure using a relaxed loss function and an advanced cubic motion model are also used to further boost interpolation accuracy when complex motion segments are encountered. Results on the Adobe240 dataset show that the proposed method generates visually pleasing, temporally consistent frames and outperforms the current best off-the-shelf method by 1.57 dB in PSNR with an 8 times smaller model while being 7.7 times faster. The proposed method can be easily extended to interpolate a large number of new frames while remaining efficient because of the one-shot mechanism."
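As an illustration of the kind of higher-order motion model referred to above (notation ours, not the paper's), the displacement of a pixel at an intermediate time $t\in[0,1]$ under a cubic model is
\[
\mathbf{x}(t) \;=\; \mathbf{x}(0) \;+\; \mathbf{v}\,t \;+\; \tfrac{1}{2}\,\mathbf{a}\,t^{2} \;+\; \tfrac{1}{6}\,\mathbf{j}\,t^{3},
\]
where velocity $\mathbf{v}$, acceleration $\mathbf{a}$ and jerk $\mathbf{j}$ would be fitted per pixel from the flows between the input frames, so that flows to all intermediate time instants can be evaluated in a single pass.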



Paperid:1201
Authors:Yunhui Guo, Noel C. Codella, Leonid Karlinsky, James V. Codella, John R. Smith, Kate Saenko, Tajana Rosing, Rogerio Feris
Title: A Broader Study of Cross-Domain Few-Shot Learning
Abstract:
Recent progress on few-shot learning largely relies on annotated data for meta-learning: base classes sampled from the same domain as the novel classes. However, in many applications, collecting data for meta-learning is infeasible or impossible. This leads to the cross-domain few-shot learning problem, where there is a large shift between base and novel class domains. While investigations of the cross-domain few-shot scenario exist, these works are limited to natural images that still contain a high degree of visual similarity. No work yet exists that examines few-shot learning across different imaging methods seen in real world scenarios, such as aerial and medical imaging. In this paper, we propose the Broader Study of Cross-Domain Few-Shot Learning (BSCD-FSL) benchmark, consisting of image data from a diverse assortment of image acquisition methods. This includes natural images, such as crop disease images, but additionally those that present with an increasing dissimilarity to natural images, such as satellite images, dermatology images, and radiology images. Extensive experiments on the proposed benchmark are performed to evaluate state-of-the-art meta-learning approaches, transfer learning approaches, and newer methods for cross-domain few-shot learning. The results demonstrate that state-of-the-art meta-learning methods are surprisingly outperformed by earlier meta-learning approaches, and all meta-learning methods underperform in relation to simple fine-tuning by 12.8% average accuracy. In some cases, meta-learning even underperforms networks with random weights. Performance gains previously observed with methods specialized for cross-domain few-shot learning vanish in this more challenging benchmark. Finally, the accuracy of all methods tends to correlate with dataset similarity to natural images, verifying the value of the benchmark to better represent the diversity of data seen in practice and guiding future research. Code for all experiments in this work will be made available on GitHub."



Paperid:1202
Authors:Junfeng Guo, Cong Liu
Title: Practical Poisoning Attacks on Neural Networks
Abstract:
Data poisoning attacks on machine learning models have attracted much recent attention, wherein poisoning samples are injected at the training phase to achieve adversarial goals at test time. Although existing poisoning techniques prove to be effective in various scenarios, they rely on certain assumptions on the adversary knowledge and capability to ensure efficacy, which may be unrealistic in practice. This paper presents a new, practical targeted poisoning attack method on neural networks in the vision domain, namely BlackCard. BlackCard possesses a set of critical properties for ensuring attacking efficacy in practice, which has never been simultaneously achieved by any existing work, including being knowledge-oblivious, clean-label, and clean-test. Importantly, we show that the effectiveness of BlackCard can be intuitively guaranteed by a set of analytical reasoning and observations, through exploiting an essential characteristic of gradient-descent optimization which is pervasively adopted in DNN models. We evaluate the efficacy of BlackCard for generating targeted poisoning attacks via extensive experiments using various datasets and DNN models. Results show that BlackCard is effective with a rather high success rate while preserving all the claimed properties."



Paperid:1203
Authors:Djebril Mekhazni, Amran Bhuiyan, George Ekladious, Eric Granger
Title: Unsupervised Domain Adaptation in the Dissimilarity Space for Person Re-identification
Abstract:
Person re-identification (ReID) remains a challenging task in many real-world video analytics and surveillance applications, even though state-of-the-art accuracy has improved considerably with the advent of deep learning (DL) models trained on large image datasets. Given the shift in distributions that typically occurs between video data captured from the source and target domains, and the absence of labeled data from the target domain, it is difficult to adapt a DL model for accurate recognition of target data. DL models for unsupervised domain adaptation (UDA) are commonly designed in the feature representation space. We argue that for pair-wise matchers that rely on metric learning, e.g., Siamese networks for person ReID, the UDA objective should consist in aligning pair-wise dissimilarity between domains, rather than aligning feature representations. Moreover, dissimilarity representations are more suitable for designing open-set ReID systems, where identities differ in the source and target domains. In this paper, we propose a novel Dissimilarity-based Maximum Mean Discrepancy (D-MMD) loss for aligning pair-wise distances that can be optimized via gradient descent using relatively small batch sizes. From a person ReID perspective, the evaluation of the D-MMD loss is straightforward since the tracklet information (provided by a person tracker) allows labeling a distance vector as being either within-class (within-tracklet) or between-class (between-tracklet). This allows approximating the underlying distribution of target pair-wise distances for D-MMD loss optimization, and accordingly aligning source and target distance distributions. Empirical results with three challenging benchmark datasets show that the proposed D-MMD loss decreases as source and target distributions become more similar. Extensive experimental evaluation also indicates that UDA methods that rely on the D-MMD loss can significantly outperform baseline and state-of-the-art UDA methods for person ReID. The dissimilarity space transformation allows designing reliable pair-wise matchers, without the common requirement for data augmentation and/or complex networks. Code is available on GitHub link: https://github.com/djidje/D-MMD"
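A minimal sketch of an MMD computed in the dissimilarity space is given below, assuming precomputed source and target embeddings; the Gaussian kernel, bandwidth and the use of all pairwise distances (rather than the within-/between-tracklet split described above) are simplifications, not the exact D-MMD recipe.

import torch

def gaussian_mmd(x, y, bandwidth=1.0):
    # Squared MMD between two 1-D samples (here: distributions of pairwise distances).
    def k(a, b):
        return torch.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

src = torch.randn(32, 128, requires_grad=True)   # stand-in for source-batch embeddings
tgt = torch.randn(32, 128, requires_grad=True)   # stand-in for target-batch embeddings

d_src = torch.cdist(src, src).flatten()          # in practice, split into within-/between-class
d_tgt = torch.cdist(tgt, tgt).flatten()          # distances using labels / tracklet ids
loss_dmmd = gaussian_mmd(d_src, d_tgt)
loss_dmmd.backward()                             # gradients flow back to the embeddings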



Paperid:1204
Authors:Hui Qu, Yikai Zhang, Qi Chang, Zhennan Yan, Chao Chen, Dimitris Metaxas
Title: Learn distributed GAN with Temporary Discriminators
Abstract:
In this work, we propose a method for training distributed GAN with sequential temporary discriminators. Our proposed method tackles the challenge of training GAN in the federated learning manner: How to update the generator with a flow of temporary discriminators? We apply our proposed method to learn a self-adaptive generator with a series of local discriminators from multiple data centers. We show our design of loss function indeed learns the correct distribution with provable guarantees. The empirical experiments show that our approach is capable of generating synthetic data which is practical for real-world applications such as training a segmentation model. Our TDGAN Code is available at: https://github.com/huiqu18/TDGAN-PyTorch."



Paperid:1205
Authors:Leo F Isikdogan, Bhavin V Nayak, Chyuan-Tyng Wu, Joao Peralta Moreira, Sushma Rao, Gilad Michael
Title: SemifreddoNets: Partially Frozen Neural Networks for Efficient Computer Vision Systems
Abstract:
We propose a system comprised of fixed-topology neural networks having partially frozen weights, named SemifreddoNets. SemifreddoNets work as fully-pipelined hardware blocks that are optimized to have an efficient hardware implementation. Those blocks freeze a certain portion of the parameters at every layer and replace the corresponding multipliers with fixed scalers. Fixing the weights reduces the silicon area, logic delay, and memory requirements, leading to significant savings in cost and power consumption. Unlike traditional layer-wise freezing approaches, SemifreddoNets make a profitable trade between the cost and flexibility by having some of the weights configurable at different scales and levels of abstraction in the model. Although fixing the topology and some of the weights somewhat limits the flexibility, we argue that the efficiency benefits of this strategy outweigh the advantages of a fully configurable model for many use cases. Furthermore, our system uses repeatable blocks, therefore it has the flexibility to adjust model complexity without requiring any hardware change. The hardware implementation of SemifreddoNets provides up to an order of magnitude reduction in silicon area and power consumption as compared to their equivalent implementation on a general-purpose accelerator."
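A toy PyTorch analogue of partially frozen layers is sketched below; the 50/50 split, layer sizes and the idea of freezing whole filters (rather than the paper's per-layer weight portions and fixed scalers) are assumptions for illustration only.

import torch
import torch.nn as nn

class PartiallyFrozenConv(nn.Module):
    def __init__(self, in_ch, out_ch, frozen_ratio=0.5):
        super().__init__()
        n_frozen = int(out_ch * frozen_ratio)
        self.frozen = nn.Conv2d(in_ch, n_frozen, 3, padding=1)
        self.trainable = nn.Conv2d(in_ch, out_ch - n_frozen, 3, padding=1)
        for p in self.frozen.parameters():
            p.requires_grad = False      # these weights could be hard-wired in silicon

    def forward(self, x):
        # Frozen and trainable filter banks run in parallel; their outputs are concatenated.
        return torch.cat([self.frozen(x), self.trainable(x)], dim=1)

layer = PartiallyFrozenConv(16, 32)
out = layer(torch.randn(1, 16, 8, 8))    # (1, 32, 8, 8)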



Paperid:1206
Authors:Anh Bui, Trung Le, He Zhao, Paul Montague, Olivier deVel, Tamas Abraham, Dinh Phung
Title: Improving Adversarial Robustness by Enforcing Local and Global Compactness
Abstract:
The fact that deep neural networks are susceptible to crafted perturbations severely impacts the use of deep learning in certain domains of application. Among many developed defense models against such attacks, adversarial training emerges as the most successful method that consistently resists a wide range of attacks. In this work, based on an observation from a previous study that the representations of a clean data example and its adversarial examples become more divergent in higher layers of a deep neural net, we propose the Adversary Divergence Reduction Network which enforces local/global compactness and the clustering assumption over an intermediate layer of a deep neural network. We conduct comprehensive experiments to understand the isolating behavior of each component (i.e., local/global compactness and the clustering assumption) and compare our proposed model with state-of-the-art adversarial training methods. The experimental results demonstrate that augmenting adversarial training with our proposed components can further improve the robustness of the network, leading to higher unperturbed and adversarial predictive performances."



Paperid:1207
Authors:Subeesh Vasu, Mateusz Kozinski, Leonardo Citraro, and Pascal Fua
Title: TopoAL: An Adversarial Learning Approach for Topology-Aware Road Segmentation
Abstract:
Most state-of-the-art approaches to road extraction from aerial images rely on a CNN trained to label road pixels as foreground and remainder of the image as background. The CNN is usually trained by minimizing pixel-wise losses, which is less than ideal to produce binary masks that preserve the road network's global connectivity. To address this issue, we introduce an Adversarial Learning (AL) strategy tailored for our purposes. A naive one would treat the segmentation network as a generator and would feed its output along with ground-truth segmentations to a discriminator. It would then train the generator and discriminator jointly. We will show that this is not enough because it does not capture the fact that most errors are local and need to be treated as such. Instead, we use a more sophisticated discriminator that returns a label pyramid describing what portions of the road network are correct at several different scales. This discriminator and the structured labels it returns are what gives our approach its edge and we will show that it outperforms state-of-the-art ones on the challenging RoadTracer dataset."



Paperid:1208
Authors:Charles Herrmann, Richard Strong Bowen, Ramin Zabih
Title: Channel selection using Gumbel Softmax
Abstract:
Important applications such as mobile computing require reducing the computational costs of neural network inference. Ideally, applications would specify their preferred tradeoff between accuracy and speed, and the network would optimize this end-to-end, using classification error to remove parts of the network. Increasing speed can be done either during training – e.g., pruning filters – or during inference – e.g., conditionally executing a subset of the layers. We propose a single end-to-end framework that can improve inference efficiency in both settings. We use a combination of batch activation loss and classification loss, and Gumbel reparameterization to learn network structure. We train end-to-end, and the same technique supports pruning as well as conditional computation. We obtain promising experimental results for ImageNet classification with ResNet (45-52% less computation)."
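The snippet below sketches the basic ingredient of Gumbel-based channel gating, assuming one keep/drop logit pair per channel and a simple sparsity penalty; the batch activation loss and the full training setup from the paper are not reproduced.

import torch
import torch.nn.functional as F

channels = 64
logits = torch.zeros(channels, 2, requires_grad=True)        # [keep, drop] logit per channel

x = torch.randn(8, channels, 16, 16)                          # a feature map
gates = F.gumbel_softmax(logits, tau=1.0, hard=True)[:, 0]    # 0/1 gates, differentiable via straight-through
y = x * gates.view(1, -1, 1, 1)                               # zero out dropped channels

sparsity = gates.mean()                                       # can be penalized to meet a compute budget
task_loss = y.pow(2).mean()                                   # stand-in for the classification loss
(task_loss + 0.1 * sparsity).backward()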



Paperid:1209
Authors:Dripta S. Raychaudhuri, Amit K. Roy-Chowdhury
Title: Exploiting Temporal Coherence for Self-Supervised One-shot Video Re-identification
Abstract:
While supervised techniques in re-identification are extremely effective, the need for large amounts of annotations makes them impractical for large camera networks. One-shot re-identification, which uses a singular labeled tracklet for each identity along with a pool of unlabeled tracklets, is a potential candidate towards reducing this labeling effort. Current one-shot re-identification methods function by modeling the inter-relationships amongst the labeled and the unlabeled data, but fail to fully exploit such relationships that exist within the pool of unlabeled data itself. In this paper, we propose a new framework named Temporal Consistency Progressive Learning, which uses temporal coherence as a novel self-supervised auxiliary task in the one-shot learning paradigm to capture such relationships amongst the unlabeled tracklets. Optimizing two new losses, which enforce consistency on a local and global scale, our framework can learn richer and more discriminative representations. Extensive experiments on two challenging video re-identification datasets - MARS and DukeMTMC-VideoReID - demonstrate that our proposed method is able to estimate the true labels of the unlabeled data more accurately by up to $8\%$, and obtain significantly better re-identification performance compared to the existing state-of-the-art techniques."



Paperid:1210
Authors:Zixuan Jiang, Keren Zhu, Mingjie Liu, Jiaqi Gu, David Z. Pan
Title: An Efficient Training Framework for Reversible Neural Architectures
Abstract:
As machine learning models and datasets rapidly escalate in scale, the huge memory footprint impedes efficient training. Reversible operators can reduce memory consumption by discarding intermediate feature maps in forward computations and recovering them via their inverse functions in the backward propagation. They save memory at the cost of computation overhead. However, current implementations of reversible layers mainly focus on saving memory usage, with computation overhead neglected. In this work, we formulate the decision problem for reversible operators with training time as the objective function and memory usage as the constraint. By solving this problem, we can maximize the training throughput for reversible neural architectures. Our proposed framework fully automates this decision process, empowering researchers to develop and train reversible neural networks more efficiently."



Paperid:1211
Authors:Viveka Kulharia, Siddhartha Chandra, Amit Agrawal, Philip Torr, Ambrish Tyagi
Title: Box2Seg: Attention Weighted Loss and Discriminative Feature Learning for Weakly Supervised Segmentation
Abstract:
We propose a weakly supervised approach to semantic segmentation using bounding box annotations. Bounding boxes are treated as noisy labels for the foreground objects. We predict a per-class attention map that saliently guides the per-pixel cross entropy loss to focus on foreground pixels and refines the segmentation boundaries. This avoids propagating erroneous gradients due to incorrect foreground labels on the background. Additionally, we learn pixel embeddings to simultaneously optimize for high intra-class feature affinity while increasing discrimination between features across different classes. Our method, Box2Seg, achieves state-of-the-art segmentation accuracy on PASCAL VOC 2012 by significantly improving the mIOU metric by 2.1% compared to previous weakly supervised approaches. Our weakly supervised approach is comparable to the recent fully supervised methods when fine-tuned with limited amount of pixel-level annotations. Qualitative results and ablation studies show the benefit of different loss terms on the overall performance."
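A minimal sketch of an attention-weighted per-pixel cross-entropy is shown below; the shapes, the way the attention map is produced, and the absence of the pixel-embedding discrimination term are simplifying assumptions rather than the Box2Seg formulation.

import torch
import torch.nn.functional as F

B, C, H, W = 2, 21, 64, 64
logits = torch.randn(B, C, H, W, requires_grad=True)       # segmentation logits
attn = torch.rand(B, C, H, W, requires_grad=True)          # predicted per-class attention in [0, 1]
box_labels = torch.randint(0, C, (B, H, W))                # noisy per-pixel labels derived from boxes

ce = F.cross_entropy(logits, box_labels, reduction='none')             # (B, H, W) per-pixel loss
pixel_attn = attn.gather(1, box_labels.unsqueeze(1)).squeeze(1)        # attention of the labeled class
loss = (pixel_attn * ce).mean()                                        # down-weights likely-background pixels
loss.backward()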



Paperid:1212
Authors:Yicheng Wu, Vivek Boominathan, Xuan Zhao, Jacob T. Robinson, Hiroshi Kawasaki, Aswin Sankaranarayanan, Ashok Veeraraghavan
Title: FreeCam3D: Snapshot Structured Light 3D with Freely-Moving Cameras
Abstract:
A 3D imaging and mapping system that can handle both multiple-viewers and dynamic-objects is attractive for many applications. We propose a freeform structured light system that does not rigidly constrain camera(s) to the projector. By introducing an optimized phase-coded aperture in the projector, we transform the projector pattern to encode depth in its defocus robustly; this allows a camera to estimate depth, in projector coordinates, using local information. Additionally, we project a Kronecker-multiplexed pattern that provides global context to establish correspondence between camera and projector pixels. Together with aperture coding and projected pattern, the projector offers a unique 3D labeling for every location of the scene. The projected pattern can be observed in part or full by any camera, to reconstruct both the 3D map of the scene and the camera pose in the projector coordinates. This system is optimized using a fully differentiable rendering model and a CNN-based reconstruction. We build a prototype and demonstrate high-quality 3D reconstruction with an unconstrained camera, for both dynamic scenes and multi-camera systems."



Paperid:1213
Authors:Shanjiaoyang Huang, Weiqi Peng, Zhiwei Jia, Zhuowen Tu
Title: One-Pixel Signature: Characterizing CNN Models for Backdoor Detection
Abstract:
We tackle the convolutional neural network (CNN) backdoor detection problem by proposing a new representation called one-pixel signature. Our task is to detect/classify whether a CNN model has been maliciously inserted with an unknown Trojan trigger or not. Here, each CNN model is associated with a signature that is created by generating, pixel-by-pixel, an adversarial value that is the result of the largest change to the class prediction. The one-pixel signature is agnostic to the design choice of CNN architectures and how they were trained. It can be computed efficiently for a black-box CNN model without accessing the network parameters. Our proposed one-pixel signature demonstrates a substantial improvement (by around $30\%$ in the absolute detection accuracy) over the existing competing methods for backdoored CNN detection/classification. One-pixel signature is a general representation that can be used to characterize CNN models beyond backdoor detection."
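The following toy probe illustrates the flavor of a pixel-wise, black-box signature (recording the magnitude of the largest output change per pixel rather than the exact adversarial value used in the paper); the image size, value grid and the stand-in CNN are arbitrary choices.

import torch

def one_pixel_signature(model, size=16, values=(0.0, 0.25, 0.5, 0.75, 1.0)):
    base = torch.zeros(1, 1, size, size)
    with torch.no_grad():
        base_pred = model(base)
        sig = torch.zeros(size, size)
        for i in range(size):
            for j in range(size):
                best = 0.0
                for v in values:
                    probe = base.clone()
                    probe[0, 0, i, j] = v                      # perturb a single pixel
                    change = (model(probe) - base_pred).abs().max().item()
                    best = max(best, change)
                sig[i, j] = best                               # largest change caused by this pixel
    return sig                                                 # a per-pixel map characterizing the model

# Example with a toy CNN standing in for the (black-box) model under inspection.
model = torch.nn.Sequential(torch.nn.Conv2d(1, 4, 3, padding=1),
                            torch.nn.ReLU(),
                            torch.nn.AdaptiveAvgPool2d(1),
                            torch.nn.Flatten(),
                            torch.nn.Linear(4, 10))
signature = one_pixel_signature(model)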



Paperid:1214
Authors:Linchao Zhu, Sercan Ö. Arık, Yi Yang, Tomas Pfister
Title: Learning to Transfer Learn: Reinforcement Learning-Based Selection for Adaptive Transfer Learning
Abstract:
We propose a novel adaptive transfer learning framework, learning to transfer learn (L2TL), to improve performance on a target dataset by careful extraction of the related information from a source dataset. Our framework considers cooperative optimization of shared weights between models for source and target tasks, and adjusts the constituent loss weights adaptively. The adaptation of the weights is based on a reinforcement learning (RL) selection policy, guided with a performance metric on the target validation set. We demonstrate that given fixed models, L2TL outperforms fine-tuning baselines and other adaptive transfer learning methods on eight datasets. In the regimes of small-scale target datasets and significant label mismatch between source and target datasets, L2TL shows particularly large benefits."



Paperid:1215
Authors:Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao
Title: Structure-Aware Generation Network for Recipe Generation from Images
Abstract:
Sharing food has become very popular with the development of social media. For many real-world applications, people are keen to know the underlying recipes of a food item. In this paper, we are interested in automatically generating cooking instructions for food. We investigate an open research task of generating cooking instructions based on only food images and ingredients, which is similar to the image captioning task. However, compared with image captioning datasets, the target recipes are long-length paragraphs and do not have annotations on structure information. To address the above limitations, we propose a novel framework of Structure-aware Generation Network (SGN) to tackle the food recipe generation task. Our approach brings together several novel ideas in a systematic framework: (1) exploiting an unsupervised learning approach to obtain the sentence-level tree structure labels before training; (2) generating trees of target recipes from images with the supervision of tree structure labels learned from (1); and (3) integrating the inferred tree structures with the recipe generation procedure. Our proposed model can produce high-quality and coherent recipes, and achieve the state-of-the-art performance on the benchmark Recipe1M dataset."



Paperid:1216
Authors:Qi Qi, Yan Yan, Zixuan Wu, Xiaoyu Wang, Tianbao Yang
Title: A Simple and Effective Framework for Pairwise Deep Metric Learning
Abstract:
Deep metric learning (DML) has received much attention in deep learning due to its wide applications in computer vision. Previous studies have focused on designing complicated losses and hard example mining methods, which are mostly heuristic and lack theoretical understanding. In this paper, we cast DML as a simple pairwise binary classification problem that classifies a pair of examples as similar or dissimilar. This formulation identifies the most critical issue in this problem---imbalanced data pairs. To tackle this issue, we propose a simple and effective framework to sample pairs in a batch of data for updating the model. The key to this framework is to define a robust loss for all pairs over a mini-batch of data, which is formulated by distributionally robust optimization. The flexibility in constructing the {\it uncertainty decision set} of the dual variable allows us to recover state-of-the-art complicated losses and also to induce novel variants. Empirical studies on several benchmark data sets demonstrate that our simple and effective method outperforms the state-of-the-art results."
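A compact sketch of the pairwise view with a robust reweighting of pair losses follows; the contrastive-style pair loss, softmax weighting and temperature are illustrative stand-ins for the distributionally robust formulation and uncertainty sets described in the paper.

import torch
import torch.nn.functional as F

def pairwise_dro_loss(embeddings, labels, margin=0.5, temperature=0.1):
    d = torch.cdist(embeddings, embeddings)                    # pairwise distances
    same = labels[:, None].eq(labels[None, :]).float()         # 1 for similar pairs, 0 otherwise
    mask = 1.0 - torch.eye(len(labels))                        # drop self-pairs
    # Binary-classification-style pair losses: pull similar pairs, push dissimilar ones apart.
    pair_loss = (same * d + (1.0 - same) * torch.clamp(margin - d, min=0.0)) * mask
    # Robust weighting: harder pairs receive exponentially larger weights.
    w = torch.softmax(pair_loss.flatten() / temperature, dim=0)
    return (w * pair_loss.flatten()).sum()

feats = torch.randn(16, 64, requires_grad=True)                # stand-in for network outputs
loss = pairwise_dro_loss(F.normalize(feats, dim=1), torch.randint(0, 4, (16,)))
loss.backward()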



Paperid:1217
Authors:Eugene Lee, Evan Chen, Chen-Yi Lee
Title: Meta-rPPG: Remote Heart Rate Estimation Using a Transductive Meta-Learner
Abstract:
Remote heart rate estimation is the measurement of heart rate without any physical contact with the subject and is accomplished using remote photoplethysmography (rPPG) in this work. rPPG signals are usually collected using a video camera with a limitation of being sensitive to multiple contributing factors, e.g. variation in skin tone, lighting condition and facial structure. End-to-end supervised learning approach performs well when training data is abundant, covering a distribution that doesn't deviate too much from the distribution of testing data or during deployment. To cope with the unforeseeable distributional changes during deployment, we propose a transductive meta-learner that takes unlabeled samples during testing (deployment) for a self-supervised weight adjustment (also known as transductive inference), providing fast adaptation to the distributional changes. Using this approach, we achieve state-of-the-art performance on MAHNOB-HCI and UBFC-rPPG."



Paperid:1218
Authors:Kara Marie Schatz, Erik Quintanilla, Shruti Vyas, Yogesh S Rawat
Title: A Recurrent Transformer Network for Novel View Action Synthesis
Abstract:
In this work, we address the problem of synthesizing human actions from novel views. Given an input video of an actor performing some action, we aim to synthesize a video with the same action performed from a novel view with the help of an appearance prior. We propose an end-to-end deep network to solve this problem. The proposed network utilizes the change in viewpoint to transform the action from the input view to the novel view in feature space. The transformed action is integrated with the target appearance using the proposed recurrent transformer network, which provides a transformed appearance for each time-step in the action sequence. The recurrent transformer network utilizes action key-points, which are determined in an unsupervised manner using the encoded action features. We also propose a hierarchical structure for the recurrent transformation which further improves the performance. We demonstrate the effectiveness of the proposed method through extensive experiments conducted on the large-scale multi-view action recognition dataset NTU-RGB+D. In addition, we show that the proposed method can transform the action to a novel viewpoint with an entirely different scene or actor. The code is publicly available at https://github.com/schatzkara/cross-view-video."



Paperid:1219
Authors:Shruti Vyas, Yogesh S Rawat, Mubarak Shah
Title: Multi-view Action Recognition using Cross-view Video Prediction
Abstract:
In this work, we address the problem of action recognition in a multi-view environment. Most of the existing approaches utilize pose information for multi-view action recognition. We focus on RGB modality instead and propose an unsupervised representation learning framework, which encodes the scene dynamics in videos captured from multiple viewpoints via predicting actions from unseen views. The framework takes multiple short video clips from different viewpoints and time as input and learns a global internal representation which is used to predict a video clip from an unseen viewpoint and time. The ability of the proposed network to render unseen video frames enables it to learn a meaningful and robust representation of the scene dynamics. We evaluate the effectiveness of the learned representation for multi-view video action recognition in a supervised approach. We observe a significant improvement in the performance with RGB modality on NTU-RGB+D dataset, which is the largest dataset for multi-view action recognition. The proposed framework also achieves state-of-the-art results with depth modality, which validates the generalization capability of the approach to other data modalities. The code is publicly available at https://github.com/svyas23/cross-view-action."



Paperid:1220
Authors:Mingmin Zhen, Shiwei Li, Lei Zhou, Jiaxiang Shang, Haoan Feng, Tian Fang, Long Quan
Title: Learning Discriminative Feature with CRF for Unsupervised Video Object Segmentation
Abstract:
In this paper, we introduce a novel network, called discriminative feature network (DFNet), to address the unsupervised video object segmentation task. To capture the inherent correlation among video frames, we learn K discriminative features (D-features) from the input image and reference images that reveal feature distribution from a global perspective. The D-features are then used to establish correspondence with all features of input image under conditional random field (CRF) formulation, which is leveraged to boost consistency between pixels. The experiments verify that DFNet outperforms state-of-the-art methods by a large margin with a mean IoU score of 83.4\% and ranks first on the DAVIS-2016 leaderboard while using much fewer parameters and achieving much faster speed during inference phase. We further evaluate DFNet on the FBMS dataset and the video saliency dataset ViSal, reaching a new state-of-the-art. To further demonstrate the generalizability of our framework, DFNet is also applied to the image object co-segmentation task. We perform experiments on a challenging dataset PASCAL-VOC and observe the superiority of DFNet. The thorough experiments verify that DFNet is able to capture and mine the underlying relations of images and discover the common foreground objects. "



Paperid:1221
Authors:Sriram N N, Buyu Liu, Francesco Pittaluga, Manmohan Chandraker
Title: SMART: Simultaneous Multi-Agent Recurrent Trajectory Prediction
Abstract:
We propose advances that address two key challenges in future trajectory prediction: (i) multimodality in both training data and predictions and (ii) constant time inference regardless of number of agents. Existing trajectory predictions are fundamentally limited by lack of diversity in training data, which is difficult to acquire with sufficient coverage of possible modes. Our first contribution is an automatic method to simulate diverse trajectories in the top-view. It uses pre-existing datasets and maps as initialization, mines existing trajectories to represent realistic driving behaviors and uses a multi-agent vehicle dynamics simulator to generate diverse new trajectories that cover various modes and are consistent with scene layout constraints. Our second contribution is a novel method that generates diverse predictions while accounting for scene semantics and multi-agent interactions, with constant-time inference independent of the number of agents. We propose a convLSTM with novel state pooling operations and losses to predict scene-consistent states of multiple agents in a single forward pass, along with a CVAE for diversity. We validate our proposed multi-agent trajectory prediction approach by training and testing on the proposed simulated dataset and existing real datasets of traffic scenes. In both cases, our approach outperforms SOTA methods by a large margin, highlighting the benefits of both our diverse dataset simulation and constant-time diverse trajectory prediction methods."



Paperid:1222
Authors:Jinyu Yang, Weizhi An, Sheng Wang, Xinliang Zhu, Chaochao Yan, Junzhou Huang
Title: Label-Driven Reconstruction for Domain Adaptation in Semantic Segmentation
Abstract:
Unsupervised domain adaptation enables to alleviate the need for pixel-wise annotation in the semantic segmentation. One of the most common strategies is to translate images from the source domain to the target domain and then align their marginal distributions in the feature space using adversarial learning. However, source-to-target translation enlarges the bias in translated images and introduces extra computations, owing to the dominant data size of the source domain. Furthermore, consistency of the joint distribution in source and target domains cannot be guaranteed through global feature alignment. Here, we present an innovative framework, designed to mitigate the image translation bias and align cross-domain features with the same category. This is achieved by 1) performing the target-to-source translation and 2) reconstructing both source and target images from their predicted labels. Extensive experiments on adapting from synthetic to real urban scene understanding demonstrate that our framework competes favorably against existing state-of-the-art methods."



Paperid:1223
Authors:Chi-Chong Wong, Chi-Man Vong
Title: Efficient Outdoor 3D Point Cloud Semantic Segmentation for Critical Road Objects and Distributed Contexts
Abstract:
Large-scale point cloud semantic understanding is an important problem in self-driving cars and autonomous robotics navigation. However, this problem involves many challenges, such as i) critical road objects (e.g., pedestrians, barriers) with diverse and varying input shapes; ii) distributed contextual information across a large spatial range; iii) efficient inference time. Failing to deal with such challenges may weaken the mission-critical performance of self-driving cars, e.g., LiDAR road object perception. In this work, we propose a novel neural network model called Attention-based Dynamic Convolution Network with Self-Attention Global Contexts (ADConvnet-SAGC), which i) applies an attention mechanism to adaptively focus on the most related neighboring points for learning the point features of 3D objects, especially for small objects with diverse shapes; ii) applies a self-attention module for efficiently capturing long-range distributed contexts from the input; iii) adopts a more reasonable and compact architecture for efficient inference. Extensive experiments on point cloud semantic segmentation validate the effectiveness of the proposed ADConvnet-SAGC model and show significant improvements over state-of-the-art methods."



Paperid:1224
Authors:Mayank Singh, Nupur Kumari, Puneet Mangla, Abhishek Sinha, Vineeth N Balasubramanian, Balaji Krishnamurthy
Title: Attributional Robustness Training using Input-Gradient Spatial Alignment
Abstract:
Interpretability is an emerging area of research in trustworthy machine learning. Safe deployment of a machine learning system mandates that the prediction and its explanation be reliable and robust. Recently, it has been shown that the explanations could be manipulated easily by adding visually imperceptible perturbations to the input while keeping the model's prediction intact. In this work, we study the problem of attributional robustness (i.e. models having robust explanations) by showing an upper bound for attributional vulnerability in terms of spatial correlation between the input image and its explanation map. We propose a training methodology that learns robust features by minimizing this upper bound using soft-margin triplet loss. Our methodology of robust attribution training (ART) achieves the new state-of-the-art attributional robustness measure by a margin of $\approx$ 6-18\% on several standard datasets, i.e., SVHN, CIFAR-10 and GTSRB. We further show the utility of the proposed robust training technique (ART) in the downstream task of weakly supervised object localization by achieving the new state-of-the-art performance on the CUB-200 dataset."
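Below is a rough sketch of measuring spatial alignment between an image and its input-gradient attribution, which is the quantity the upper bound is phrased in terms of; the linear model, correlation measure and the simple penalty are stand-ins for the paper's soft-margin triplet loss.

import torch
import torch.nn.functional as F

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32, 10))
x = torch.rand(4, 1, 32, 32, requires_grad=True)
y = torch.randint(0, 10, (4,))

logits = model(x)
score = logits.gather(1, y[:, None]).sum()
grad, = torch.autograd.grad(score, x, create_graph=True)   # input-gradient attribution

def spatial_corr(a, b, eps=1e-8):
    # Cosine-style correlation between flattened, mean-centered maps.
    a = a.flatten(1); b = b.flatten(1)
    a = a - a.mean(1, keepdim=True); b = b - b.mean(1, keepdim=True)
    return (a * b).sum(1) / (a.norm(dim=1) * b.norm(dim=1) + eps)

alignment = spatial_corr(grad.abs(), x)                    # higher = attribution agrees with the image
loss = F.cross_entropy(logits, y) - 0.1 * alignment.mean()
loss.backward()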



Paperid:1225
Authors:Timo Stoffregen, Cedric Scheerlinck, Davide Scaramuzza, Tom Drummond, Nick Barnes, Lindsay Kleeman, Robert Mahony
Title: Reducing the Sim-to-Real Gap for Event Cameras
Abstract:
Event cameras are paradigm-shifting novel sensors that report asynchronous, per-pixel brightness changes called `events' with unparalleled low latency. This makes them ideal for high speed, high dynamic range scenes where conventional cameras would fail. Recent work has demonstrated impressive results using Convolutional Neural Networks (CNNs) for video reconstruction and optic flow with events. We present strategies for improving training data for event based CNNs that result in a 20-40\% boost in performance of existing state-of-the-art (SOTA) video reconstruction networks retrained with our method, and up to 15\% for optic flow networks. A challenge in evaluating event based video reconstruction is the lack of quality ground truth images in existing datasets. To address this, we present a new \textbf{High Quality Frames (HQF)} dataset, containing events and ground truth frames from a DAVIS240C that are well-exposed and minimally motion-blurred. We evaluate our method on HQF + several existing major event camera datasets."



Paperid:1226
Authors:Liangliang Ren, Yangyang Song, Jiwen Lu, Jie Zhou
Title: Spatial Geometric Reasoning for Room Layout Estimation via Deep Reinforcement Learning
Abstract:
Unlike most existing works that define room layout on a 2D image, we model the layout in 3D as a configuration of the camera and the room. Our spatial geometric representation with only seven variables is more concise but effective, and more importantly enables direct 3D reasoning, e.g. how the camera is positioned relative to the room. This is particularly valuable in applications such as indoor robot navigation. We formulate the problem as a Markov decision process, in which the layout is incrementally adjusted based on the difference between the current layout and the target image, and the policy is learned via deep reinforcement learning. Our framework is end-to-end trainable, requiring no extra optimization, and achieves competitive performance on two challenging room layout datasets."



Paperid:1227
Authors:Barret Zoph, Ekin D. Cubuk, Golnaz Ghiasi, Tsung-Yi Lin, Jonathon Shlens, Quoc V. Le
Title: Learning Data Augmentation Strategies for Object Detection
Abstract:
Much research on object detection focuses on building better model architectures and detection algorithms. Changing the model architecture, however, comes at the cost of adding more complexity to inference, making models slower. Data augmentation, on the other hand, doesn't add any inference complexity, but is insufficiently studied in object detection for two reasons. First it is more difficult to design plausible augmentation strategies for object detection than for classification, because one must handle the complexity of bounding boxes if geometric transformations are applied. Secondly, data augmentation attracts less research attention perhaps because it is believed to add less value and to transfer poorly compared to advances in network architectures. This paper serves two main purposes. First, we propose to use AutoAugment to design better data augmentation strategies for object detection because it can address the difficulty of designing them. Second, we use the method to assess the value of data augmentation in object detection and compare it against the value of architectures. Our investigation into data augmentation for object detection identifies two surprising results. First, by changing the data augmentation strategy to our method, AutoAugment for detection, we can improve RetinaNet with a ResNet-50 backbone from 36.7 to 39.0 mAP on COCO, a difference of +2.3mAP. This gain exceeds the gain achieved by switching the backbone from ResNet-50 to ResNet-101 (+2.1mAP), which incurs additional training and inference costs. The second surprising finding is that our strategies found on the COCO dataset transfer well to the PASCAL dataset to improve accuracy by +2.7mAP. These results together with our systematic studies of data augmentation call into question previous assumptions about the role and transferability of architectures versus data augmentation. In particular, changing the augmentation may lead to performance gains that are equally transferable as changing the underlying architecture."



Paperid:1228
Authors:Xiyang Dai, Dongdong Chen, Mengchen Liu, Yinpeng Chen, Lu Yuan
Title: DA-NAS: Data Adapted Pruning for Efficient Neural Architecture Search
Abstract:
Efficient search is a core issue in Neural Architecture Search (NAS). It is difficult for conventional NAS algorithms to directly search the architectures on large-scale tasks like ImageNet. In general, the cost of GPU hours for NAS grows with regard to training dataset size and candidate set size. One common way is searching on a smaller proxy dataset (e.g., CIFAR-10) and then transferring to the target task (e.g., ImageNet). These architectures optimized on proxy data are not guaranteed to be optimal on the target task. Another common way is learning with a smaller candidate set, which may require expert knowledge and indeed betrays the essence of NAS. In this paper, we present DA-NAS that can directly search the architecture for large-scale target tasks while allowing a large candidate set in a more efficient manner. Our method is based on an interesting observation that the learning speed for blocks in deep neural networks is related to the difficulty of recognizing distinct categories. We carefully design a progressive data adapted pruning strategy for efficient architecture search. It quickly trims low-performing blocks on a subset of the target dataset and then gradually finds the best blocks on the whole target dataset. At this time, the original candidate set becomes as compact as possible, providing a faster search in the target task. Experiments on ImageNet verify the effectiveness of our approach. It is $2\times$ faster than previous methods while the accuracy is currently state-of-the-art, at 76.2\% under a small FLOPs constraint. It supports an augmented search space (i.e., more candidate blocks) to efficiently search for the best-performing architecture."



Paperid:1229
Authors:Steven Spratley, Krista Ehinger, Tim Miller
Title: A Closer Look at Generalisation in RAVEN
Abstract:
Humans have a remarkable capacity to draw parallels between concepts, generalising their experience to new domains. This skill is essential to solving the visual problems featured in the RAVEN and PGM datasets, yet previous papers have scarcely tested how well models generalise across tasks. Additionally, we encounter a critical issue that allows existing models to inadvertently 'cheat' problems in RAVEN. We therefore propose a simple workaround to resolve this issue, and focus the conversation on generalisation performance, as this was severely affected in the process. We revise the existing evaluation, and introduce two relational models, Rel-Base and Rel-AIR, that significantly improve this performance. To our knowledge, Rel-AIR is the first method to employ unsupervised scene decomposition in solving abstract visual reasoning problems, and along with Rel-Base, sets the state of the art for image-only reasoning and generalisation across both RAVEN and PGM."



Paperid:1230
Authors:Xier Chen, Yanchao Lian, Licheng Jiao, Haoran Wang, YanJie Gao, Shi Lingling
Title: Supervised Edge Attention Network for Accurate Image Instance Segmentation
Abstract:
Effectively keeping the boundary of the mask complete is important in instance segmentation. In this task, many works segment instances based on a bounding box from the box head, which means the quality of the detection also affects the completeness of the mask. To circumvent this issue, we propose a fully convolutional box head and a supervised edge attention module in the mask head. The box head contains one new IoU prediction branch. It learns the association between object features and detected bounding boxes to provide more accurate bounding boxes for segmentation. The edge attention module utilizes an attention mechanism to highlight the object and suppress background noise, and a supervised branch is devised to guide the network to focus on the edge of instances precisely. To evaluate the effectiveness, we conduct experiments on the COCO dataset. Without bells and whistles, our approach achieves impressive and robust improvement compared to baseline models. Code is at https://github.com//IPIU-detection/SEANet."



Paperid:1231
Authors:Jian Hu, Hongya Tuo, Chao Wang, Lingfeng Qiao, Haowen Zhong, Junchi Yan, Zhongliang Jing, Henry Leung
Title: Discriminative Partial Domain Adversarial Network
Abstract:
Domain adaptation (DA), which assumes that the source and target domains share the same label space, has been a fundamental building block for Transfer Learning (TL). A more general and realistic setting is that the label space of the target domain is a subset of that of the source domain, as termed by Partial Domain Adaptation (PDA). Previous methods typically match the whole source domain to the target domain, which causes negative transfer due to the source-negative classes in the source domain that do not exist in the target domain. In this paper, a novel Discriminative Partial Domain Adversarial Network (DPDAN) is developed. We first propose to use hard binary weighting to differentiate the source-positive and source-negative samples in the source domain. The source-positive samples are those with labels shared by the two domains, while the rest in the source domain are treated as source-negative samples. Based on the above binary relabeling strategy, our algorithm maximizes the distribution divergence between source-negative samples and all the others (source-positive and target samples), and meanwhile minimizes the domain shift between source-positive samples and the target domain to obtain discriminative domain-invariant features. We empirically verify that DPDAN can effectively reduce the negative transfer caused by source-negative classes, and also theoretically show it decreases negative transfer caused by domain shift. Experiments on four benchmark domain adaptation datasets show DPDAN consistently outperforms state-of-the-art methods."
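The hard binary weighting step can be pictured with the small sketch below, where the shared label set is assumed to be known or estimated; the adversarial networks and divergence terms built on top of these weights are not shown.

import torch

source_labels = torch.randint(0, 10, (64,))      # full source label space {0, ..., 9}
shared_classes = torch.tensor([0, 1, 2, 3])      # classes assumed shared with the target

# 1 for source-positive samples (label in the shared set), 0 for source-negative ones.
is_source_positive = (source_labels[:, None] == shared_classes[None, :]).any(dim=1).float()

# E.g. only source-positive samples would drive source-target alignment, while
# source-negative samples would be pushed away from everything else.
align_weight = is_source_positive
separate_weight = 1.0 - is_source_positive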



Paperid:1232
Authors:John Janiczek, Parth Thaker, Gautam Dasarathy, Christopher S. Edwards , Philip Christensen, Suren Jayasuriya
Title: Differentiable Programming for Hyperspectral Unmixing using a Physics-based Dispersion Model
Abstract:
Hyperspectral unmixing is an important remote sensing task with applications including material identification and analysis. Characteristic spectral features make many pure materials identifiable from their visible-to-infrared spectra, but quantifying their presence within a mixture is a challenging task due to nonlinearities and factors of variation. In this paper, spectral variation is considered from a physics-based approach and incorporated into an end-to-end spectral unmixing algorithm via differentiable programming. The dispersion model is introduced to simulate realistic spectral variation, and an efficient method to fit the parameters is presented. Then, this dispersion model is utilized as a generative model within an analysis-by-synthesis spectral unmixing algorithm. Further, a technique for inverse rendering using a convolutional neural network to predict parameters of the generative model is introduced to enhance performance and speed when training data is available. Results achieve state-of-the-art on both infrared and visible-to-near-infrared (VNIR) datasets, and show promise for the synergy between physics-based models and deep learning in hyperspectral unmixing in the future."



Paperid:1233
Authors:Xiao Shi, Chenxue Yang, Xue Xia, Xiujuan Chai
Title: Deep Cross-species Feature Learning for Animal Face Recognition via Residual Interspecies Equivariant Network
Abstract:
Although human face recognition has achieved exceptional success driven by deep learning, animal face recognition (AFR) is still a research field that has received less attention. Due to the great challenge of collecting large-scale animal face datasets, it is difficult to train a high-precision AFR model from scratch. In this work, we propose a novel Residual InterSpecies Equivariant Network (RiseNet) to deal with the animal face recognition task with limited training samples. First, we formulate a module called residual inter-species feature equivariant to make the feature distribution of animal faces closer to that of humans. Second, according to the structural characteristics of animal faces, the features of the upper and lower half faces are learned separately. We present an animal facial feature fusion module to treat the features of the lower half face as additional information, which improves the proposed RiseNet performance. Besides, an animal face alignment strategy is designed for the preprocessing of the proposed network, which further aligns animal faces with human face images. Extensive experiments on two benchmarks show that our method is effective and outperforms the state-of-the-art methods."



Paperid:1234
Authors:Liang Liao, Jing Xiao, Zheng Wang, Chia-Wen Lin, Shin’ichi Satoh
Title: Guidance and Evaluation: Semantic-Aware Image Inpainting for Mixed Scenes
Abstract:
Completing a corrupted image with correct structures and reasonable textures for a mixed scene remains an elusive challenge. Since the missing hole in a mixed scene of a corrupted image often contains various semantic information, conventional two-stage approaches utilizing structural information often suffer from unreliable structural prediction and ambiguous image texture generation. In this paper, we propose a Semantic Guidance and Evaluation Network (SGE-Net) to iteratively update the structural priors and the inpainted image in an interplay framework of semantics extraction and image inpainting. It utilizes a semantic segmentation map as guidance at each scale of inpainting, under which location-dependent inferences are re-evaluated and poorly-inferred regions are accordingly refined in subsequent scales. Extensive experiments on real-world images of mixed scenes demonstrate the superiority of our proposed method over state-of-the-art approaches, in terms of clear boundaries and photo-realistic textures."



Paperid:1235
Authors:Moitreya Chatterjee, Anoop Cherian
Title: Sound2Sight: Generating Visual Dynamics from Sound and Context
Abstract:
Learning associations across modalities is critical for robust multimodal reasoning, especially when a modality may be missing during inference. In this paper, we study this problem in the context of audio-conditioned visual synthesis -- a task that is important, for example, in occlusion reasoning. Specifically, our goal is to generate future video frames and their motion dynamics conditioned on audio and a few past frames. To tackle this problem, we present Sound2Sight, a deep variational framework trained to learn a per-frame stochastic prior conditioned on a joint embedding of audio and past frames. This embedding is learned via a multi-head attention-based audio-visual transformer encoder. The learned prior is then sampled to further condition a video forecasting module to generate future frames. The stochastic prior allows the model to sample multiple plausible futures that are consistent with the provided audio and the past context. Moreover, to improve the quality and coherence of the generated frames, we propose a multimodal discriminator that differentiates between a synthesized and a real audio-visual clip. We empirically evaluate our approach, vis-à-vis closely-related prior methods, on two new datasets, viz. (i) Multimodal Stochastic Moving MNIST with a Surprise Obstacle and (ii) Youtube Paintings, as well as on the existing Audio-Set Drums dataset. Our extensive experiments demonstrate that Sound2Sight significantly outperforms the state of the art in generated video quality, while also producing diverse video content."



Paperid:1236
Authors:Jin Hyeok Yoo, Yecheol Kim, Jisong Kim, Jun Won Choi
Title: 3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection
Abstract:
In this paper, we propose a new deep architecture for fusing camera and LiDAR sensors for 3D object detection. Because the camera and LiDAR sensor signals have different characteristics and distributions, fusing these two modalities is expected to improve both the accuracy and robustness of 3D object detection. One of the challenges presented by the fusion of cameras and LiDAR is that the spatial feature maps obtained from each modality are represented by significantly different views in the camera and world coordinates; hence, it is not an easy task to combine two heterogeneous feature maps without loss of information. To address this problem, we propose a method called 3D-CVF that combines the camera and LiDAR features using a cross-view spatial feature fusion strategy. First, the method employs auto-calibrated projection to transform the 2D camera features into a smooth spatial feature map with the highest correspondence to the LiDAR features in the bird's eye view (BEV) domain. Then, a gated feature fusion network applies spatial attention maps to mix the camera and LiDAR features appropriately according to the region. Next, camera-LiDAR feature fusion is also performed in the subsequent proposal refinement stage. The low-level LiDAR features and camera features are separately pooled using region-of-interest (RoI)-based feature pooling and fused with the joint camera-LiDAR features for enhanced proposal refinement. Our evaluation, conducted on the KITTI and nuScenes 3D object detection datasets, demonstrates that camera-LiDAR fusion offers significant performance gains over the LiDAR-only baseline and that the proposed 3D-CVF achieves state-of-the-art performance on the KITTI benchmark."



Paperid:1237
Authors:Karishma Sharma, Pinar Donmez, Enming Luo, Yan Liu, I. Zeki Yalniz
Title: NoiseRank: Unsupervised Label Noise Reduction with Dependence Models
Abstract:
Label noise is increasingly prevalent in datasets acquired from noisy channels. Existing approaches that detect and remove label noise generally rely on some form of supervision, which is not scalable and is error-prone. In this paper, we propose NoiseRank for unsupervised label noise reduction using Markov Random Fields (MRF). We construct a dependence model to estimate the posterior probability of an instance being incorrectly labeled given the dataset, and rank instances based on their estimated probabilities. Our method i) does not require supervision from ground-truth labels or priors on the label or noise distribution, ii) is interpretable by design, enabling transparency in label noise removal, and iii) is agnostic to the classifier architecture/optimization framework and content modality. These advantages enable wide applicability in real noise settings, unlike prior works constrained by one or more conditions. NoiseRank improves state-of-the-art classification on Food101-N ($\sim$20% noise), and is effective on the high-noise Clothing-1M dataset ($\sim$40% noise)."



Paperid:1238
Authors:Seobin Park, Jinsu Yoo, Donghyeon Cho, Jiwon Kim, Tae Hyun Kim
Title: Fast Adaptation to Super-Resolution Networks via Meta-Learning
Abstract:
Conventional supervised super-resolution (SR) approaches are trained with massive external SR datasets but fail to exploit desirable properties of the given test image. On the other hand, self-supervised SR approaches utilize the internal information within a test image but suffer from high computational complexity at run-time. In this work, we observe the opportunity for further improving SISR performance, without changing the architecture of conventional SR networks, by practically exploiting the additional information given by the input image. In the training stage, we train the network via meta-learning; thus, the network can quickly adapt to any input image at test time. Then, in the test stage, the parameters of this meta-learned network are rapidly fine-tuned with only a few iterations using only the given low-resolution image. The adaptation at test time takes full advantage of the patch-recurrence property observed in natural images. Our method effectively handles unknown SR kernels and can be applied to any existing model. We demonstrate that the proposed model-agnostic approach consistently improves the performance of conventional SR networks on various benchmark SR datasets."
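
As an illustration of the test-time adaptation described above, the following minimal PyTorch sketch fine-tunes a generic pretrained SR model on an internal pseudo-pair built from the test image itself; the model name sr_net, the 2x scale factor and the loss are assumptions for illustration, not the authors' released code.

import torch
import torch.nn.functional as F

def adapt_and_infer(sr_net, lr_image, steps=5, lr=1e-4, scale=2):
    """Hypothetical test-time adaptation loop.
    lr_image: (1, C, H, W) low-resolution image with H, W divisible by `scale`."""
    opt = torch.optim.Adam(sr_net.parameters(), lr=lr)
    # Internal pseudo-pair: downscale the LR image and ask the network to
    # recover the LR image itself (relies on patch recurrence in natural images).
    lr_son = F.interpolate(lr_image, scale_factor=1.0 / scale,
                           mode='bicubic', align_corners=False)
    for _ in range(steps):                  # only a few iterations at test time
        opt.zero_grad()
        pred = sr_net(lr_son)               # assumed to upscale by `scale`
        loss = F.l1_loss(pred, lr_image)    # self-supervised reconstruction loss
        loss.backward()
        opt.step()
    with torch.no_grad():
        return sr_net(lr_image)             # final SR output after adaptation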



Paperid:1239
Authors:Siyu Huang, Fangbo Qin, Pengfei Xiong, Ning Ding, Yijia He, Xiao Liu
Title: TP-LSD: Tri-Points Based Line Segment Detector
Abstract:
This paper proposes a novel deep convolutional model, the Tri-Points Based Line Segment Detector (TP-LSD), to detect line segments in an image at real-time speed. Previous related methods typically use a two-step strategy, relying on either heuristic post-processing or an extra classifier. To realize one-step detection with a faster and more compact model, we introduce the tri-points representation, converting line segment detection into the end-to-end prediction of a root point and two endpoints for each line segment. TP-LSD has two branches: a tri-points extraction branch and a line segmentation branch. The former predicts the heat map of root points and the two displacement maps of endpoints. The latter segments the pixels on straight lines out from the background. Moreover, the line segmentation map is reused in the first branch as a structural prior. We propose an additional novel evaluation metric and evaluate our method on the Wireframe and YorkUrban datasets, demonstrating not only competitive accuracy compared to the most recent methods, but also real-time speed of up to 78 FPS with a 320 × 320 input."



Paperid:1240
Authors:Chenfeng Xu, Bichen Wu, Zining Wang, Wei Zhan, Peter Vajda, Kurt Keutzer, Masayoshi Tomizuka
Title: SqueezeSegV3: Spatially-Adaptive Convolution for Efficient Point-Cloud Segmentation
Abstract:
LiDAR point-cloud segmentation is an important problem for many applications. For large-scale point cloud segmentation, the de facto method is to project a 3D point cloud to get a 2D LiDAR image and use convolutions to process it. Despite the similarity between regular RGB and LiDAR images, we are the first to discover that the feature distribution of LiDAR images changes drastically at different image locations. Using standard convolutions to process such LiDAR images is problematic, as convolution filters pick up local features that are only active in specific regions of the image. As a result, the capacity of the network is under-utilized and segmentation performance decreases. To fix this, we propose Spatially-Adaptive Convolution (SAC), which adopts different filters for different locations according to the input image. SAC can be computed efficiently since it can be implemented as a series of element-wise multiplications, im2col, and standard convolution. It is a general framework such that several previous methods can be seen as special cases of SAC. Using SAC, we build SqueezeSegV3 for LiDAR point-cloud segmentation and outperform all previously published methods by at least 2.0\% mIoU on the SemanticKITTI benchmark. Code and pretrained models are available at \url{https://github.com/chenfengxu714/SqueezeSegV3}."
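
To make the "element-wise multiplication, im2col, and standard convolution" decomposition concrete, here is a minimal PyTorch sketch of one spatially-adaptive convolution layer; the way the adaptive map is predicted from the raw LiDAR image and the channel counts are assumptions, not the released SqueezeSegV3 code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatiallyAdaptiveConv(nn.Module):
    def __init__(self, in_ch, out_ch, raw_ch=5, k=3):
        super().__init__()
        self.k = k
        # Predicts one modulation value per location from the raw LiDAR image
        # (raw_ch=5 assumes x, y, z, intensity and range channels).
        self.attn = nn.Conv2d(raw_ch, 1, kernel_size=7, padding=3)
        # Standard convolution applied after the adaptive modulation.
        self.conv = nn.Conv2d(in_ch * k * k, out_ch, kernel_size=1)

    def forward(self, feat, raw):
        n, c, h, w = feat.shape
        a = torch.sigmoid(self.attn(raw))                    # (N, 1, H, W) adaptive map
        cols = F.unfold(feat, self.k, padding=self.k // 2)   # im2col: (N, C*k*k, H*W)
        cols = cols.view(n, c * self.k * self.k, h, w)
        cols = cols * a                                      # element-wise multiplication
        return self.conv(cols)                               # standard convolution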



Paperid:1241
Authors:Zilong Ji, Xiaolong Zou, Xiaohan Lin, Xiao Liu, Tiejun Huang, Si Wu
Title: An Attention-driven Two-stage Clustering Method for Unsupervised Person Re-Identification
Abstract:
The progressive clustering method and its variants, which iteratively generate pseudo labels for unlabeled data and perform feature learning, have shown great progress in unsupervised person re-identification (re-id). However, they have an intrinsic problem in modeling the in-camera variability of images: pedestrian features extracted from the same camera tend to be clustered into the same class. This often results in a non-convergent model in real-world applications of clustering-based re-id, leading to degraded performance. In the present study, we propose an attention-driven two-stage clustering (ADTC) method to solve this problem. Specifically, our method consists of two strategies. First, we use an unsupervised attention kernel to shift the learned features from the image background to the pedestrian foreground, which results in more informative clusters. Second, to aid the learning of the attention-driven clustering model, we separate the clustering process into two stages. We first use k-means to generate the centroids of clusters (stage 1) and then apply the k-reciprocal Jaccard distance (KRJD) metric to re-assign data points to each cluster (stage 2). By iteratively learning with the two strategies, the attentive regions are gradually shifted from the background to the foreground and the features become more discriminative. Using the two benchmark datasets Market1501 and DukeMTMC, we demonstrate that our model outperforms other state-of-the-art unsupervised approaches for person re-id."
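
A heavily simplified sketch of the two-stage clustering step is given below, using scikit-learn k-means for stage 1 and a Jaccard distance over k-reciprocal neighbour sets for the re-assignment in stage 2; the exact KRJD formulation and hyper-parameters of ADTC may differ.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def two_stage_clustering(features, n_clusters=500, k=20):
    """features: (N, D) array of pedestrian embeddings."""
    # Stage 1: cluster centroids from plain k-means.
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    labels = km.labels_

    # k-reciprocal neighbours: j is kept for i only if i and j appear in
    # each other's k-nearest-neighbour lists.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
    knn = nn.kneighbors(features, return_distance=False)[:, 1:]
    recip = [set(j for j in knn[i] if i in knn[j]) for i in range(len(features))]

    # Stage 2: re-assign each sample to the cluster whose stage-1 members
    # overlap most with its reciprocal-neighbour set (maximum Jaccard similarity).
    members = [set(np.where(labels == c)[0]) for c in range(n_clusters)]
    new_labels = labels.copy()
    for i, r in enumerate(recip):
        if not r:
            continue
        jac = [len(r & m) / len(r | m) if (r | m) else 0.0 for m in members]
        new_labels[i] = int(np.argmax(jac))
    return new_labels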



Paperid:1242
Authors:Jun Ling, Han Xue, Li Song, Shuhui Yang, Rong Xie, Xiao Gu
Title: Toward Fine-grained Facial Expression Manipulation
Abstract:
Facial expression manipulation aims at editing a facial expression according to a given condition. Previous methods edit an input image under the guidance of a discrete emotion label or an absolute condition (e.g., facial action units) to obtain the desired expression. However, these methods either suffer from changing condition-irrelevant regions or are inefficient for fine-grained editing. In this study, we take both objectives into consideration and propose a novel method. First, we replace the continuous absolute condition with a relative condition, specifically relative action units. With relative action units, the generator learns to transform only regions of interest, which are specified by non-zero-valued relative AUs. Second, our generator is built on U-Net but strengthened by a multi-scale feature fusion (MSF) mechanism for high-quality expression editing. Extensive quantitative and qualitative experiments demonstrate the improvements of our proposed approach over state-of-the-art expression editing methods."



Paperid:1243
Authors:Zhen Zhao, Yuhong Guo, Haifeng Shen, Jieping Ye
Title: Adaptive Object Detection with Dual Multi-Label Prediction
Abstract:
In this paper, we propose a novel end-to-end unsupervised deep domain adaptation model for adaptive object detection by exploiting multi-label object recognition as a dual auxiliary task. The model exploits multi-label prediction to reveal the object category information in each image and then uses the prediction results to perform conditional adversarial global feature alignment, such that the multimodal structure of image features can be tackled to bridge the domain divergence at the global feature level while preserving the discriminability of the features. Moreover, we introduce a prediction consistency regularization mechanism to assist object detection, which uses the multi-label prediction results as auxiliary regularization information to ensure consistent object category discoveries between the object recognition task and the object detection task. Experiments are conducted on several benchmark datasets and the results show that the proposed model outperforms state-of-the-art comparison methods."



Paperid:1244
Authors:Sachin Raja, Ajoy Mondal, C V Jawahar
Title: Table Structure Recognition using Top-Down and Bottom-Up Cues
Abstract:
Tables are information-rich structured objects in document images. While significant work has been done on localizing tables as graphic objects in document images, only limited attempts exist on table structure recognition. Most existing literature on structure recognition depends on the extraction of meta-features from the PDF document or on optical character recognition (OCR) models to extract low-level layout features from the image. However, these methods fail to generalize well because of the absence of meta-features or errors made by the OCR when there is significant variance in table layouts and text organization. In our work, we focus on tables that have complex structures, dense content, and varying layouts, with no dependency on meta-features and/or OCR. We present an approach for table structure recognition that combines cell detection and interaction modules to localize the cells and predict their row and column associations with other detected cells. We incorporate structural constraints as additional differential components to the loss function for cell detection. We empirically validate our method on the publicly available real-world datasets ICDAR-2013, ICDAR-2019, cTDaR archival, UNLV, SciTSR, SciTSR-COMP, TableBank, and PubTabNet. Our attempt opens up a new direction for table structure recognition by combining top-down (table cell detection) and bottom-up (structure recognition) cues in visually understanding tables."



Paperid:1245
Authors:Mingyu Yin, Li Sun, Qingli Li
Title: Novel View Synthesis on Unpaired Data by Conditional Deformable Variational Auto-Encoder
Abstract:
Novel view synthesis often requires paired data from both the source and target views. This paper proposes a view translation model within the cVAE-GAN framework for the purpose of unpaired training. We design a conditional deformable module (CDM) which uses the source (or target) view condition vectors as filters to convolve the feature maps from the main branch. It generates several pairs of displacement maps, similar to 2D optical flows. The flows then deform the features, and the results are given to the main branch of the encoder (or decoder) by the deformed-feature-based normalization module (DFNM). DFNM scales and offsets the feature maps in the main branch given its deformed input from the side branch. With the CDM and DFNM, the encoder outputs a view-irrelevant posterior, while the decoder takes samples drawn from it to synthesize the reconstructed and view-translated images. To further ensure the disentanglement between the views and other factors, we add adversarial training on the code drawn from the view-irrelevant posterior. The results and the ablation study on the MultiPIE and 3D chair datasets validate the effectiveness of the whole framework and the designed modules."



Paperid:1246
Authors:Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Batra, Stefan Lee
Title: Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments
Abstract:
We develop a language-guided navigation task set in a continuous 3D environment where agents must execute low-level actions to follow natural language navigation directions. By being situated in continuous environments, this setting lifts a number of assumptions implicit in prior work that represents environments as a sparse graph of panoramas with edges corresponding to navigability. Specifically, our setting drops the presumptions of known environment topologies, short-range oracle navigation, and perfect agent localization. To contextualize this new task, we develop models that mirror many of the advances made in prior settings as well as single-modality baselines. While some of these advances transfer, we find significantly lower absolute performance in the continuous setting, suggesting that performance in prior ‘navigation-graph’ settings may be inflated by their strong implicit assumptions."



Paperid:1247
Authors:Yueran Bai, Yingying Wang, Yunhai Tong, Yang Yang, Qiyue Liu, Junhui Liu
Title: Boundary Content Graph Neural Network for Temporal Action Proposal Generation
Abstract:
Temporal action proposal generation plays an important role in video action understanding, which requires localizing high-quality action content precisely. However, generating temporal proposals with both precise boundaries and high-quality action content is extremely challenging. To address this issue, we propose a novel Boundary Content Graph Neural Network (BC-GNN) to model the relations between the boundaries and action content of temporal proposals using graph neural networks. In BC-GNN, the boundaries and content of temporal proposals are taken as the nodes and edges of the graph neural network, respectively, so that they are naturally linked. A novel graph computation operation is then proposed to update the features of edges and nodes. After that, each updated edge and the two nodes it connects are used to predict boundary probabilities and a content confidence score, which are combined to generate a final high-quality proposal. Experiments are conducted on two mainstream datasets: ActivityNet-1.3 and THUMOS14. Without bells and whistles, BC-GNN outperforms previous state-of-the-art methods on both temporal action proposal and temporal action detection tasks."



Paperid:1248
Authors:Yunhao Ge, Jiaping Zhao, Laurent Itti
Title: Pose Augmentation: Class-agnostic Object Pose Transformation for Object Recognition
Abstract:
Object pose increases intra-class object variance, which makes object recognition from 2D images harder. To render a classifier robust to pose variations, most deep neural networks try to eliminate the influence of pose by using large datasets with many poses for each class. Here, we propose a different approach: a class-agnostic object pose transformation network (OPT-Net) that transforms an image along the 3D yaw and pitch axes to synthesize additional poses continuously. The synthesized images lead to better training of an object classifier. We design a novel eliminate-add structure to explicitly disentangle pose from object identity: we first ‘eliminate’ the pose information of the input image and then ‘add’ target pose information (regularized as continuous variables) to synthesize any target pose. We trained OPT-Net on images of toy vehicles shot on a turntable from the iLab-20M dataset. After training on unbalanced discrete poses (5 classes with 6 poses per object instance, plus 5 classes with only 2 poses), we show that OPT-Net can synthesize balanced, continuous new poses along the yaw and pitch axes with high quality. Training a ResNet-18 classifier with the original plus synthesized poses improves mAP accuracy by 9% over training on the original poses only. Further, the pre-trained OPT-Net can generalize to new object classes, which we demonstrate on both iLab-20M and RGB-D. We also show that the learned features can generalize to ImageNet. (The code is released at this URL)"



Paperid:1249
Authors:Minuk Ma, Sunjae Yoon, Junyeong Kim, Youngjoon Lee, Sunghun Kang, Chang D. Yoo
Title: VLANet: Video-Language Alignment Network for Weakly-Supervised Video Moment Retrieval
Abstract:
Video Moment Retrieval (VMR) is the task of localizing the temporal moment in an untrimmed video specified by a natural language query. For VMR, several methods that require full supervision for training have been proposed. Unfortunately, acquiring a large number of training videos with labeled temporal boundaries for each query is a labor-intensive process. This paper explores a method for performing VMR in a weakly-supervised manner (wVMR): training is performed without temporal moment labels, using only the text query that describes a segment of the video. Existing methods on wVMR generate multi-scale proposals and apply a query-guided attention mechanism to highlight the most relevant proposal. To leverage the weak supervision, contrastive learning is used, which predicts higher scores for correct video-query pairs than for incorrect pairs. It has been observed that a large number of candidate proposals, coarse query representations, and a one-way attention mechanism lead to a blurry attention map, which limits the localization performance. To address this issue, the Video-Language Alignment Network (VLANet) is proposed, which learns sharper attention by pruning out spurious candidate proposals and applying a multi-directional attention mechanism with fine-grained query representations. The Surrogate Proposal Selection module selects a proposal based on its proximity to the query in the joint embedding space, and thus substantially reduces the number of candidate proposals, which leads to a lower computational load and sharper attention. Next, the Cascaded Cross-modal Attention module considers dense feature interactions and multi-directional attention flows to learn the multi-modal alignment. VLANet is trained end-to-end using a contrastive loss which enforces semantically similar videos and queries to cluster. Experiments show that the method achieves state-of-the-art performance on the Charades-STA and DiDeMo datasets."



Paperid:1250
Authors:Albert Gordo, Filip Radenovic, Tamara Berg
Title: Attention-Based Query Expansion Learning
Abstract:
Query expansion is a technique widely used in image search consisting in combining highly ranked images from an original query into an expanded query that is then reissued, generally leading to increased recall and precision. An important aspect of query expansion is choosing an appropriate way to combine the images into a new query. Interestingly, despite the undeniable empirical success of query expansion, ad-hoc methods with different caveats have dominated the landscape, and not a lot of research has been done on learning how to do query expansion. In this paper we propose a more principled framework to query expansion, where one trains, in a discriminative manner, a model that learns how images should be aggregated to form the expanded query. Within this framework, we propose a model that leverages a self-attention mechanism to effectively learn how to transfer information between the different images before aggregating them. Our approach obtains higher accuracy than existing approaches on standard benchmarks. More importantly, our approach is the only one that consistently shows high accuracy under different regimes, overcoming caveats of existing methods."
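
The core idea of learning how to aggregate the top-ranked images can be illustrated with a small attention-weighted aggregation module; the projections and scaling below are placeholders for the paper's learned model, not its actual architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionQueryExpansion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)   # query projection
        self.proj_k = nn.Linear(dim, dim)   # neighbour (key) projection

    def forward(self, query, neighbors):
        """query: (D,) descriptor; neighbors: (K, D) top-ranked descriptors."""
        q = self.proj_q(query)                               # (D,)
        k = self.proj_k(neighbors)                           # (K, D)
        attn = F.softmax(k @ q / q.shape[0] ** 0.5, dim=0)   # (K,) attention weights
        expanded = query + (attn.unsqueeze(1) * neighbors).sum(0)
        return F.normalize(expanded, dim=0)                  # L2-normalised expanded query

The expanded descriptor would then be reissued as a new query against the database, as in standard query expansion.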



Paperid:1251
Authors:Boren Li, Po-Yu Zhuang, Jian Gu, Mingyang Li, Ping Tan
Title: Interpretable Foreground Object Search As Knowledge Distillation
Abstract:
This paper proposes a knowledge distillation method for foreground object search (FoS). Given a background and a rectangle specifying the foreground location and scale, FoS retrieves compatible foregrounds in a certain category for later image composition. Foregrounds within the same category can be grouped into a small number of patterns. Instances within each pattern are compatible with any query input interchangeably. These instances are referred to as interchangeable foregrounds. We first present a pipeline to build a pattern-level FoS dataset containing labels of interchangeable foregrounds. We then establish a benchmark dataset for further training and testing following this pipeline. As for the proposed method, we first train a foreground encoder to learn representations of interchangeable foregrounds. We then train a query encoder to learn query-foreground compatibility following a knowledge distillation framework. It aims to transfer knowledge from interchangeable foregrounds to supervise the representation learning of compatibility. The query feature representation is projected to the same latent space as interchangeable foregrounds, enabling very efficient and interpretable instance-level search. Furthermore, pattern-level search is feasible to retrieve more controllable, reasonable and diverse foregrounds. The proposed method outperforms the previous state-of-the-art by $10.42\%$ in absolute difference and $24.06\%$ in relative improvement evaluated by mean average precision (mAP). Extensive experimental results also demonstrate its efficacy from various aspects. The benchmark dataset and code will be released shortly."



Paperid:1252
Authors:Zailiang Chen, Xianxian Zheng, Hailan Shen, Ziyang Zeng, Yukun Zhou, Rongchang Zhao
Title: Improving Knowledge Distillation via Category Structure
Abstract:
Most previous knowledge distillation frameworks train the student to mimic the teacher's output for each sample or transfer cross-sample relations from the teacher to the student. Nevertheless, they neglect the structured relations at the category level. In this paper, a novel Category Structure is proposed to transfer category-level structured relations for knowledge distillation. It models two structured relations, the intra-category structure and the inter-category structure, which are intrinsic to the relations between samples. The intra-category structure constrains the structured relations among samples from the same category, while the inter-category structure focuses on cross-category relations at the category level. Transferring the category structure from the teacher to the student supplements category-level structured relations for training a better student. Extensive experiments show that our method groups samples from the same category more tightly in the embedding space, and the superiority of our method over closely related works is validated on different datasets and models."
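
The following sketch shows one plausible way to match intra-category and inter-category relation structures between a teacher and a student; the relation function (cosine similarity) and the loss weighting are assumptions rather than the paper's exact formulation.

import torch
import torch.nn.functional as F

def relation_matrix(x):
    x = F.normalize(x, dim=1)
    return x @ x.t()                                    # cosine-similarity relations

def category_structure_loss(f_t, f_s, labels):
    """f_t, f_s: (N, D) teacher/student features; labels: (N,) class ids."""
    classes = labels.unique()
    # Inter-category structure: relations between class centroids.
    c_t = torch.stack([f_t[labels == c].mean(0) for c in classes])
    c_s = torch.stack([f_s[labels == c].mean(0) for c in classes])
    inter = F.mse_loss(relation_matrix(c_s), relation_matrix(c_t))
    # Intra-category structure: relations among samples of the same class.
    intra = 0.0
    for c in classes:
        idx = labels == c
        if idx.sum() > 1:
            intra = intra + F.mse_loss(relation_matrix(f_s[idx]),
                                       relation_matrix(f_t[idx]))
    return inter + intra / len(classes)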



Paperid:1253
Authors:Stephan J. Garbin, Marek Kowalski, Matthew Johnson, Jamie Shotton
Title: High Resolution Zero-Shot Domain Adaptation of Synthetically Rendered Face Images
Abstract:
Generating photorealistic images of human faces at scale remains a prohibitively difficult task using computer graphics approaches. This is because these require the simulation of light to be photorealistic, which in turn requires physically accurate modelling of geometry, materials, and light sources, for both the head and the surrounding scene. Non-photorealistic renders however are increasingly easy to produce. In contrast to computer graphics approaches, generative models learned from more readily available 2D image data have been shown to produce samples of human faces that are hard to distinguish from real data. The process of learning usually corresponds to a loss of control over the shape and appearance of the generated images. For instance, even simple disentangling tasks such as modifying the hair independently of the face, which is trivial to accomplish in a computer graphics approach, remains an open research question. In this work, we propose an algorithm that matches a non-photorealistic, synthetically generated image to a latent vector of a pretrained StyleGAN2 model which, in turn, maps the vector to a photorealistic image of a person of the same pose, expression, hair, and lighting. In contrast to most previous work, we require no synthetic training data. To the best of our knowledge, this is the first algorithm of its kind to work at a resolution of 1K and represents a significant leap forward in visual realism."



Paperid:1254
Authors:Fangyu Wu, Jeremy S.Smith, Wenjin Lu, Chaoyi Pang, Bailing Zhang
Title: Attentive Prototype Few-shot Learning with Capsule Network-based Embedding
Abstract:
Few-shot learning, namely recognizing novel categories with a very small number of training examples, is a challenging area of machine learning research. Traditional deep learning methods require massive training data to tune the huge number of parameters, which is often impractical and prone to over-fitting. In this work, we build on the well-known prototypical networks to obtain better few-shot performance. Our contributions include (1) a new embedding structure that encodes relative spatial relationships between features by applying a capsule network; (2) a new triplet loss designed to enhance the semantic feature embedding, so that similar samples are close to each other while dissimilar samples are farther apart; and (3) an effective non-parametric classifier termed attentive prototypes, in place of the simple prototypes used in current few-shot learning. The proposed attentive prototype aggregates all of the instances in a support class, weighted by their importance, which is defined by the reconstruction error for a given query. The reconstruction error allows the classification posterior probability to be estimated, which corresponds to the classification confidence score. Extensive experiments on three benchmark datasets demonstrate that our approach is effective for the few-shot classification task."
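
A minimal sketch of the attentive prototype idea follows, where a negative squared distance stands in for the reconstruction error that the capsule-based model uses to weight support instances; the function names and the distance proxy are assumptions.

import torch
import torch.nn.functional as F

def attentive_prototype(support, query):
    """support: (S, D) embedded support instances of one class; query: (D,)."""
    err = ((support - query) ** 2).sum(dim=1)      # proxy for reconstruction error
    w = F.softmax(-err, dim=0)                     # smaller error -> larger weight
    return (w.unsqueeze(1) * support).sum(dim=0)   # query-conditioned prototype

def classify(query, supports_per_class):
    """supports_per_class: list of (S_c, D) tensors, one per class."""
    protos = torch.stack([attentive_prototype(s, query) for s in supports_per_class])
    logits = -((protos - query) ** 2).sum(dim=1)   # nearest-prototype scores
    return F.softmax(logits, dim=0)                # class posterior for the query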



Paperid:1255
Authors:Aditya Arun, C.V. Jawahar, M. Pawan Kumar
Title: Weakly Supervised Instance Segmentation by Learning Annotation Consistent Instances
Abstract:
Recent approaches for weakly supervised instance segmentations depend on two components: (i) a pseudo label generation model that provides instances which are consistent with a given annotation; and (ii) an instance segmentation model, which is trained in a supervised manner using the pseudo labels as ground-truth. Unlike previous approaches, we explicitly model the uncertainty in the pseudo label generation process using a conditional distribution. The samples drawn from our conditional distribution provide accurate pseudo labels due to the use of semantic class aware unary terms, boundary aware pairwise smoothness terms, and annotation aware higher order terms. Furthermore, we represent the instance segmentation model as an annotation agnostic prediction distribution. In contrast to previous methods, our representation allows us to define a joint probabilistic learning objective that minimizes the dissimilarity between the two distributions. Our approach achieves state of the art results on the PASCAL VOC 2012 data set, outperforming the best baseline by 4.2% mAP@0.5 and 4.8% mAP@0.75."



Paperid:1256
Authors:Yao Zhou, Guowei Wan, Shenhua Hou, Li Yu, Gang Wang, Xiaofei Rui, Shiyu Song
Title: DA4AD: End-to-End Deep Attention-based Visual Localization for Autonomous Driving
Abstract:
We present a visual localization framework based on novel deep attention aware features for autonomous driving that achieves centimeter level localization accuracy. Conventional approaches to the visual localization problem rely on handcrafted features or human-made objects on the road. They are known to be either prone to unstable matching caused by severe appearance or lighting changes, or too scarce to deliver constant and robust localization results in challenging scenarios. In this work, we seek to exploit the deep attention mechanism to search for salient, distinctive and stable features that are good for long-term matching in the scene through a novel end-to-end deep neural network. Furthermore, our learned feature descriptors are demonstrated to be competent to establish robust matches and therefore successfully estimate the optimal camera poses with high precision. We comprehensively validate the effectiveness of our method using a freshly collected dataset with high-quality ground truth trajectories and hardware synchronization between sensors. Results demonstrate that our method achieves a competitive localization accuracy when compared to the LiDAR-based localization solutions under various challenging circumstances, leading to a potential low-cost localization solution for autonomous driving."



Paperid:1257
Authors:Duc Minh Vo, Akihiro Sugimoto
Title: Visual-Relation Conscious Image Generation from Structured-Text
Abstract:
We propose an end-to-end network for image generation from given structured-text, which consists of a visual-relation layout module and a pyramid of GANs, namely stacking-GANs. Our visual-relation layout module uses the relations among entities in the structured-text in two ways: comprehensive usage and individual usage. We comprehensively use all available relations together to localize the initial bounding-boxes of all the entities. We also use each individual relation separately to predict, from the initial bounding-boxes, relation-units for all the relations in the input text. We then unify all the relation-units to produce the visual-relation layout, i.e., bounding-boxes for all the entities, such that each of them uniquely corresponds to one entity while keeping its involved relations. Our visual-relation layout reflects the scene structure given in the input text. The stacking-GANs is a stack of three GANs conditioned on the visual-relation layout and the output of the previous GAN, consistently capturing the scene structure. Our network realistically renders entities' details in high resolution while keeping the scene structure. Experimental results on two public datasets show that our method outperforms state-of-the-art methods."



Paperid:1258
Authors:Lianli Gao, Qilong Zhang, Jingkuan Song, Xianglong Liu, Heng Tao Shen
Title: Patch-wise Attack for Fooling Deep Neural Network
Abstract:
By adding human-imperceptible noise to clean images, the resultant adversarial examples can fool other unknown models. Features of a pixel extracted by deep neural networks (DNNs) are influenced by its surrounding regions, and different DNNs generally focus on different discriminative regions in recognition. Motivated by this, we propose a \underline{patch}-wise iterative algorithm -- a black-box attack against mainstream normally trained and defense models, which differs from the existing attack methods manipulating \underline{pixel}-wise noise. In this way, without sacrificing the performance of the substitute model, our adversarial examples can have strong transferability. Specifically, we introduce an amplification factor to the step size in each iteration, and one pixel's overall gradient overflowing the $\epsilon$-constraint is properly assigned to its surrounding regions by a projection kernel. Our method can be generally integrated into any gradient-based attack method. Compared with the current state-of-the-art attacks, we significantly improve the success rate by 9.2\% for defense models and 3.7\% for normally trained models on average. Our anonymous code is available at \url{http://tiny.cc/da7vkz}"
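
A simplified sketch of a patch-wise iterative attack is shown below: the step size is amplified, and the part of the perturbation that overflows the $\epsilon$-ball is spread to neighbouring pixels through a uniform projection kernel. The constants and the exact update rule are illustrative, not the authors' reference implementation.

import torch
import torch.nn.functional as F

def patchwise_attack(model, x, y, eps=16 / 255, iters=10, amp=10.0, ksize=3):
    step = amp * eps / iters                          # amplified step size
    kernel = torch.ones(x.size(1), 1, ksize, ksize, device=x.device) / ksize ** 2
    adv = x.clone().detach()
    for _ in range(iters):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), y)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + step * grad.sign()
        # Redistribute the perturbation mass that exceeds the eps-ball to the
        # surrounding region instead of simply clipping it away.
        overflow = (adv - x).clamp(-eps, eps)
        cut = (adv - x) - overflow
        spread = F.conv2d(cut, kernel, padding=ksize // 2, groups=x.size(1))
        adv = (x + overflow + spread).clamp(0, 1)
        adv = x + (adv - x).clamp(-eps, eps)          # project back into the eps-ball
    return adv.detach()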



Paperid:1259
Authors:Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, Qianru Sun
Title: Feature Pyramid Transformer
Abstract:
Feature interactions across space and scales underpin modern visual recognition systems because they introduce beneficial visual contexts. Conventionally, spatial contexts are passively hidden in the CNN's increasing receptive fields or actively encoded by non-local convolution. Yet, the non-local spatial interactions are not across scales, and thus they fail to capture the non-local contexts of objects (or parts) residing in different scales. To this end, we propose a fully active feature interaction across both space and scales, called Feature Pyramid Transformer (FPT). It transforms any feature pyramid into another feature pyramid of the same size but with richer contexts, by using three specially designed transformers in self-level, top-down, and bottom-up interaction fashion. FPT serves as a generic visual backbone with fair computational overhead. We conduct extensive experiments in both instance-level (i.e., object detection and instance segmentation) and pixel-level segmentation tasks, using various backbones and head networks, and observe consistent improvement over all the baselines and the state-of-the-art methods."



Paperid:1260
Authors:Jiabin Xing, Zhi Qi, Jiying Dong, Jiaxuan Cai, Hao Liu
Title: MABNet: A Lightweight Stereo Network Based on Multibranch Adjustable Bottleneck Module
Abstract:
Recently, end-to-end CNNs have shown remarkable performance for disparity estimation. However, most of them are too heavy for resource-constrained devices because of the enormous number of parameters necessary for satisfactory results. To address this issue, we propose two compact stereo networks, MABNet and its light version MABNet_tiny. MABNet is based on a novel Multibranch Adjustable Bottleneck (MAB) module, which is less demanding in parameters and computation. In a MAB module, the feature map is split into several parallel branches, where depthwise separable convolutions with different dilation rates extract features with multiple receptive fields at an affordable computational budget. Besides, the number of channels in each branch is independently adjustable to trade off computation and accuracy. On the SceneFlow and KITTI datasets, our MABNet achieves competitive accuracy with only 1.65M parameters. In particular, MABNet_tiny reduces the parameters to 47K by cutting down the channels and layers in MABNet."
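
An illustrative PyTorch sketch of a multibranch adjustable bottleneck is given below: the channels are split across parallel branches, each applying a depthwise separable convolution with its own dilation rate, and the branch outputs are concatenated. The channel splits and dilation rates are assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class MABModule(nn.Module):
    def __init__(self, channels, branch_channels=(16, 16, 16, 16),
                 dilations=(1, 2, 4, 8)):
        super().__init__()
        assert sum(branch_channels) == channels
        self.splits = list(branch_channels)
        self.branches = nn.ModuleList([
            nn.Sequential(
                # depthwise convolution with a branch-specific dilation rate
                nn.Conv2d(c, c, 3, padding=d, dilation=d, groups=c, bias=False),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True),
                # pointwise convolution completing the depthwise separable conv
                nn.Conv2d(c, c, 1, bias=False),
                nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            )
            for c, d in zip(branch_channels, dilations)
        ])

    def forward(self, x):
        parts = torch.split(x, self.splits, dim=1)
        return torch.cat([b(p) for b, p in zip(self.branches, parts)], dim=1)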



Paperid:1261
Authors:Lingxiao He, Wu Liu
Title: Guided Saliency Feature Learning for Person Re-identification in Crowded Scenes
Abstract:
Person Re-identification (Re-ID) in crowded scenes is a challenging problem, where people are frequently partially occluded by objects and other people. However, few studies have provided flexible solutions for re-identifying people from images in which body parts are partially occluded. In this paper, we propose a simple occlusion-aware approach to address the problem. The proposed method first leverages a fully convolutional network to generate spatial features. We then design a combination of pose-guided and mask-guided layers to generate a saliency heatmap that further guides discriminative feature learning. More importantly, we propose a new matching approach, called Guided Adaptive Spatial Matching (GASM), in which each spatial feature in the query is expected to find the most similar spatial features of a person in the gallery to match. In particular, we use the saliency heatmap to guide the adaptive spatial matching by adaptively assigning larger weights to the foreground human parts. The effectiveness of the proposed GASM is demonstrated on two occluded person datasets, Crowd REID (51.52\%) and Occluded REID (80.25\%), and three benchmark person datasets, Market1501 (95.31\%), DukeMTMC-reID (88.12\%) and MSMT17 (79.52\%). Additionally, GASM achieves good performance on cross-domain person Re-ID. The code and models are available at https://github.com/JDAI-CV/fast-reid/blob/master/projects/CrowdReID."



Paperid:1262
Authors:Miao Zhang, Sun Xiao Fei, Jie Liu, Shuang Xu, Yongri Piao, Huchuan Lu
Title: Asymmetric Two-Stream Architecture for Accurate RGB-D Saliency Detection
Abstract:
Most existing RGB-D saliency detection methods adopt symmetric two-stream architectures for learning discriminative RGB and depth representations. In fact, there is another level of ambiguity that is often overlooked: whether RGB and depth data should be fit into the same network. In this paper, we propose an asymmetric two-stream architecture that takes account of the inherent differences between RGB and depth data for saliency detection. First, we design a flow ladder module (FLM) for the RGB stream to fully extract global and local information while maintaining saliency details. This is achieved by constructing four detail-transfer branches, each of which preserves detail information and receives global location information from the representations of the other vertical parallel branches in an evolutionary way. Second, we propose a novel depth attention module (DAM) to ensure that depth features with high discriminative power in location and spatial structure are effectively utilized when combined with RGB features in challenging scenes. The depth features can also discriminatively guide the RGB features via the proposed DAM to precisely locate the salient objects. Extensive experiments demonstrate that our method achieves superior performance over 13 state-of-the-art RGB-D approaches on 7 datasets. Our code will be made publicly available."



Paperid:1263
Authors:Youcheng Sun, Hana Chockler, Xiaowei Huang, Daniel Kroening
Title: Explaining Image Classifiers using Statistical Fault Localization
Abstract:
The black-box nature of deep neural networks (DNNs) makes it impossible to understand why a particular output is produced, creating demand for “Explainable AI”. In this paper, we show that statistical fault localization (SFL) techniques from software engineering deliver high quality explanations of the outputs of DNNs, where we define an explanation as a minimal subset of features sufficient for making the same decision as for the original input. We present an algorithm and a tool called DeepCover, which synthesizes a ranking of the features of the inputs using SFL and constructs explanations for the decisions of the DNN based on this ranking. We compare explanations produced by DeepCover with those of the state-of-the-art tools GradCAM, LIME, SHAP, RISE and Extremal and show that explanations generated by DeepCover are consistently better across a broad set of experiments. On a benchmark set with known ground truth, DeepCover achieves 76.7% accuracy, which is 6% better than the second best Extremal."
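
The ranking step can be illustrated with a simplified SFL-style (Ochiai) measure computed from randomly masked copies of the input; the pass/fail bookkeeping below is a simplification of DeepCover's actual procedure, and the masking scheme is an assumption.

import numpy as np

def sfl_pixel_ranking(predict, x, n_samples=500, keep_prob=0.5, seed=0):
    """predict: function mapping an (H, W, C) image to a class id; x: input image."""
    rng = np.random.default_rng(seed)
    y0 = predict(x)
    h, w = x.shape[:2]
    kept_same = np.zeros((h, w))    # pixel kept and decision unchanged
    kept_diff = np.zeros((h, w))    # pixel kept and decision changed
    masked_same = np.zeros((h, w))  # pixel masked out and decision unchanged
    for _ in range(n_samples):
        mask = rng.random((h, w)) < keep_prob          # random subset of pixels kept
        same = bool(predict(x * mask[..., None]) == y0)
        kept_same += mask * same
        kept_diff += mask * (not same)
        masked_same += (~mask) * same
    # Ochiai-style score: a pixel ranks high when its presence correlates
    # with the classifier keeping its original decision.
    return kept_same / (np.sqrt((kept_same + masked_same) *
                                (kept_same + kept_diff)) + 1e-8)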



Paperid:1264
Authors:Michal Rolínek, Paul Swoboda, Dominik Zietlow, Anselm Paulus, Vít Musil, Georg Martius
Title: Deep Graph Matching via Blackbox Differentiation of Combinatorial Solvers
Abstract:
Building on recent progress at the intersection of combinatorial optimization and deep learning, we propose an end-to-end trainable architecture for deep graph matching that contains unmodified combinatorial solvers. Using the presence of heavily optimized combinatorial solvers together with some improvements in architecture design, we advance state-of-the-art on deep graph matching benchmarks for keypoint correspondence. In addition, we highlight conceptual advantages of incorporating solvers into deep learning architectures, such as the possibility of post-processing with a strong multi-graph matching solver or the indifference to changes in the training setting. Finally, we propose two new challenging experimental setups."



Paperid:1265
Authors:Simon Jenni, Givi Meishvili, Paolo Favaro
Title: Learning Video Representations by Transforming Time
Abstract:
We introduce a novel self-supervised learning approach to learn representations of videos that are responsive to changes in motion dynamics. Our representations can be learned from data without human annotation and provide a substantial boost to the training of neural networks on small labeled datasets for tasks such as action recognition, which require accurately distinguishing the motion of objects. We promote accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions. To learn to distinguish non-trivial motions, the design of the transformations is based on two principles: 1) defining clusters of motions based on time warps of different magnitude; 2) ensuring that the discrimination is feasible only by observing and analyzing as many image frames as possible. Thus, we introduce the following transformations: forward-backward playback, random frame skipping, and uniform frame skipping. Our experiments show that networks trained with the proposed method yield representations with improved transfer performance for action recognition on UCF101 and HMDB51."
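
The temporal transformations used as self-supervision targets can be generated as in the sketch below; the clip length, skip factor and label assignment are assumptions for illustration.

import torch

def temporal_transforms(video, out_len=16, skip=2, generator=None):
    """video: (T, C, H, W) tensor with T >= out_len * skip; returns (clip, label) pairs."""
    t = video.shape[0]
    original = video[:out_len]                                   # label 0: untouched clip
    half = video[:out_len // 2]
    forward_backward = torch.cat([half, half.flip(0)], dim=0)    # label 1: play, then rewind
    uniform = video[::skip][:out_len]                            # label 2: constant speed-up
    idx = torch.randperm(t, generator=generator)[:out_len].sort().values
    random_skip = video[idx]                                     # label 3: random frame skipping
    return [(original, 0), (forward_backward, 1), (uniform, 2), (random_skip, 3)]

A classification network would then be trained to predict which transformation produced each clip.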



Paperid:1266
Authors:Madhu Vankadari, Sourav Garg, Anima Majumder, Swagat Kumar, Ardhendu Behera
Title: Unsupervised Monocular Depth Estimation for Night-time Images using Adversarial Domain Feature Adaptation
Abstract:
In this paper, we look into the problem of estimating per-pixel depth maps from unconstrained RGB monocular night-time images which is a difficult task that has not been addressed adequately in the literature. The state-of-the-art day-time depth estimation methods fail miserably when tested with night-time images due to a large domain shift between them. The usual photometric losses used for training these networks may not work for night-time images due to the absence of uniform lighting which is commonly present in day-time images, making it a difficult problem to solve. We propose to solve this problem by posing it as a domain adaptation problem where a network trained with day-time images is adapted to work for night-time images. Specifically, an encoder is trained to generate features from night-time images that are indistinguishable from those obtained from day-time images by using a PatchGAN-based adversarial discriminative learning method. Unlike the existing methods that directly adapt depth prediction (network output), we propose to adapt feature maps obtained from the encoder network so that a pre-trained day-time depth decoder can be directly used for predicting depth from these adapted features. Hence, the resulting method is termed as “Adversarial Domain Feature Adaptation (ADFA)” and its efficacy is demonstrated through experimentation on the challenging Oxford night driving dataset. To the best of our knowledge, this work is a first of its kind to estimate depth from unconstrained night-time monocular RGB images that uses a completely unsupervised learning process. The modular encoder-decoder architecture for the proposed ADFA method allows us to use the encoder module as a feature extractor which can be used in many other applications. One such application is demonstrated where the features obtained from our adapted encoder network are shown to outperform other state-of-the-art methods in a visual place recognition problem, thereby, further establishing the usefulness and effectiveness of the proposed approach."



Paperid:1267
Authors:Linlin Chao, Jingdong Chen, Wei Chu
Title: Variational Connectionist Temporal Classification
Abstract:
Connectionist Temporal Classification (CTC) is a training criterion designed for sequence labelling problems where the alignment between the inputs and the target labels is unknown. One of the key steps is to add a blank symbol to the target vocabulary. However, CTC tends to output spiky distributions since it prefers to output the blank symbol most of the time. These spiky distributions yield inferior alignments, and the non-blank symbols are not learned sufficiently. To remedy this, we propose variational CTC (Var-CTC) to enhance the learning of non-blank symbols. The proposed Var-CTC replaces the output distribution of vanilla CTC with a hierarchical distribution. It first learns the approximate posterior distribution of the blank symbol to determine whether to output a specific non-blank symbol or not. Then it learns the alignment between non-blank symbols and the input sequence. Experiments on scene text recognition and offline handwritten text recognition show that Var-CTC achieves better alignments. Besides, with the enhanced learning of non-blank symbols, the confidence scores of the model outputs are more discriminative. Compared with vanilla CTC, the proposed Var-CTC can improve recall performance by a large margin while the models maintain the same level of precision."
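
The hierarchical factorisation of the per-frame output distribution can be sketched as follows: a blank probability is predicted first and the non-blank symbols share the remaining probability mass. This illustrates only the factorisation; the variational training objective of Var-CTC is not reproduced here, and the module layout is an assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalCTCHead(nn.Module):
    def __init__(self, feat_dim, n_symbols):
        super().__init__()
        self.blank = nn.Linear(feat_dim, 1)            # is this frame blank?
        self.symbols = nn.Linear(feat_dim, n_symbols)  # which non-blank symbol?

    def forward(self, feats):
        """feats: (T, N, D); returns log-probs (T, N, n_symbols + 1), blank last."""
        p_blank = torch.sigmoid(self.blank(feats))             # (T, N, 1)
        p_sym = F.softmax(self.symbols(feats), dim=-1)         # (T, N, S)
        probs = torch.cat([(1 - p_blank) * p_sym, p_blank], dim=-1)
        return probs.clamp_min(1e-8).log()   # usable with nn.CTCLoss(blank=n_symbols)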



Paperid:1268
Authors:Congzhentao Huang, Shuai Jiang, Yang Li, Ziyue Zhang, Jason Traish, Chen Deng, Sam Ferguson, Richard Yi Da Xu
Title: End-to-end Dynamic Matching Network for Multi-view Multi-person 3d Pose Estimation
Abstract:
As an important computer vision task, 3D human pose estimation in a multi-camera, multi-person setting has received widespread attention and many interesting applications have been derived from it. Traditional approaches use a 3D pictorial structure model to handle this task. However, these models suffer from high computation costs and low accuracy in joint detection. Recently, especially since the introduction of deep neural networks, one popular approach is to build a pipeline that involves three separate steps: (1) 2D skeleton detection in each camera view, (2) identification of matched 2D skeletons and (3) estimation of the 3D poses. Many existing works operate by feeding the 2D images and camera parameters through the three modules in a cascade fashion. However, all three operations can be highly correlated. For example, the 3D generation results may affect the detection results in step 1, as does the matching algorithm in step 2. To address this, we propose a novel end-to-end training scheme that brings the three separate modules into a single model. However, one outstanding problem of doing so is that the matching algorithm in step 2 appears to disjoint the pipeline. Therefore, we take our inspiration from the recent success of Capsule Networks, in which the Dynamic Routing step is also disjointed but plays a crucial role in deciding how gradients flow from the upper to the lower layers. Similarly, a dynamic matching module in our work decides the paths through which gradients flow from step 3 to step 1. Furthermore, when a large number of cameras are present, existing matching algorithms either fail to deliver robust performance or are very inefficient. Thus, we additionally propose a novel matching algorithm that can match 2D poses from multiple views efficiently. The algorithm is robust and able to deal with incomplete and false 2D detections as well."



Paperid:1269
Authors:Morteza Ghahremani, Bernard Tiddeman, Yonghuai Liu, and Ardhendu Behera
Title: Orderly Disorder in Point Cloud Domain
Abstract:
In the real world, out-of-distribution samples, noise and distortions exist in test data. Existing deep networks developed for point cloud data analysis are prone to overfitting, and a partial change in the test data leads to unpredictable behaviour of the networks. In this paper, we propose a smart yet simple deep network for analysis of 3D models using `orderly disorder' theory. Orderly disorder is a way of describing the complex structure of disorders within complex systems. Our method extracts the deep patterns inside a 3D object by creating a dynamic link to seek the most stable patterns while discarding the unstable ones. Patterns are more robust to changes in data distribution, especially those that appear in the top layers. Features are extracted via an innovative cloning decomposition technique and then linked to each other to form stable complex patterns. Our model alleviates the vanishing-gradient problem, strengthens dynamic link propagation and substantially reduces the number of parameters. Extensive experiments on challenging benchmark datasets verify the superiority of our light network on segmentation and classification tasks, especially in the presence of noise, where our network's performance drops by less than 10\% while state-of-the-art networks fail to work."



Paperid:1270
Authors:Dongdong Chen, Mike E. Davies
Title: Deep Decomposition Learning for Inverse Imaging Problems
Abstract:
Deep learning is emerging as a new paradigm for solving inverse imaging problems. However, deep learning methods often lack the guarantees of traditional physics-based methods because physical information is not taken into account during network training and deployment. Appropriate supervision and explicit calibration using information from the physics model can enhance neural network learning and its practical performance. In this paper, inspired by the fact that data can be decomposed into two components lying in the null space of the forward operator and the range space of its pseudo-inverse, we train neural networks to learn the two components and therefore learn the decomposition; i.e., we explicitly reformulate the neural network layers as learning range-nullspace decomposition functions with reference to the layer inputs, instead of learning unreferenced functions. We empirically show that the proposed framework demonstrates superior performance over recent deep residual learning, unrolled learning and nullspace learning on tasks including compressive sensing medical imaging and natural image super-resolution. Our code is available at https://github.com/edongdongchen/DDN."
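
The range-nullspace decomposition can be made concrete with the following sketch, in which the forward operator and its pseudo-inverse are dense matrices and net is any network mapping a vectorised image to a vectorised image; this illustrates the decomposition only, not the DDN architecture itself.

import torch

def decomposition_reconstruction(net, A, A_pinv, y):
    """A: (M, N) forward operator, A_pinv: (N, M) pseudo-inverse, y: (M,) measurements."""
    x_range = A_pinv @ y                  # range-space component, fixed by the physics
    residual = net(x_range)               # network proposes missing image content
    # Keep only the null-space part of the network output, so that
    # A @ x_hat still reproduces the measurements (data consistency).
    x_null = residual - A_pinv @ (A @ residual)
    return x_range + x_null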



Paperid:1271
Authors:Gilles Puy, Alexandre Boulch, Renaud Marlet
Title: FLOT: Scene Flow on Point Clouds guided by Optimal Transport
Abstract:
We propose and study a method called FLOT that estimates scene flow on point clouds. We start the design of FLOT by noticing that, in a perfect world, scene flow estimation on point clouds reduces to estimating a permutation matrix. Inspired by recent works on graph matching, we build a method to find these correspondences by borrowing tools from optimal transport. Then, we relax the transport constraints to take into account real-world imperfections. The transport cost between two points is given by the pairwise similarity between deep features extracted by a neural network trained under full supervision on synthetic datasets. Our main finding is that FLOT can perform as well as the best existing methods on synthetic and real-world datasets while requiring far fewer parameters and without using multiscale analysis. Our second finding is that, on the training datasets considered, most of the performance can be explained by the learned transport cost. This yields a simpler method, FLOT$_0$, which is obtained using a particular choice of optimal transport parameters and performs nearly as well as FLOT."
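
The optimal-transport step can be illustrated with a few balanced Sinkhorn iterations on a feature-similarity cost, after which the transport plan moves each source point towards the target cloud; the relaxed (unbalanced) marginals and learned cost of FLOT are omitted, and the hyper-parameters are assumptions.

import torch
import torch.nn.functional as F

def sinkhorn_flow(p_src, p_tgt, f_src, f_tgt, eps=0.03, iters=50):
    """p_*: (N, 3)/(M, 3) points, f_*: (N, D)/(M, D) features of the two clouds."""
    f_src = F.normalize(f_src, dim=1)
    f_tgt = F.normalize(f_tgt, dim=1)
    cost = 1.0 - f_src @ f_tgt.t()                 # (N, M) transport cost
    K = torch.exp(-cost / eps)                     # Gibbs kernel
    a = torch.full((p_src.shape[0],), 1.0 / p_src.shape[0], device=p_src.device)
    b = torch.full((p_tgt.shape[0],), 1.0 / p_tgt.shape[0], device=p_tgt.device)
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):                         # balanced Sinkhorn iterations
        u = a / (K @ v + 1e-8)
        v = b / (K.t() @ u + 1e-8)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)     # (N, M) transport plan
    # Each source point flows towards its plan-weighted average target position.
    target = (plan @ p_tgt) / (plan.sum(dim=1, keepdim=True) + 1e-8)
    return target - p_src                          # estimated scene flow (N, 3)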



Paperid:1272
Authors:Carolina Raposo, Joao P. Barreto
Title: Accurate Reconstruction of Oriented 3D Points using Affine Correspondences
Abstract:
Affine correspondences (ACs) have been an active topic of research, namely for the recovery of surface normals. However, current solutions still suffer from the fact that even state-of-the-art affine feature detectors are inaccurate, and ACs are often contaminated by large levels of noise, yielding poor surface normals. This article provides new formulations for achieving epipolar geometry-consistent ACs which, besides leading to linear solvers that are up to 30$\times$ faster than the state-of-the-art alternatives, allow for a fast refinement scheme that significantly improves the quality of the noisy ACs. In addition, a tracker that automatically enforces the epipolar geometry is proposed, with experiments showing that it significantly outperforms competing methods in situations of low texture. This opens the way to application domains where the scenes are typically low-textured, such as arthroscopic procedures."



Paperid:1273
Authors:Seungryong Kim, Sabine Ssstrunk, Mathieu Salzmann
Title: Volumetric Transformer Networks
Abstract:
Existing techniques to encode spatial invariance within deep convolutional neural networks (CNNs) apply the same warping field to all the feature channels. This does not account for the fact that the individual feature channels can represent different semantic parts, which can undergo different spatial transformations w.r.t. a canonical configuration. To overcome this limitation, we introduce a learnable module, the volumetric transformer network (VTN), that predicts channel-wise warping fields so as to reconfigure intermediate CNN features spatially and channel-wisely. We design our VTN as an encoder-decoder network, with modules dedicated to letting the information flow across the feature channels, to account for the dependencies between the semantic parts. We further propose a loss function defined between the warped features of pairs of instances, which improves the localization ability of VTN. Our experiments show that VTN consistently boosts the features' representation power and consequently the networks' accuracy on fine-grained image recognition and instance-level image retrieval."



Paperid:1274
Authors:Benjamin Davidson, Mohsan S. Alvi, João F. Henriques
Title: 360° Camera Alignment via Segmentation
Abstract:
Panoramic 360º images taken under unconstrained conditions present a significant challenge to current state-of-the-art recognition pipelines, since the assumption of a mostly upright camera is no longer valid. In this work, we investigate how to solve this problem by fusing purely geometric cues, such as apparent vanishing points, with learned semantic cues, such as the expectation that some visual elements (e.g. doors) have a natural upright position. We train a deep neural network to leverage these cues to segment the image-space endpoints of an imagined “vertical axis”, which is orthogonal to the ground plane of a scene, thus levelling the camera. We show that our segmentation-based strategy significantly increases performance, reducing errors by half, compared to the current state-of-the-art on two datasets of 360º imagery. We also demonstrate the importance of 360º camera levelling by analysing its impact on downstream tasks, finding that incorrect levelling severely degrades the performance of real-world computer vision pipelines."



Paperid:1275
Authors:Bin Wang, Yongsheng Gao
Title: A Novel Line Integral Transform for 2D Affine-Invariant Shape Retrieval
Abstract:
The Radon transform is a popular mathematical tool for shape analysis. However, it cannot handle affine deformation. Although its extended version, the trace transform, allows us to construct affine invariants, these invariants are less informative and computationally expensive due to the loss of the spatial relationship between trace lines and the extensive repeated calculation of the transform. To address this issue, a novel line integral transform is proposed. We first use binding line pairs that have the desirable property of affine preservation as a reference frame to rewrite the diametrical dimension parameters of the lines in a relative manner, which makes them independent of the affine transform. Along the polar angle dimension of the line parameters, a moment-based normalization is then conducted to reduce the affine transform to a similarity transform, which can be easily normalized by the Fourier transform. The proposed transform is not only invariant to affine transform, but also preserves the spatial relationship between line integrals, which makes it very informative. Another advantage of the proposed transform is that it is more efficient than the trace transform: conducting it once yields a 2D matrix of affine invariants, whereas conducting the trace transform once only generates a single feature, and multiple trace transforms with different functionals are needed to derive more features and make the descriptors informative. The effectiveness of the proposed transform has been validated on two types of standard shape test cases: an affinely distorted contour shape dataset and a region shape dataset."



Paperid:1276
Authors:Federico Baldassarre, Kevin Smith, Josephine Sullivan, Hossein Azizpour
Title: Explanation-based Weakly-supervised Learning of Visual Relations with Graph Networks
Abstract:
Visual relationship detection is fundamental for holistic image understanding. However, the localization and classification of (subject, predicate, object) triplets remain challenging tasks, due to the combinatorial explosion of possible relationships, their long-tailed distribution in natural images, and an expensive annotation process. This paper introduces a novel weakly-supervised method for visual relationship detection that relies on minimal image-level predicate labels. A graph neural network is trained to classify predicates in images from a graph representation of detected objects, implicitly encoding an inductive bias for pairwise relations. We then frame relationship detection as the explanation of such a predicate classifier, i.e. we obtain a complete relation by recovering the subject and object of a predicted predicate. We present results comparable to recent fully- and weakly-supervised methods on three diverse and challenging datasets: HICO-DET for human-object interaction, Visual Relationship Detection for generic object-to-object relations, and UnRel for unusual triplets; demonstrating robustness to non-comprehensive annotations and good few-shot generalization."



Paperid:1277
Authors:Sangryul Jeon, Dongbo Min, Seungryong Kim, Jihwan Choe, Kwanghoon Sohn
Title: Guided Semantic Flow
Abstract:
Establishing dense semantic correspondences requires dealing with large geometric variations caused by the unconstrained setting of images. To address such severe matching ambiguities, we introduce a novel approach, called {guided semantic flow}, based on the key insight that sparse yet reliable matches can effectively capture non-rigid geometric variations, and these confident matches can guide adjacent pixels to have similar solution spaces, reducing the matching ambiguities significantly. We realize this idea with learning-based selection of confident matches from an initial set of all pairwise matching scores and their propagation by a new differentiable upsampling layer based on moving least square concept. We take advantage of the guidance from reliable matches to refine the matching hypotheses through Gaussian parametric model in the subsequent matching pipeline. With the proposed method, state-of-the-art performance is attained on several standard benchmarks for semantic correspondence."



Paperid:1278
Authors:Mausoom Sarkar, Milan Aggarwal, Arneh Jain, Hiresh Gupta, Balaji Krishnamurthy
Title: Document Structure Extraction using Prior based High Resolution Hierarchical Semantic Segmentation
Abstract:
Structure extraction from document images has been a long-standing research topic due to its high impact on a wide range of practical applications. In this paper, we share our findings on employing a hierarchical semantic segmentation network for this task of structure extraction. We propose a prior-based deep hierarchical CNN network architecture that enables document structure extraction using very high resolution (1800 x 1000) images. We divide the document image into overlapping horizontal strips such that the network segments a strip and uses its prediction mask as prior for predicting the segmentation of the subsequent strip. We perform experiments establishing the effectiveness of our strip-based network architecture through ablation methods and comparison with low-resolution variations. Further, to demonstrate our network's capabilities, we train it on only one type of document (Forms) and achieve state-of-the-art results over other general document datasets. We introduce our new human-annotated forms dataset and show that our method significantly outperforms different segmentation baselines on this dataset in extracting hierarchical structures. Our method is currently being used in Adobe's AEM Forms for automated conversion of paper and PDF forms to modern HTML based forms."



Paperid:1279
Authors:Matthias Tangemann, Matthias Kümmerer, Thomas S.A. Wallis, Matthias Bethge
Title: Measuring the Importance of Temporal Features in Video Saliency
Abstract:
Where people look when watching videos is believed to be heavily influenced by temporal patterns. In this work, we test this assumption by quantifying to which extent gaze on recent video saliency benchmarks can be predicted by a static baseline model. On the recent LEDOV dataset, we find that at least 75% of the explainable information as defined by a gold standard model can be explained using static features. Our baseline model ``DeepGaze MR'' even outperforms state-of-the-art video saliency models, despite deliberately ignoring all temporal patterns. Visual inspection of our static baseline’s failure cases shows that clear temporal effects on human gaze placement exist, but are both rare in the dataset and not captured by any of the recent video saliency models. To focus the development of video saliency models on better capturing temporal effects we construct a meta-dataset consisting of those examples requiring temporal information."



Paperid:1280
Authors:Haotian Tang, Zhijian Liu, Shengyu Zhao, Yujun Lin, Ji Lin, Hanrui Wang, Song Han
Title: Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution
Abstract:
Self-driving cars need to understand 3D scenes efficiently and accurately in order to drive safely. Given the limited hardware resources, existing 3D perception models are not able to recognize small instances (e.g., pedestrians, cyclists) very well due to the low-resolution voxelization and aggressive downsampling. To this end, we propose Sparse Point-Voxel Convolution (SPVConv), a lightweight 3D module that equips the vanilla Sparse Convolution with the high-resolution point-based branch. With negligible overhead, this point-based branch is able to preserve the fine details even from large outdoor scenes. To explore the spectrum of efficient 3D models, we first define a flexible architecture design space based on SPVConv, and we then present 3D Neural Architecture Search (3D-NAS) to search the optimal network architecture over this diverse design space efficiently and effectively. Experimental results validate that the resulting SPVNAS model is fast and accurate: it outperforms the state-of-the-art MinkowskiNet by 3.3%, ranking 1st on the competitive SemanticKITTI leaderboard. It also achieves 8-24× computation reduction and 3-35× measured speedup over MinkowskiNet and KPConv with higher accuracy. Finally, we transfer our method to 3D object detection, and it achieves consistent improvements over the one-stage detection baseline on KITTI."



Paperid:1281
Authors:Leonardo Citraro, Mateusz Koziński, Pascal Fua
Title: Towards Reliable Evaluation of Algorithms for Road Network Reconstruction from Aerial Images
Abstract:
Existing connectivity-oriented performance measures rank road delineation algorithms inconsistently, which makes it difficult to decide which one is best for a given application. We show that these inconsistencies stem from design flaws that make the metrics insensitive to whole classes of errors. This insensitivity is undesirable in metrics intended for capturing overall general quality of road reconstructions. In particular, the scores do not reflect the time needed for a human to fix the errors, because each one has to be fixed individually. To provide more reliable evaluation, we design three new metrics that are sensitive to all classes of errors. This sensitivity makes them more consistent even though they use very different approaches to comparing ground-truth and reconstructed road networks. We use both synthetic and real data to demonstrate this and advocate the use of these corrected metrics as a tool to gauge future progress."



Paperid:1282
Authors:Enrico Fini, Stéphane Lathuilière, Enver Sangineto, Moin Nabi, Elisa Ricci
Title: Online Continual Learning under Extreme Memory Constraints
Abstract:
Continual Learning (CL) aims to develop agents emulating the human ability to sequentially learn new tasks while being able to retain knowledge obtained from past experiences. In this paper, we introduce the novel problem of Memory-Constrained Online Continual Learning (MC-OCL) which imposes strict constraints on the memory overhead that a possible algorithm can use to avoid catastrophic forgetting. As most, if not all, previous CL methods violate these constraints, we propose an algorithmic solution to MC-OCL: Batch-level Distillation (BLD), a regularization-based CL approach which effectively balances stability and plasticity in order to learn from data streams, while preserving the ability to solve old tasks through distillation. Our extensive experimental evaluation, conducted on three publicly available benchmarks, empirically demonstrates that our approach successfully addresses the MC-OCL problem and achieves comparable accuracy to prior distillation methods requiring higher memory overhead."
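
Distillation is the mechanism BLD uses to preserve knowledge of old tasks. The sketch below shows a generic distillation regularizer between a frozen copy of the network and the current one; the temperature, weighting and batch-level details of BLD itself are not reproduced here:

```python
# Generic distillation regularizer between a frozen "teacher" snapshot and the
# current "student" network (illustrative, not the exact BLD formulation).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL divergence between temperature-softened distributions.
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

# Usage: total_loss = task_loss + lambda_distill * distillation_loss(s, t)
s, t = torch.randn(8, 10), torch.randn(8, 10)
print(distillation_loss(s, t))
```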



Paperid:1283
Authors:Willi Menapace, Stéphane Lathuilière, Elisa Ricci
Title: Learning to Cluster under Domain Shift
Abstract:
While unsupervised domain adaptation methods based on deep architectures have achieved remarkable success in many computer vision tasks, they rely on a strong assumption, i.e. labeled source data must be available. In this work we overcome this assumption and we address the problem of transferring knowledge from a source to a target domain when both source and target data have no annotations. Inspired by recent works on deep clustering, our approach leverages information from data gathered from multiple source domains to build a domain-agnostic clustering model which is then refined at inference time when target data become available. Specifically, at training time we propose to optimize a novel information-theoretic loss which, coupled with domain-alignment layers, ensures that our model learns to correctly discover semantic labels while discarding domain-specific features. Importantly, our architecture design ensures that at inference time the resulting source model can be effectively adapted to the target domain without having access to source data, thanks to feature alignment and self-supervision. We evaluate the proposed approach in a variety of settings, considering several domain adaptation benchmarks and we show that our method is able to automatically discover relevant semantic information even in presence of few target samples and yields state-of-the-art results on multiple domain adaptation benchmarks."
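
The abstract refers to an information-theoretic loss for discovering semantic clusters without labels. One common objective of this kind, shown purely as a hedged illustration (the paper's exact loss and domain-alignment layers are not reproduced), makes per-sample cluster assignments confident while keeping cluster usage balanced across the batch:

```python
# Mutual-information-style clustering objective: minimize per-sample assignment
# entropy while maximizing the entropy of the batch-averaged assignment.
import torch
import torch.nn.functional as F

def clustering_info_loss(logits, eps=1e-8):
    p = F.softmax(logits, dim=1)                        # soft cluster assignments
    p_mean = p.mean(dim=0)                              # marginal over the batch
    marginal_entropy = -(p_mean * (p_mean + eps).log()).sum()
    conditional_entropy = -(p * (p + eps).log()).sum(dim=1).mean()
    # Lower is better: confident samples, balanced cluster usage.
    return conditional_entropy - marginal_entropy

logits = torch.randn(32, 10)   # 32 samples, 10 candidate clusters
print(clustering_info_loss(logits))
```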



Paperid:1284
Authors:Yueru Li, Shuyu Cheng, Hang Su, Jun Zhu
Title: Defense Against Adversarial Attacks via Controlling Gradient Leaking on Embedded Manifolds
Abstract:
Deep neural networks are vulnerable to adversarial attacks. Though various attempts have been made, it is still largely open to fully understand the existence of adversarial samples and thereby develop effective defense strategies. In this paper, we present a new perspective, namely the gradient leaking hypothesis, to understand the existence of adversarial examples and to further motivate effective defense strategies. Specifically, we consider the low dimensional manifold structure of natural images, and empirically verify that the leakage of the gradient (w.r.t. the input) along the direction (approximately) perpendicular to the tangent space of the data manifold is a reason for the vulnerability to adversarial attacks. Based on our investigation, we further present a new robust learning algorithm which encourages a larger gradient component in the tangent space of the data manifold, consequently suppressing the gradient leaking phenomenon. Experiments on various tasks demonstrate the effectiveness of our algorithm despite its simplicity."



Paperid:1285
Authors:Markus Hofinger, Samuel Rota Bulò, Lorenzo Porzi, Arno Knapitsch, Thomas Pock, Peter Kontschieder
Title: Improving Optical Flow on a Pyramid Level
Abstract:
In this work we review the coarse-to-fine spatial feature pyramid concept, which is used in state-of-the-art optical flow estimation networks to make exploration of the pixel flow search space computationally tractable and efficient. Within an individual pyramid level, we improve the cost volume construction process by departing from a warping- to a sampling-based strategy, which avoids ghosting and hence enables us to better preserve fine flow details. We further amplify the positive effects through a level-specific, loss max-pooling strategy that adaptively shifts the focus of the learning process onto under-performing predictions. Our second contribution revises the gradient flow across pyramid levels. The typical operations performed at each pyramid level can lead to noisy, or even contradicting, gradients across levels. We show and discuss how properly blocking some of these gradient components leads to improved convergence and ultimately better performance. Finally, we introduce a distillation concept to counteract the issue of catastrophic forgetting and thus preserve knowledge across models sequentially trained on multiple datasets. Our findings are conceptually simple and easy to implement, yet result in compelling improvements on relevant error measures that we demonstrate via exhaustive ablations on datasets like Flying Chairs2, Flying Things, Sintel and KITTI. We establish new state-of-the-art results on the challenging Sintel and KITTI 2012 test datasets, and even show the portability of our findings to different optical flow and depth from stereo approaches."
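
The loss max-pooling idea can be illustrated independently of the flow architecture: instead of averaging all per-pixel losses, only the hardest fraction contributes. The sketch below is a simplified stand-in; the paper's level-specific, adaptive scheme differs in detail:

```python
# Simplified loss max-pooling: average only the hardest fraction of per-pixel
# losses so training focuses on under-performing predictions.
import torch

def loss_max_pooling(per_pixel_loss, hard_fraction=0.25):
    flat = per_pixel_loss.flatten()
    k = max(1, int(hard_fraction * flat.numel()))
    hardest, _ = torch.topk(flat, k)   # largest losses
    return hardest.mean()

# Example with a hypothetical per-pixel endpoint-error map of a flow estimate.
epe = torch.rand(4, 1, 96, 128)
loss = loss_max_pooling(epe)
```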



Paperid:1286
Authors:Sungheon Park, Minsik Lee, Nojun Kwak
Title: Procrustean Regression Networks: Learning 3D Structure of Non-Rigid Objects from 2D Annotations
Abstract:
We propose a novel framework for training neural networks which is capable of learning 3D information of non-rigid objects when only 2D annotations are available as ground truths. Recently, there have been some approaches that incorporate the problem setting of non-rigid structure-from-motion (NRSfM) into deep learning to learn 3D structure reconstruction. The most important difficulty of NRSfM is to estimate both the rotation and deformation at the same time, and previous works handle this by regressing both of them. In this paper, we resolve this difficulty by proposing a loss function wherein the suitable rotation is automatically determined. Trained with the cost function consisting of the reprojection error and the low-rank term of aligned shapes, the network learns the 3D structures of such objects as human skeletons and faces during the training, whereas the testing is done on a single-frame basis. The proposed method can handle inputs with missing entries and experimental results validate that the proposed framework shows superior reconstruction performance to the state-of-the-art method on the Human 3.6M, 300-VW, and SURREAL datasets, even though the underlying network structure is very simple."



Paperid:1287
Authors:Duo Li, Anbang Yao, Qifeng Chen
Title: Learning to Learn Parameterized Classification Networks for Scalable Input Images
Abstract:
Convolutional Neural Networks (CNNs) do not have a predictable recognition behavior with respect to input resolution changes. This prevents the feasibility of deployment on different input image resolutions for a specific model. To achieve efficient and flexible image classification at runtime, we employ meta learners to generate convolutional weights of main networks for various input scales and maintain privatized Batch Normalization layers per scale. For improved training performance, we further utilize knowledge distillation on the fly over model predictions based on different input resolutions. The learned meta network could dynamically parameterize main networks to act on input images of arbitrary size with consistently better accuracy compared to individually trained models. Extensive experiments on ImageNet demonstrate that our method achieves an improved accuracy-efficiency trade-off during the adaptive inference process. By switching executable input resolutions, our method could satisfy the requirement of fast adaptation in different resource-constrained environments. Code and models are available at https://github.com/d-li14/SAN"
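
The core mechanism, a meta learner that outputs convolution weights conditioned on the input scale together with a private BatchNorm per resolution, can be sketched as follows. All names, layer sizes and the set of scales are made-up illustrations, not the released SAN code:

```python
# Toy scale-conditioned convolution: a tiny meta-learner maps the input
# resolution to convolution weights, and each resolution keeps its own
# (privatized) BatchNorm statistics.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleConditionedConv(nn.Module):
    def __init__(self, c_in=3, c_out=16, k=3, scales=(128, 160, 224)):
        super().__init__()
        self.c_in, self.c_out, self.k = c_in, c_out, k
        # Meta-learner: scale -> flattened conv weights.
        self.meta = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                                  nn.Linear(64, c_out * c_in * k * k))
        # Privatized BatchNorm layers, one per supported input resolution.
        self.bns = nn.ModuleDict({str(s): nn.BatchNorm2d(c_out) for s in scales})

    def forward(self, x):
        s = x.shape[-1]
        w = self.meta(torch.tensor([[float(s)]], device=x.device))
        w = w.view(self.c_out, self.c_in, self.k, self.k)
        y = F.conv2d(x, w, padding=self.k // 2)
        return self.bns[str(s)](y)

m = ScaleConditionedConv()
out = m(torch.randn(2, 3, 160, 160))   # works for any registered scale
```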



Paperid:1288
Authors:Yuanhao Wang, Ramzi Idoughi, Wolfgang Heidrich
Title: Stereo Event-based Particle Tracking Velocimetry for 3D Fluid Flow Reconstruction
Abstract:
Existing Particle Imaging Velocimetry techniques require the use of high-speed cameras to reconstruct time-resolved fluid flows. These cameras provide high-resolution images at high frame rates, which generates bandwidth and memory issues. By capturing only changes in brightness with very low latency and at a low data rate, event-based cameras have the ability to tackle such issues. In this paper, we present a new framework that retrieves dense 3D measurements of the fluid velocity field using a pair of event-based cameras. First, we track particles inside the two event sequences in order to estimate their 2D velocity in the two sequences of images. A stereo-matching step is then performed to retrieve their 3D positions. These intermediate outputs are incorporated into an optimization framework that also includes physically plausible regularizers, in order to retrieve the 3D velocity field. Extensive experiments on both simulated and real data demonstrate the efficacy of our approach."



Paperid:1289
Authors:Charu Sharma, Manohar Kaul
Title: Simplicial Complex based Point Correspondence between Images warped onto Manifolds
Abstract:
Recent increase in the availability of warped images projected onto a manifold (e.g., omnidirectional spherical images), coupled with the success of higher-order assignment methods, has sparked an interest in the search for improved higher-order matching algorithms on warped images due to projection. Although currently, several existing methods "flatten" such 3D images to use planar graph / hypergraph matching methods, they still suffer from severe distortions and other undesired artifacts, which result in inaccurate matching. Alternatively, current planar methods cannot be trivially extended to effectively match points on images warped onto manifolds. Hence, matching on these warped images persists as a formidable challenge. In this paper, we pose the assignment problem as finding a bijective map between two graph induced simplicial complexes, which are higher-order analogues of graphs. We propose a constrained quadratic assignment problem (QAP) that matches each p-skeleton of the simplicial complexes, iterating from the highest to the lowest dimension. The accuracy and robustness of our approach are illustrated on both synthetic and real-world spherical / warped (projected) images with known ground-truth correspondences. We significantly outperform existing state-of-the-art spherical matching methods on a diverse set of datasets."



Paperid:1290
Authors:Effrosyni Mavroudi, Benjamín Béjar Haro, René Vidal
Title: Representation Learning on Visual-Symbolic Graphs for Video Understanding
Abstract:
Events in natural videos typically arise from spatio-temporal interactions between actors and objects and involve multiple co-occurring activities and object classes. To capture this rich visual and semantic context, we propose using two graphs:(1) an attributed spatio-temporal visual graph whose nodes correspond to actors and objects and whose edges encode different types of interactions, and (2) a symbolic graph that models semantic relationships. We further propose a graph neural network for refining the representations of actors, objects and their interactions on the resulting hybrid graph. Our framework goes beyond current approaches that assume nodes and edges of the same type, operate on a fixed graph structure and do not use a symbolic graph. In particular, our framework: a) has specialized attention-based aggregation functions for different node and edge types; b) uses visual edge features; c) integrates visual evidence with label relationships; and d) performs global reasoning in the semantic space. Experiments on challenging video understanding tasks, such as temporal action localization on the Charades dataset, show that the proposed method leads to state-of-the-art performance."



Paperid:1291
Authors:Xuepeng Shi, Zhixiang Chen, Tae-Kyun Kim
Title: Distance-Normalized Unified Representation for Monocular 3D Object Detection
Abstract:
Monocular 3D object detection plays an important role in autonomous driving and still remains challenging. To achieve fast and accurate monocular 3D object detection, we introduce a single-stage and multi-scale framework to learn a unified representation for objects within different distance ranges, termed as UR3D. UR3D formulates different tasks of detection by exploiting the scale information, to reduce model capacity requirement and achieve accurate monocular 3D object detection. Besides, distance estimation is enhanced by a distance-guided NMS, which automatically selects candidate boxes with better distance estimates. In addition, an efficient fully convolutional cascaded point regression method is proposed to infer accurate locations of the projected 2D corners and centers of 3D boxes, which can be used to recover object physical size and orientation by a projection-consistency loss. Experimental results on the challenging KITTI autonomous driving dataset show that UR3D achieves accurate monocular 3D object detection with a compact architecture."



Paperid:1292
Authors:Shanyu Xiao, Liangrui Peng, Ruijie Yan, Keyu An, Gang Yao, Jaesik Min
Title: Sequential Deformation for Accurate Scene Text Detection
Abstract:
Scene text detection has been significantly advanced over recent years, especially after the emergence of deep neural networks. However, due to the high diversity of scene text in scale, orientation, shape and aspect ratio, as well as the inherent limitations of convolutional neural networks for geometric transformations, achieving accurate scene text detection is still an open problem. In this paper, we propose a novel sequential deformation method to effectively model the line-shape of scene text. An auxiliary character counting supervision is further introduced to guide the sequential offset prediction. The whole network can be easily optimized in an end-to-end multi-task manner. Extensive experiments are conducted on public scene text detection datasets including ICDAR 2017 MLT, ICDAR 2015, Total-text and SCUT-CTW1500. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art methods."



Paperid:1293
Authors:Yiming Wang, Alessio Del Bue
Title: Where to Explore Next? ExHistCNN for History-aware Autonomous 3D Exploration
Abstract:
In this work we address the problem of autonomous 3D exploration of an unknown indoor environment using a depth camera. We cast the problem as the estimation of the Next Best View (NBV) that maximises the coverage of the unknown area. We do this by re-formulating NBV estimation as a classification problem and we propose a novel learning-based metric that encodes both the current 3D observation (a depth frame) and the history of the ongoing reconstruction. One of the major contributions of this work is the introduction of a new representation for the 3D reconstruction history as an auxiliary utility map which is efficiently coupled with the current depth observation. With both pieces of information, we train a light-weight CNN, named ExHistCNN, that estimates the NBV as a set of directions towards which the depth sensor finds the most unexplored areas. We perform extensive evaluation on both synthetic and real room scans demonstrating that the proposed ExHistCNN is able to approach the exploration performance of an oracle using complete knowledge of the 3D environment."



Paperid:1294
Authors:Robert Mendel, Luis Antonio de Souza Jr, David Rauber, João Paulo Papa, Christoph Palm
Title: Semi-Supervised Segmentation based on Error-Correcting Supervision
Abstract:
Pixel-level classification is an essential part of computer vision. For learning from labeled data, many powerful deep learning models have been developed recently. In this work, we augment such supervised segmentation models by allowing them to learn from unlabeled data. Our semi-supervised approach, termed Error-Correcting Supervision, leverages a collaborative strategy. Apart from the supervised training on the labeled data, the segmentation network is judged by an additional network. The secondary correction network learns on the labeled data to optimally spot correct predictions, as well as to amend incorrect ones. As auxiliary regularization term, the corrector directly influences the supervised training of the segmentation network. On unlabeled data, the output of the correction network is essential to create a proxy for the unknown truth. The corrector's output is combined with the segmentation network's prediction to form the new target. We propose a loss function that incorporates both the pseudo-labels as well as the predictive certainty of the correction network. Our approach can easily be added to supervised segmentation models. We show consistent improvements over a supervised baseline on experiments on both the Pascal VOC 2012 and the Cityscapes datasets with varying amounts of labeled data. "



Paperid:1295
Authors:Junde Li, Swaroop Ghosh
Title: Quantum-soft QUBO Suppression for Accurate Object Detection
Abstract:
Non-maximum suppression (NMS) has been adopted by default for removing redundant object detections for decades. It eliminates false positives by keeping only the detection M with the highest score and the detections whose overlap ratio with M is less than a predefined threshold. However, this greedy algorithm may not work well for object detection in occlusion scenarios, where true positives with lower detection scores are possibly suppressed. In this paper, we first map the task of removing redundant detections into a Quadratic Unconstrained Binary Optimization (QUBO) framework that combines the detection score of each bounding box and the overlap ratio between pairs of bounding boxes. Next, we solve the QUBO problem using the proposed Quantum-soft QUBO Suppression algorithm for fast and accurate detection by exploiting quantum computing advantages. Experiments indicate that our method improves mAP from 74.20 to 75.11 on PASCAL VOC2007. On the pedestrian detection benchmark CityPersons, it consistently outperforms NMS and soft-NMS on the Reasonable subset."
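
The QUBO formulation itself is easy to write down: a binary keep/suppress vector x minimizes x^T Q x, where the diagonal rewards high detection scores and the off-diagonal penalizes overlap. The sketch below builds such a Q and solves it by brute force for a toy example; the quantum-inspired solver of the paper is not reproduced, and the penalty weight w is an assumption:

```python
# QUBO for detection suppression: Q[i,i] = -score_i, Q[i,j] = 0.5*w*IoU(i,j).
import itertools
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def qubo_suppression(boxes, scores, w=2.0):
    n = len(boxes)
    Q = np.zeros((n, n))
    for i in range(n):
        Q[i, i] = -scores[i]                                   # reward for keeping box i
        for j in range(i + 1, n):
            Q[i, j] = Q[j, i] = 0.5 * w * iou(boxes[i], boxes[j])  # overlap penalty
    # Brute force over keep/suppress assignments (fine for tiny n; a real
    # solver such as annealing replaces this loop).
    best = min(itertools.product([0, 1], repeat=n),
               key=lambda x: np.array(x) @ Q @ np.array(x))
    return [i for i, keep in enumerate(best) if keep]

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
print(qubo_suppression(boxes, scores=[0.9, 0.8, 0.7]))   # -> [0, 2]
```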



Paperid:1296
Authors:Ürün Dogan, Aniket Anand Deshmukh, Marcin Bronislaw Machura, Christian Igel
Title: Label-similarity Curriculum Learning
Abstract:
Curriculum learning can improve neural network training by guiding the optimization to desirable optima. We propose a novel curriculum learning approach for image classification that adapts the loss function by changing the label representation. The idea is to use a probability distribution over classes as target label, where the class probabilities reflect the similarity to the true class. Gradually, this label representation is shifted towards the standard one-hot-encoding. That is, in the beginning minor mistakes are corrected less than large mistakes, resembling a teaching process in which broad concepts are explained first before subtle differences are taught. The class similarity can be based on prior knowledge. For the special case of the labels being natural words, we propose a generic way to automatically compute the similarities. The natural words are embedded into Euclidean space using a standard word embedding. The probability of each class is then a function of the cosine similarity between the vector representations of the class and the true label. The proposed label-similarity curriculum learning (LCL) approach was empirically evaluated using several popular deep learning architectures for image classification tasks applied to five datasets including ImageNet, CIFAR100, and AWA2. In all scenarios, LCL was able to improve the classification accuracy on the test data compared to standard training. Code to reproduce results is available at https://github.com/speedystream/LCL."
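
The curriculum target can be sketched directly: a softmax over class-name embedding similarities whose temperature is annealed towards a one-hot label. The embeddings below are random stand-ins for a real word embedding, and the temperature schedule is an assumption:

```python
# Label-similarity curriculum targets: soft labels from class-name similarity,
# annealed towards one-hot as the temperature decreases during training.
import numpy as np

def curriculum_targets(true_idx, class_emb, temperature):
    # Cosine similarity between the true class and every class name.
    e = class_emb / np.linalg.norm(class_emb, axis=1, keepdims=True)
    sim = e @ e[true_idx]
    # High temperature -> soft distribution; low temperature -> near one-hot.
    logits = sim / max(temperature, 1e-8)
    p = np.exp(logits - logits.max())
    return p / p.sum()

class_emb = np.random.default_rng(0).standard_normal((100, 300))
soft = curriculum_targets(true_idx=3, class_emb=class_emb, temperature=1.0)
hard = curriculum_targets(true_idx=3, class_emb=class_emb, temperature=0.01)
```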



Paperid:1297
Authors:Ayushi Dutta, Yashaswi Verma, C.V. Jawahar
Title: Recurrent Image Annotation With Explicit Inter-Label Dependencies
Abstract:
Inspired by the success of the CNN-RNN framework in the image captioning task, several works have explored this in multi-label image annotation with the hope that the RNN followed by a CNN would encode inter-label dependencies better than using a CNN alone. To do so, for each training sample, the earlier methods converted the ground-truth label-set into a sequence of labels based on their frequencies (e.g., rare-to-frequent) for training the RNN. However, since the ground-truth is an unordered set of labels, imposing a fixed and predefined sequence on them does not naturally align with this task. To address this, some of the recent papers have proposed techniques that are capable of training the RNN without feeding the ground-truth labels in a particular sequence/order. However, most of these techniques leave it to the RNN to implicitly choose one sequence for the ground-truth labels corresponding to each sample at the time of training, thus making it inherently biased. In this paper, we address this limitation and propose a novel approach in which the RNN is explicitly forced to learn multiple relevant inter-label dependencies, without the need to feed the ground-truth in any particular order. Using thorough empirical comparisons, we demonstrate that our approach outperforms several state-of-the-art techniques on two popular datasets (MS-COCO and NUS-WIDE). Additionally, it provides a new perspective of looking at an unordered set of labels as equivalent to a collection of different permutations (sequences) of those labels, thus naturally aligning with the image annotation task. Our code is available at: https://github.com/ayushidutta/multi-order-rnn"



Paperid:1298
Authors:Jing Yao, Danfeng Hong, Jocelyn Chanussot, Deyu Meng, Xiaoxiang Zhu , Zongben Xu
Title: Cross-Attention in Coupled Unmixing Nets for Unsupervised Hyperspectral Super-Resolution
Abstract:
The recent advancement of deep learning techniques has made great progress on hyperspectral image super-resolution (HSI-SR). Yet the development of unsupervised deep networks remains challenging for this task. To this end, we propose a novel coupled unmixing network with a cross-attention mechanism, CUCaNet for short, to enhance the spatial resolution of HSI by means of higher-spatial-resolution multispectral image (MSI). Inspired by coupled spectral unmixing, a two-stream convolutional autoencoder framework is taken as backbone to jointly decompose MS and HS data into a spectrally meaningful basis and corresponding coefficients. CUCaNet is capable of adaptively learning spectral and spatial response functions from HS-MS correspondences by enforcing reasonable consistency assumptions on the networks. Moreover, a cross-attention module is devised to yield more effective spectral-spatial information transfer in networks. Extensive experiments are conducted on three widely-used HS-MS datasets in comparison with state-of-the-art HSI-SR models, demonstrating the superiority of the CUCaNet in the HSI-SR application. Furthermore, the codes and datasets are made available at: https://github.com/danfenghong/ECCV2020_CUCaNet."



Paperid:1299
Authors:Tyler Zhu, Per Karlsson, Christoph Bregler
Title: SimPose: Effectively Learning DensePose and Surface Normals of People from Simulated Data
Abstract:
With a proliferation of generic domain-adaptation approaches, we report a simple yet effective technique for learning difficult per-pixel 2.5D and 3D regression representations of articulated people. We obtained strong sim-to-real domain generalization for the 2.5D DensePose estimation task and the 3D human surface normal estimation task. On the multi-person DensePose MSCOCO benchmark, our approach outperforms the state-of-the-art methods which are trained on real densely labelled images. This is an important result since obtaining the human manifold's intrinsic uv coordinates on real images is time consuming and prone to labeling noise. Additionally, we present our model's 3D surface normal predictions on the MSCOCO dataset that lacks any real 3D surface normal labels. The key to our approach is to mitigate the "Inter-domain Covariate Shift" with a carefully selected training batch from a mixture of domain samples, a deep batch-normalized residual network, and a modified multi-task learning objective. Our approach is complementary to existing domain-adaptation techniques and can be applied to other dense per-pixel pose estimation problems."



Paperid:1300
Authors:Yu-Hui Lee, Shang-Hong Lai
Title: ByeGlassesGAN: Identity Preserving Eyeglasses Removal for Face Images
Abstract:
In this paper, we propose a novel image-to-image GAN framework for eyeglasses removal, called ByeGlassesGAN, which is used to automatically detect the position of eyeglasses and then remove them from face images. Our ByeGlassesGAN consists of an encoder, a face decoder, and a segmentation decoder. The encoder is responsible for extracting information from the source face image, and the face decoder utilizes this information to generate glasses-removed images. The segmentation decoder is included to predict the segmentation mask of eyeglasses and completed face region. The feature vectors generated by the segmentation decoder are shared with the face decoder, which facilitates better reconstruction results. Our experiments show that ByeGlassesGAN can provide visually appealing results in the eyeglasses-removed face images even for semi-transparent color eyeglasses or glasses with glare. Furthermore, we demonstrate significant improvement in face recognition accuracy for face images with glasses by applying our method as a pre-processing step in our face recognition experiment."



Paperid:1301
Authors:Ying Wang, Yadong Lu, Tijmen Blankevoort
Title: Differentiable Joint Pruning and Quantization for Hardware Efficiency
Abstract:
We present a differentiable joint pruning and quantization (DJPQ) scheme. We frame neural network compression as a joint gradient-based optimization problem, trading off between model pruning and quantization automatically for hardware efficiency. DJPQ incorporates variational information bottleneck based structured pruning and mixed-bit precision quantization into a single differentiable loss function. While previous works always consider pruning and quantization separately, our method enables users to find the optimal trade-off between both in a single training procedure. To utilize the method for more efficient hardware inference, we extend DJPQ to integrate structured pruning with power-of-two bit-restricted quantization. We show that DJPQ significantly reduces the number of Bit-Operations (BOPs) for several networks while maintaining the top-1 accuracy of original floating-point models (e.g., 53x BOPs reduction in ResNet18 on ImageNet, 43x in MobileNetV2). Compared to the conventional two-stage approach, which optimizes pruning and quantization independently, our scheme outperforms in terms of both accuracy and BOPs. Even when considering bit-restricted quantization, DJPQ achieves larger compression ratios and better accuracy than the two-stage approach."
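
One ingredient, power-of-two (bit-restricted) quantization, can be illustrated in isolation; the joint, differentiable optimization of pruning and quantization parameters that defines DJPQ is not reproduced here, and the exponent range is an assumption:

```python
# Power-of-two weight quantization: snap each weight magnitude to 2^e for an
# integer exponent e within a restricted range (hardware-friendly shifts).
import torch

def power_of_two_quantize(w, min_exp=-8, max_exp=0):
    sign = torch.sign(w)
    # Round the exponent of |w| and clamp it to the allowed range.
    exp = torch.clamp(torch.round(torch.log2(w.abs().clamp_min(1e-12))),
                      min_exp, max_exp)
    return sign * torch.pow(2.0, exp)

w = torch.randn(8) * 0.3
print(power_of_two_quantize(w))
```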



Paperid:1302
Authors:Rolandos Alexandros Potamias, Jiali Zheng, Stylianos Ploumpis, Giorgos Bouritsas, Evangelos Ververas, Stefanos Zafeiriou
Title: Learning to Generate Customized Dynamic 3D Facial Expressions
Abstract:
Recent advances in deep learning have significantly pushed the state-of-the-art in photorealistic video animation given a single image. In this paper, we extrapolate those advances to the 3D domain, by studying 3D image-to-video translation with a particular focus on 4D facial expressions. Although 3D facial generative models have been widely explored during the past years, 4D animation remains relatively unexplored. To this end, in this study we employ a deep mesh encoder-decoder like architecture to synthesize realistic high resolution facial expressions by using a single neutral frame along with an expression identification. In addition, processing 3D meshes remains a non-trivial task compared to data that live on grid-like structures, such as images. Given the recent progress in mesh processing with graph convolutions, we make use of a recently introduced learnable operator which acts directly on the mesh structure by taking advantage of local vertex orderings. In order to generalize to 4D facial expressions across subjects, we trained our model using a high resolution dataset with 4D scans of six facial expressions from 180 subjects. Experimental results demonstrate that our approach preserves the subject's identity information even for unseen subjects and generates high quality expressions. To the best of our knowledge, this is the first study tackling the problem of 4D facial expression synthesis."



Paperid:1303
Authors:Jan Brejcha, Michal Lukáč, Yannick Hold-Geoffroy, Oliver Wang, Martin Čadík
Title: LandscapeAR: Large Scale Outdoor Augmented Reality by Matching Photographs with Terrain Models Using Learned Descriptors
Abstract:
We introduce a solution to large-scale Augmented Reality for outdoor scenes by registering camera images to textured Digital Elevation Models (DEMs). To accommodate the inherent differences in appearance between real images and DEMs, we train a cross-domain feature descriptor using Structure From Motion (SFM) reconstructions to acquire training data. Our method runs efficiently on a mobile device, and outperforms existing learned and hand-designed feature descriptors for this task."



Paperid:1304
Authors:Xin Li, Xin Jin, Jianxin Lin, Sen Liu, Yaojun Wu, Tao Yu, Wei Zhou , Zhibo Chen
Title: Learning Disentangled Feature Representation for Hybrid-distorted Image Restoration
Abstract:
Hybrid-distorted image restoration (HD-IR) is dedicated to restoring real distorted images that are degraded by multiple distortions. Existing HD-IR approaches usually ignore the inherent interference among hybrid distortions, which compromises the restoration performance. To decompose such interference, we introduce the concept of Disentangled Feature Learning to achieve the feature-level divide-and-conquer of hybrid distortions. Specifically, we propose the feature disentanglement module (FDM) to distribute feature representations of different distortions into different channels by revising gain-control-based normalization. We also propose a feature aggregation module (FAM) with channel-wise attention to adaptively filter out the distortion representations and aggregate useful content information from different channels for the construction of the raw image. The effectiveness of the proposed scheme is verified by visualizing the correlation matrix of features and channel responses of different distortions. Extensive experimental results also demonstrate the superior performance of our approach compared with the latest HD-IR schemes."



Paperid:1305
Authors:Sixue Gong, Xiaoming Liu, Anil K. Jain
Title: Jointly De-biasing Face Recognition and Demographic Attribute Estimation
Abstract:
We address the problem of bias in automated face recognition and demographic attribute estimation algorithms, where errors are lower on certain cohorts belonging to specific demographic groups. We present a novel de-biasing adversarial network (DebFace) that learns to extract disentangled feature representations for both unbiased face recognition and demographics estimation. The proposed network consists of one identity classifier and three demographic classifiers (for gender, age, and race) that are trained to distinguish identity and demographic attributes, respectively. Adversarial learning is adopted to minimize correlation among feature factors so as to abate bias influence from other factors. We also design a new scheme to combine demographics with identity features to strengthen robustness of face representation in different demographic groups. The experimental results show that our approach is able to reduce bias in face recognition as well as demographics estimation while achieving state-of-the-art performance."
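
A standard way to implement the adversarial component, training a demographic classifier on identity features through a gradient reversal layer, is sketched below. This is a generic illustration under that assumption, not the DebFace architecture, which uses several such classifiers plus a de-correlation scheme:

```python
# Gradient reversal layer: the demographic head is trained normally, while the
# reversed gradient pushes the identity encoder to discard demographic cues.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Identity features should not be predictive of, e.g., gender.
identity_feat = torch.randn(16, 256, requires_grad=True)
gender_head = nn.Linear(256, 2)
adv_logits = gender_head(grad_reverse(identity_feat))
```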



Paperid:1306
Authors:Olga Veksler
Title: Regularized Loss for Weakly Supervised Single Class Semantic Segmentation
Abstract:
Fully supervised semantic segmentation is highly successful, but obtaining dense ground truth is expensive. Thus there is an increasing interest in weakly supervised approaches. We propose a new weakly supervised method for training CNNs to segment an object of a single class of interest. Instead of ground truth, we guide training with a regularized loss function. The regularized loss models prior knowledge about likely object shape properties and thus guides segmentation towards more plausible shapes. Training CNNs with a regularized loss is difficult. We develop an annealing strategy that is crucial for successful training. The advantage of our method is simplicity: we use standard CNN architectures and an intuitive and computationally efficient loss function. Furthermore, we apply the same loss function for any task/dataset, without any tailoring. We first evaluate our approach for salient object segmentation and co-segmentation. These tasks naturally involve one object class of interest. In some cases, our results are only a few points of the standard performance measure behind those obtained by training the same CNN with full supervision, and state-of-the-art results in the weakly supervised setting. Then we adapt our approach to weakly supervised multi-class semantic segmentation and obtain state-of-the-art results."



Paperid:1307
Authors:Chankyu Lee, Adarsh Kumar Kosta, Alex Zihao Zhu, Kenneth Chaney, Kostas Daniilidis, Kaushik Roy
Title: Spike-FlowNet: Event-based Optical Flow Estimation with Energy-Efficient Hybrid Neural Networks
Abstract:
Event-based cameras display great potential for a variety of tasks such as high-speed motion detection and navigation in low-light environments where conventional frame-based cameras suffer critically. This is attributed to their high temporal resolution, high dynamic range, and low-power consumption. However, conventional computer vision methods as well as deep Analog Neural Networks (ANNs) are not suited to work well with the asynchronous and discrete nature of event camera outputs. Spiking Neural Networks (SNNs) serve as ideal paradigms to handle event camera outputs, but deep SNNs suffer in terms of performance due to the spike vanishing phenomenon. To overcome these issues, we present Spike-FlowNet, a deep hybrid neural network architecture integrating SNNs and ANNs for efficiently estimating optical flow from sparse event camera outputs without sacrificing the performance. The network is end-to-end trained with self-supervised learning on Multi-Vehicle Stereo Event Camera (MVSEC) dataset. Spike-FlowNet outperforms its corresponding ANN-based method in terms of the optical flow prediction capability while providing significant computational efficiency."



Paperid:1308
Authors:Aditya Golatkar, Alessandro Achille, Stefano Soatto
Title: Forgetting Outside the Box: Scrubbing Deep Networks of Information Accessible from Input-Output Observations
Abstract:
We describe a procedure for removing dependency on a cohort of training data from a trained deep network that improves upon and generalizes previous methods to different readout functions, and can be extended to ensure forgetting in the final activations of the network. We introduce a new bound on how much information can be extracted per query about the forgotten cohort from a black-box network for which only the input-output behavior is observed. The proposed forgetting procedure has a deterministic part derived from the differential equations of a linearized version of the model, and a stochastic part that ensures information destruction by adding noise tailored to the geometry of the loss landscape. We exploit the connections between the final activations and weight dynamics of a DNN inspired by Neural Tangent Kernels to compute the information in the final activations."



Paperid:1309
Authors:Saima Sharmin, Nitin Rathi, Priyadarshini Panda, Kaushik Roy
Title: Inherent Adversarial Robustness of Deep Spiking Neural Networks: Effects of Discrete Input Encoding and Non-Linear Activations
Abstract:
In the recent quest for trustworthy neural networks, we present Spiking Neural Networks (SNNs) as a potential candidate for inherent robustness against adversarial attacks. In this work, we demonstrate that the adversarial accuracy of SNNs under gradient-based attacks is higher than that of their non-spiking counterparts for CIFAR datasets on deep VGG and ResNet architectures, particularly in the black-box attack scenario. We attribute this robustness to two fundamental characteristics of SNNs and analyze their effects. First, we exhibit that the input discretization introduced by the Poisson encoder improves adversarial robustness with a reduced number of timesteps. Second, we quantify the gain in adversarial accuracy with an increased leak rate in Leaky-Integrate-Fire (LIF) neurons. Our results suggest that SNNs trained with LIF neurons and a smaller number of timesteps are more robust than those with IF (Integrate-Fire) neurons and a larger number of timesteps. Finally, we overcome the bottleneck of creating gradient-based adversarial inputs in the temporal domain by proposing a technique for crafting such attacks directly from SNNs."
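
The two ingredients analyzed in the paper, Poisson rate encoding and the leak of LIF neurons, can each be sketched in a few lines. Timesteps, threshold and leak values below are arbitrary illustrations:

```python
# Poisson rate encoding of pixel intensities plus a leaky integrate-and-fire
# (LIF) neuron with a configurable leak.
import torch

def poisson_encode(image, timesteps):
    # image in [0, 1]; each timestep emits a spike with probability = intensity.
    return (torch.rand(timesteps, *image.shape) < image).float()

def lif_neuron(spike_input, leak=0.95, threshold=1.0):
    mem = torch.zeros(spike_input.shape[1:])
    out = []
    for x_t in spike_input:                 # iterate over timesteps
        mem = leak * mem + x_t              # leaky integration
        spike = (mem >= threshold).float()  # fire
        mem = mem * (1.0 - spike)           # reset fired neurons
        out.append(spike)
    return torch.stack(out)

img = torch.rand(1, 8, 8)
spikes = poisson_encode(img, timesteps=20)
out_spikes = lif_neuron(spikes)
```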



Paperid:1310
Authors:Baris Gecer, Alexandros Lattas, Stylianos Ploumpis, Jiankang Deng, Athanasios Papaioannou, Stylianos Moschoglou, Stefanos Zafeiriou
Title: Synthesizing Coupled 3D Face Modalities by Trunk-Branch Generative Adversarial Networks
Abstract:
Generating realistic 3D faces is of high importance for computer graphics and computer vision applications. Generally, research on 3D face generation revolves around linear statistical models of the facial surface. Nevertheless, these models cannot faithfully represent either the facial texture or the normals of the face, which are crucial for photo-realistic face synthesis. Recently, it was demonstrated that Generative Adversarial Networks (GANs) can be used for generating high-quality textures of faces. Nevertheless, the generation process either omits the geometry and normals, or independent processes are used to produce 3D shape information. In this paper, we present the first methodology that generates high-quality texture, shape, and normals jointly, which can be used for photo-realistic synthesis. To do so, we propose a novel GAN that can generate data from different modalities while exploiting their correlations. Furthermore, we demonstrate how we can condition the generation on the expression and create faces with various facial expressions. The qualitative results shown in this paper are compressed due to size limitations; full-resolution results and the accompanying video can be found in the supplementary materials."



Paperid:1311
Authors:Dídac Surís, Dave Epstein, Heng Ji, Shih-Fu Chang, Carl Vondrick
Title: Learning to Learn Words from Visual Scenes
Abstract:
Language acquisition is the process of learning words from the surrounding scene. We introduce a meta-learning framework that learns how to learn word representations from unconstrained scenes. We leverage the natural compositional structure of language to create training episodes that cause a meta-learner to learn strong policies for language acquisition. Experiments on two datasets show that our approach is able to more rapidly acquire novel words as well as more robustly generalize to unseen compositions, significantly outperforming established baselines. A key advantage of our approach is that it is data efficient, allowing representations to be learned from scratch without language pre-training. Visualizations and analysis suggest visual information helps our approach learn a rich cross-modal representation from minimal examples."



Paperid:1312
Authors:Mahdi S. Hosseini, Lyndon Chan, Weimin Huang, Yichen Wang, Danial Hasan, Corwyn Rowsell, Savvas Damaskinos, Konstantinos N. Plataniotis
Title: On Transferability of Histological Tissue Labels in Computational Pathology
Abstract:
Deep learning tools in computational pathology, unlike those for natural vision tasks, face limited histological tissue labels for classification. This is due to the expensive annotation procedure performed by expert pathologists. As a result, current models are limited to a particular diagnostic task, and the training workflow is repeated for different organ sites and diseases. In this paper, we explore the possibility of transferring diagnostically-relevant histology labels from a source domain into multiple target domains to classify similar tissue structures and cancer grades. We achieve this by training a Convolutional Neural Network (CNN) model on a source domain of diverse histological tissue labels for classification and then transferring it to different target domains for diagnosis without re-training/fine-tuning (zero-shot). We facilitate this with an efficient color augmentation to account for color disparity across different tissue scans and conduct thorough experiments for evaluation."
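
A simple form of color augmentation for stain and scanner variation, jitter in HSV space, is sketched below using PIL; the paper's exact augmentation may differ, and the jitter ranges are assumptions:

```python
# HSV color jitter as a stand-in for stain/scanner color augmentation.
import numpy as np
from PIL import Image, ImageEnhance

def color_jitter(img, rng, hue=0.05, sat=0.3, val=0.2):
    # img: PIL RGB image of a tissue tile.
    hsv = np.array(img.convert("HSV"), dtype=np.int16)
    hsv[..., 0] = (hsv[..., 0] + int(rng.uniform(-hue, hue) * 255)) % 256
    out = Image.fromarray(hsv.astype(np.uint8), mode="HSV").convert("RGB")
    out = ImageEnhance.Color(out).enhance(1.0 + rng.uniform(-sat, sat))
    out = ImageEnhance.Brightness(out).enhance(1.0 + rng.uniform(-val, val))
    return out

rng = np.random.default_rng(0)
tile = Image.fromarray((rng.random((64, 64, 3)) * 255).astype(np.uint8))
augmented = color_jitter(tile, rng)
```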



Paperid:1313
Authors:Dimitri Zhukov, Jean-Baptiste Alayrac, Ivan Laptev, Josef Sivic
Title: Learning Actionness via Long-range Temporal Order Verification
Abstract:
Current methods for action recognition typically rely on supervision provided by manual labeling. Such methods, however, do not scale well given the high burden of manual video annotation and a very large number of possible actions. The annotation is particularly difficult for temporal action localization where large parts of the video present no action, or background. To address these challenges, we here propose a self-supervised and generic method to isolate actions from their background. We build on the observation that actions often follow a particular temporal order and, hence, can be predicted by other actions in the same video. As consecutive actions might be separated by minutes, in contrast to prior work on the arrow of time, we here exploit long-range temporal relations in 10-20 minute long videos. To this end, we propose a new model that learns actionness via a self-supervised proxy task of order verification. The model assigns high actionness scores to clips whose order is easy to predict from other clips in the video. To obtain a powerful and action-agnostic model, we train it on the large-scale unlabeled HowTo100M dataset with highly diverse actions from instructional videos. We validate our method on the task of action localization and demonstrate consistent improvements when combined with other recent weakly-supervised methods."



Paperid:1314
Authors:Laurie Bose, Piotr Dudek, Jianing Chen, Stephen J. Carey, Walterio W. Mayol-Cuevas
Title: Fully Embedding Fast Convolutional Networks on Pixel Processor Arrays
Abstract:
We present a novel method of CNN inference for pixel processor array (PPA) vision sensors, designed to take advantage of their massive parallelism and analog compute capabilities. PPA sensors consist of an array of processing elements (PEs), with each PE capable of light capture, data storage and computation, allowing various computer vision processing to be executed directly upon the sensor device. The key idea behind our approach is storing network weights "in-pixel" within the PEs of the PPA sensor itself to allow various computations, such as multiple different image convolutions, to be carried out in parallel. Our approach can perform convolutional layers, max pooling, ReLU, and a final fully connected layer entirely upon the PPA sensor, while leaving no untapped computational resources. This is in contrast to previous works that only use sensor-level processing to sequentially compute image convolutions, and must transfer data to an external digital processor to complete the computation. We demonstrate our approach on the SCAMP-5 vision system, performing inference of an MNIST digit classification network at over 3000 frames per second and over 93% classification accuracy. This is the first work demonstrating CNN inference conducted entirely upon the processor array of a PPA vision sensor device, requiring no external processing."



Paperid:1315
Authors:Youngmin Baek, Seung Shin, Jeonghun Baek, Sungrae Park, Junyeop Lee , Daehyun Nam, Hwalsuk Lee
Title: Character Region Attention For Text Spotting
Abstract:
A scene text spotter is composed of text detection and recognition modules. Many studies have been conducted to unify these modules into an end-to-end trainable model to achieve better performance. A typical architecture places the detection and recognition modules in separate branches, and RoI pooling is commonly used to let the branches share a visual feature. However, there is still a chance of establishing a more complementary connection between the modules when adopting a recognizer that uses an attention-based decoder and a detector that represents spatial information of the character regions. This is possible since the two modules share a common sub-task, which is to find the location of the character regions. Based on this insight, we construct a tightly coupled single-pipeline model. This architecture is formed by utilizing detection outputs in the recognizer and propagating the recognition loss through the detection stage. The use of the character score map helps the recognizer attend better to the character center points, and the recognition loss propagation to the detector module enhances the localization of the character regions. Also, a strengthened sharing stage allows feature rectification and boundary localization of arbitrarily-shaped text regions. Extensive experiments demonstrate state-of-the-art performance on publicly available straight and curved benchmark datasets."



Paperid:1316
Authors:Anh-Huy Phan, Konstantin Sobolev, Konstantin Sozykin, Dmitry Ermilov , Julia Gusak, Petr Tichavský, Valeriy Glukhov, Ivan Oseledets, Andrzej Cichocki
Title: Stable Low-rank Tensor Decomposition for Compression of Convolutional Neural Network
Abstract:
Most state-of-the-art deep neural networks are overparameterized and exhibit a high computational cost. A straightforward approach to this problem is to replace convolutional kernels with their low-rank tensor approximations, where the Canonical Polyadic tensor decomposition is one of the most suitable models. However, fitting the convolutional tensors with numerical optimization algorithms often encounters diverging components, i.e., extremely large rank-one tensors that cancel each other. Such degeneracy often causes non-interpretable results and numerical instability during neural network fine-tuning. This paper is the first study on degeneracy in the tensor decomposition of convolutional kernels. We present a novel method which can stabilize the low-rank approximation of convolutional kernels and ensure efficient compression while preserving the high-quality performance of the neural networks. We evaluate our approach on popular CNN architectures for image classification and show that our method results in much lower accuracy degradation and provides consistent performance."



Paperid:1317
Authors:Yuan Wu, Diana Inkpen, Ahmed El-Roby
Title: Dual Mixup Regularized Learning for Adversarial Domain Adaptation
Abstract:
Recent advances in unsupervised domain adaptation (UDA) rely on adversarial learning to disentangle the explanatory and transferable features for domain adaptation. However, there are two issues with existing methods. First, the discriminability of the latent space cannot be fully guaranteed without considering the class-aware information in the target domain. Second, samples from the source and target domains alone are not sufficient for domain-invariant feature extraction in the latent space. To alleviate these issues, we propose a dual mixup regularized learning (DMRL) method for UDA, which not only guides the classifier toward consistent predictions in-between samples, but also enriches the intrinsic structures of the latent space. DMRL jointly conducts category and domain mixup regularization at the pixel level to improve the effectiveness of models. A series of empirical studies on four domain adaptation benchmarks demonstrate that our approach can achieve state-of-the-art performance."
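
A minimal Python sketch of the pixel-level category and domain mixup regularizers described above, assuming a generic PyTorch setup; the function names and the equal source/target batch sizes are assumptions for illustration, not the authors' code:

import torch
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0):
    # Convexly combine two batches of images and their (soft) labels.
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

def dual_mixup_batch(xs, ys_onehot, xt, alpha=1.0):
    # Category mixup: interpolate labeled source images and their one-hot labels.
    perm = torch.randperm(xs.size(0))
    x_cat, y_cat = mixup(xs, ys_onehot, xs[perm], ys_onehot[perm], alpha)
    # Domain mixup: interpolate source and target images with domain labels 1 and 0
    # (assumes xs and xt have the same batch size).
    d_src = torch.ones(xs.size(0), 1)
    d_tgt = torch.zeros(xt.size(0), 1)
    x_dom, d_dom = mixup(xs, d_src, xt, d_tgt, alpha)
    return (x_cat, y_cat), (x_dom, d_dom)

The mixed batches would then feed the classifier and the domain discriminator respectively, alongside the usual adversarial UDA losses.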



Paperid:1318
Authors:Jiaming Song, Yann Dauphin, Michael Auli, Tengyu Ma
Title: Robust and On-the-fly Dataset Denoising for Image Classification
Abstract:
Memorization in over-parameterized neural networks can severely hurt generalization in the presence of mislabeled examples. However, mislabeled examples are hard to avoid in extremely large datasets collected with weak supervision. We address this problem by reasoning counterfactually about the loss distribution of examples with uniform random labels had they been trained with the real examples, and use this information to remove noisy examples from the training set. First, we observe that examples with uniform random labels have higher losses when trained with stochastic gradient descent under large learning rates. Then, we propose to model the loss distribution of the counterfactual examples using only the network parameters, which models such examples with remarkable success. Finally, we propose to remove examples whose loss exceeds a certain quantile of the modeled loss distribution. This leads to our approach, On-the-fly Data Denoising (ODD), a simple yet effective algorithm that is robust to mislabeled examples while introducing almost zero computational overhead compared to standard training. ODD is able to achieve state-of-the-art results on a wide range of datasets, including real-world ones such as WebVision and Clothing1M."
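
A small sketch of the quantile-based filtering step in the spirit of ODD (illustrative only, not the authors' implementation; the synthetic losses below simply stand in for real training losses):

import numpy as np

def filter_noisy_examples(example_losses, random_label_losses, quantile=0.1):
    # Keep examples whose loss stays below the chosen lower quantile of the
    # loss distribution modeled for uniformly-random labels.
    threshold = np.quantile(random_label_losses, quantile)
    return example_losses < threshold

# Usage with synthetic losses standing in for a real training run:
rng = np.random.default_rng(0)
clean = rng.normal(0.5, 0.2, size=900)           # clean examples: low loss
noisy = rng.normal(2.3, 0.3, size=100)           # mislabeled examples: high loss
losses = np.concatenate([clean, noisy])
random_losses = rng.normal(2.3, 0.3, size=1000)  # stand-in for the modeled distribution
mask = filter_noisy_examples(losses, random_losses, quantile=0.1)
print(mask.sum(), "examples kept out of", len(mask))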



Paperid:1319
Authors:Connor Henley, Tomohiro Maeda, Tristan Swedish, Ramesh Raskar
Title: Imaging Behind Occluders Using Two-Bounce Light
Abstract:
We introduce the new non-line-of-sight imaging problem of imaging behind an occluder. The behind-an-occluder problem can be solved if the hidden space is flanked by opposing visible surfaces. We illuminate one surface and observe light that scatters off of the opposing surface after traveling through the hidden space. Hidden objects attenuate light that passes through the hidden space, leaving an observable signature that can be used to reconstruct their shape. Our method is experimentally simple--we use an eye-safe laser pointer as a light source, and off-the-shelf RGB or RGB-D cameras to estimate the geometry of relay surfaces and observe two-bounce light. We analyze the photometric and geometric challenges of this new imaging problem, and develop a robust method that produces high-quality 3D reconstructions in uncontrolled settings where relay surfaces may be non-planar."



Paperid:1320
Authors:Yandong Li, Di Huang, Danfeng Qin, Liqiang Wang, Boqing Gong
Title: Improving Object Detection with Selective Self-Supervised Self-Training
Abstract:
We study how to leverage Web images to augment human-curated object detection datasets. Our approach is two-pronged. On the one hand, we retrieve Web images by image-to-image search, which incurs less domain shift from the curated data than other search methods. The Web images are diverse, supplying a wide variety of object poses, appearances, their interactions with the context, etc. On the other hand, we propose a novel learning method motivated by two parallel lines of work that explore unlabeled data for image classification: self-training and self-supervised learning. They fail to improve object detectors in their vanilla forms due to the domain gap between the Web images and curated datasets. To tackle this challenge, we propose a selective net to rectify the supervision signals in Web images. It not only identifies positive bounding boxes but also creates a safe zone for mining hard negative boxes. We report state-of-the-art results on detecting backpacks and chairs from everyday scenes, along with other challenging object classes. "



Paperid:1321
Authors:Rohan Chabra, Jan E. Lenssen, Eddy Ilg, Tanner Schmidt, Julian Straub, Steven Lovegrove, Richard Newcombe
Title: Deep Local Shapes: Learning Local SDF Priors for Detailed 3D Reconstruction
Abstract:
Efficiently reconstructing complex and intricate surfaces at scale is a long-standing goal in machine perception. To address this problem we introduce Deep Local Shapes (DeepLS), a deep shape representation that enables high-quality 3D shape representation without prohibitive memory requirements. DeepLS replaces the dense volumetric signed distance function (SDF) representation used in traditional surface reconstruction systems with a learned continuous SDF defined by a neural network, inspired by recent work such as DeepSDF. Unlike DeepSDF, which represents an object-level SDF with a neural network and a single latent code, we store a grid of independent latent codes, each responsible for storing information about surfaces in a small local neighborhood. This decomposition of scenes into local shapes simplifies the prior distribution that the network must learn, and also enables efficient inference. We enforce global shape consistency implicitly by fitting local shape codes from a receptive field that overlaps with neighboring voxels. We demonstrate the effectiveness and generalization power of DeepLS by showing object shape encoding and reconstructions of full scenes, where DeepLS delivers high compression, accuracy, and local shape completion."



Paperid:1322
Authors:Aditya Sanghi
Title: Info3D: Representation Learning on 3D Objects using Mutual Information Maximization and Contrastive Learning
Abstract:
A major endeavor of computer vision is to represent, understand and extract structure from 3D data. Towards this goal, unsupervised learning is a powerful and necessary tool. Most current unsupervised methods for 3D shape analysis use datasets that are aligned, require objects to be reconstructed, and suffer from deteriorated performance on downstream tasks. To solve these issues we propose to extend the InfoMax and contrastive learning principles to 3D shapes. We show that we can maximize the mutual information between 3D objects and their "chunks" to improve the representations in aligned datasets. Furthermore, we can achieve rotation invariance under the SO(3) group by maximizing the mutual information between the 3D objects and their geometrically transformed versions. Finally, we conduct several experiments such as clustering, transfer learning, and shape retrieval, and achieve state-of-the-art results."
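
A minimal InfoNCE-style contrastive objective between embeddings of whole 3D objects and embeddings of their chunks or rotated versions, assuming some point cloud encoder produces the two embedding batches; this is an illustrative sketch of the contrastive principle, not the paper's exact loss:

import torch
import torch.nn.functional as F

def info_nce(z_obj, z_view, temperature=0.07):
    # z_obj, z_view: (B, D) embeddings of objects and their positive views
    # (e.g. a chunk or an SO(3)-rotated copy of the same object).
    z_obj = F.normalize(z_obj, dim=1)
    z_view = F.normalize(z_view, dim=1)
    logits = z_obj @ z_view.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(z_obj.size(0))       # positives sit on the diagonal
    return F.cross_entropy(logits, targets)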



Paperid:1323
Authors:Sahin Olut, Zhengyang Shen, Zhenlin Xu, Samuel Gerber, Marc Niethammer
Title: Adversarial Data Augmentation via Deformation Statistics
Abstract:
Deep learning models have been successful in computer vision and medical image analysis. However, training these models frequently requires large labeled image sets whose creation is often very time and labor intensive, for example, in the context of 3D segmentations. Approaches capable of training deep segmentation networks with a limited number of labeled samples are therefore highly desirable. Data augmentation or semi-supervised approaches are commonly used to cope with limited labeled training data. However, the augmentation strategies for many existing approaches are either hand-engineered or require computationally demanding searches. To that end, we explore an augmentation strategy which builds statistical deformation models from unlabeled data via principal component analysis and uses the resulting statistical deformation space to augment the labeled training samples. Specifically, we obtain transformations via deep registration models. This allows for an intuitive control over plausible deformation magnitudes via the statistical model and, if combined with an appropriate deformation model, yields spatially regular transformations. To optimally augment a dataset we use an adversarial strategy integrated into our statistical deformation model. We demonstrate the effectiveness of our approach for the segmentation of knee cartilage from 3D magnetic resonance images. We show favorable performance to state-of-the-art augmentation approaches."
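
An illustrative sketch (under the assumption that displacement fields from a registration model are available) of building a PCA-based statistical deformation model and sampling from it to warp labeled images; this is not the authors' code, and the adversarial selection of coefficients is omitted:

import numpy as np
from scipy.ndimage import map_coordinates

def fit_deformation_pca(displacements, n_components=10):
    # displacements: (N, H, W, 2) displacement fields from deep registration.
    N = displacements.shape[0]
    X = displacements.reshape(N, -1)
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    std = S[:n_components] / np.sqrt(max(N - 1, 1))   # per-mode standard deviations
    return mean, Vt[:n_components], std

def sample_deformation(mean, components, std, shape, magnitude=1.0):
    # Draw PCA coefficients and reconstruct a plausible displacement field.
    coeffs = magnitude * std * np.random.randn(len(std))
    field = mean + coeffs @ components
    return field.reshape(shape)                        # (H, W, 2)

def warp_image(image, field):
    # Apply the sampled displacement field to a 2D image (bilinear interpolation).
    H, W = image.shape
    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    coords = np.stack([yy + field[..., 0], xx + field[..., 1]])
    return map_coordinates(image, coords, order=1, mode="nearest")

The same warp would be applied to the corresponding segmentation labels so that augmented image/label pairs stay consistent.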



Paperid:1324
Authors:Wei Wen, Hanxiao Liu, Yiran Chen, Hai Li, Gabriel Bender, Pieter-Jan Kindermans
Title: Neural Predictor for Neural Architecture Search
Abstract:
Neural Architecture Search methods are effective but often use complex algorithms to come up with the best architecture. We propose an approach with three basic steps that is conceptually much simpler. First, we train $N$ random architectures to generate $N$ (architecture, validation accuracy) pairs and use them to train a regression model that predicts accuracies for architectures. Next, we use this regression model to predict the validation accuracies of a large number of random architectures. Finally, we train the top-$K$ predicted architectures and deploy the model with the best validation result. While this approach seems simple, it is more than $20\times$ as sample efficient as Regularized Evolution on the NAS-Bench-101 benchmark. On ImageNet, it approaches the efficiency of more complex and restrictive approaches based on weight sharing such as ProxylessNAS, while being fully (embarrassingly) parallelizable and friendly to hyper-parameter tuning."
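
A sketch of the three-step predictor-based search described above. The callbacks sample_random_architecture, encode, and train_and_eval are placeholders the user must supply, and the choice of regressor here is illustrative rather than the paper's predictor:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def neural_predictor_search(sample_random_architecture, encode, train_and_eval,
                            n_train=100, n_candidates=10000, top_k=10):
    # Step 1: train N random architectures to build (architecture, accuracy) pairs.
    archs = [sample_random_architecture() for _ in range(n_train)]
    X = np.stack([encode(a) for a in archs])
    y = np.array([train_and_eval(a) for a in archs])

    # Step 2: fit a regression model that predicts validation accuracy.
    predictor = GradientBoostingRegressor().fit(X, y)

    # Step 3: rank a large pool of random architectures by predicted accuracy
    # and fully train only the top-K, deploying the best one.
    pool = [sample_random_architecture() for _ in range(n_candidates)]
    preds = predictor.predict(np.stack([encode(a) for a in pool]))
    top = [pool[i] for i in np.argsort(preds)[::-1][:top_k]]
    return max(top, key=train_and_eval)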



Paperid:1325
Authors:Shivam Kalra, Mohammed Adnan, Graham Taylor, H.R. Tizhoosh
Title: Learning Permutation Invariant Representations using Memory Networks
Abstract:
Many real-world tasks, such as classification of digital histopathological images and 3D object detection, involve learning from a set of instances. In these cases, only a group of instances, or a set, collectively contains meaningful information; therefore only the sets have labels, and not the individual data instances. In this work, we present a permutation invariant neural network called the Memory-based Exchangeable Model (MEM) for learning universal set functions. The MEM model consists of memory units that embed an input sequence into high-level features, enabling it to learn inter-dependencies among instances through a self-attention mechanism. We evaluated the learning ability of MEM on various toy datasets, point cloud classification, and classification of whole slide images (WSIs) into two subtypes of lung cancer: Lung Adenocarcinoma and Lung Squamous Cell Carcinoma. We systematically extracted patches from WSIs of the lung, downloaded from The Cancer Genome Atlas (TCGA) dataset, the largest public repository of WSIs, achieving a competitive accuracy of 84.84% for classification of the two sub-types of lung cancer. The results on other datasets are promising as well, and demonstrate the efficacy of our model. Keywords: Permutation Invariant Models, Multi Instance Learning, Whole Slide Image Classification, Medical Images"



Paperid:1326
Authors:Peng Chu, Xiao Bian, Shaopeng Liu, Haibin Ling
Title: Feature Space Augmentation for Long-Tailed Data
Abstract:
Real-world data often follow a long-tailed distribution, as the frequency of each class is typically different. For example, a dataset can have a large number of under-represented classes and a few classes with more than sufficient data. However, a model representing the dataset is usually expected to have reasonably homogeneous performance across classes. Introducing class-balanced losses and advanced methods for data re-sampling and augmentation are among the best practices to alleviate the data imbalance problem. However, the other part of the problem, concerning the under-represented classes, has to rely on additional knowledge to recover the missing information. In this work, we present a novel approach to address the long-tailed problem by augmenting the under-represented classes in the feature space with the features learned from the classes with ample samples. In particular, we decompose the features of each class into a class-generic component and a class-specific component using class activation maps. Novel samples of under-represented classes are then generated on the fly during training by fusing the class-specific features from the under-represented classes with the class-generic features from confusing classes. Our results on different datasets, such as iNaturalist and a long-tailed version of CIFAR, show superior performance compared to common practice and the state of the art."
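
An illustrative sketch of the feature-space fusion idea (assumptions: a CAM-style mask derived from the final classifier weights splits a feature map into class-specific and class-generic regions; the threshold and fusion rule are simplifications, not the authors' exact scheme):

import torch

def cam_mask(feature_map, classifier_weights, cls, threshold=0.5):
    # feature_map: (C, H, W); classifier_weights: (num_classes, C).
    cam = torch.einsum("c,chw->hw", classifier_weights[cls], feature_map)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return (cam > threshold).float()              # 1 = class-specific region

def fuse_features(tail_feat, tail_cls, head_feat, head_cls, W, threshold=0.5):
    m_tail = cam_mask(tail_feat, W, tail_cls, threshold)
    m_head = cam_mask(head_feat, W, head_cls, threshold)
    class_specific = tail_feat * m_tail           # keep tail-class evidence
    class_generic = head_feat * (1.0 - m_head)    # borrow generic context from a head class
    return class_specific + class_generic         # synthetic tail-class feature sample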



Paperid:1327
Authors:Samuel S. Sohn, Honglu Zhou, Seonghyeon Moon, Sejong Yoon, Vladimir Pavlovic, Mubbasir Kapadia
Title: Laying the Foundations of Deep Long-Term Crowd Flow Prediction
Abstract:
Predicting crowd behavior in complex environments is a key requirement for crowd and disaster management, architectural design, and urban planning. Given a crowd's immediate state, current approaches must be successively repeated over multiple time-steps for long-term predictions, leading to computationally expensive and error-prone results. However, most applications require the ability to accurately predict hundreds of possible simulation outcomes (e.g., under different environment and crowd situations) at real-time rates, for which these approaches are prohibitively expensive. We propose the first deep framework to instantly predict the long-term flow of crowds in arbitrarily large, realistic environments. Central to our approach are a novel representation, CAGE, which efficiently encodes crowd scenarios into compact, fixed-size representations that losslessly represent the environment, and a modified SegNet architecture for instant long-term crowd flow prediction. We conduct comprehensive experiments on novel synthetic and real datasets. Our results indicate that our approach is able to capture the essence of real crowd movement over very long time periods, while generalizing to never-before-seen environments and crowd contexts."



Paperid:1328
Authors:Zhekun Luo, Devin Guillory, Baifeng Shi, Wei Ke, Fang Wan, Trevor Darrell, Huijuan Xu
Title: Weakly-Supervised Action Localization with Expectation-Maximization Multi-Instance Learning
Abstract:
Weakly-supervised action localization requires training a model to localize the action segments in a video given only the video-level action label. It can be solved under the Multiple Instance Learning (MIL) framework, where a bag (video) contains multiple instances (action segments). Since only the bag's label is known, the main challenge is identifying which key instances within the bag trigger the bag's label. Most previous models use attention-based approaches, applying attention to generate the bag's representation from instances and then training it via the bag's classification. These models, however, implicitly violate the MIL assumption that instances in negative bags should be uniformly negative. In this work, we explicitly model the key instance assignment as a hidden variable and adopt an Expectation-Maximization (EM) framework. We derive two pseudo-label generation schemes to model the E and M steps and iteratively optimize the likelihood lower bound. We show that our EM-MIL approach more accurately models both the learning objective and the MIL assumptions. It achieves state-of-the-art performance on two standard benchmarks, THUMOS14 and ActivityNet1.2."
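
A simplified EM-style pseudo-labeling loop for this setting (an illustrative sketch under the assumption of a per-clip classifier and a simple top-scoring selection rule; not the authors' exact pseudo-label schemes):

import torch
import torch.nn.functional as F

def e_step(clip_scores, video_label, top_ratio=0.125):
    # Assign pseudo labels: top-scoring clips in a positive video become
    # positives; every clip in a negative video stays negative.
    T = clip_scores.size(0)
    pseudo = torch.zeros(T)
    if video_label == 1:
        k = max(1, int(T * top_ratio))
        topk = torch.topk(clip_scores, k).indices
        pseudo[topk] = 1.0
    return pseudo

def m_step(model, optimizer, clip_features, pseudo_labels):
    # Update the clip classifier against the current pseudo labels.
    optimizer.zero_grad()
    scores = model(clip_features).squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(scores, pseudo_labels)
    loss.backward()
    optimizer.step()
    return loss.item()

Alternating the two steps over the training set iteratively tightens a lower bound on the likelihood, which is the structure the EM-MIL formulation exploits.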



Paperid:1329
Authors:Mhd Hasan Sarhan, Nassir Navab, Abouzar Eslami, Shadi Albarqouni
Title: Fairness by Learning Orthogonal Disentangled Representations
Abstract:
Learning discriminative and powerful representations is a crucial step for machine learning systems. Introducing invariance to arbitrary nuisance or sensitive attributes while performing well on specific tasks is an important problem in representation learning. This is mostly approached by purging the sensitive information from learned representations. In this paper, we propose a novel disentanglement approach to the invariant representation problem. We disentangle the meaningful and sensitive representations by enforcing orthogonality constraints as a proxy for independence. We explicitly enforce the meaningful representation to be agnostic to sensitive information by entropy maximization. The proposed approach is evaluated on five publicly available datasets and compared with state-of-the-art methods for learning fairness and invariance, achieving state-of-the-art performance on three datasets and comparable performance on the rest. Further, we perform an ablation study to evaluate the effect of each component."
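
A sketch of the two regularizers described above (illustrative; the exact form of the penalties and their weighting are assumptions): an orthogonality penalty between the target and sensitive representations, and entropy maximization so the target representation is uninformative about the sensitive attribute:

import torch
import torch.nn.functional as F

def orthogonality_loss(z_target, z_sensitive):
    # Penalize cosine alignment between the two (B, D) representations.
    zt = F.normalize(z_target, dim=1)
    zs = F.normalize(z_sensitive, dim=1)
    return (zt * zs).sum(dim=1).pow(2).mean()

def entropy_maximization_loss(sensitive_logits_from_target):
    # Negative entropy of p(sensitive | z_target); minimizing it pushes the
    # prediction toward uniform, i.e. z_target carries no sensitive information.
    p = F.softmax(sensitive_logits_from_target, dim=1)
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    return -entropy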



Paperid:1330
Authors:Cheng Ouyang, Carlo Biffi, Chen Chen, Turkay Kart, Huaqi Qiu, Daniel Rueckert
Title: Self-supervision with Superpixels: Training Few-shot Medical Image Segmentation without Annotation
Abstract:
Few-shot semantic segmentation (FSS) has great potential for medical imaging applications. Most of the existing FSS techniques require abundant annotated semantic classes for training. However, these methods may not be applicable for medical images due to the lack of annotations. To address this problem we make several contributions: (1) A novel self-supervised FSS framework for medical images in order to eliminate the requirement for annotations during training. Additionally, superpixel-based pseudo-labels are generated to provide supervision; (2) An adaptive local prototype pooling module plugged into prototypical networks, to solve the common challenging foreground-background imbalance problem in medical image segmentation; (3) We demonstrate the general applicability of the proposed approach for medical images using three different tasks: abdominal organ segmentation for CT and MRI, as well as cardiac segmentation for MRI. Our results show that, for medical image segmentation, the proposed method outperforms conventional FSS methods which require manual annotations for training."



Paperid:1331
Authors:He Zhao, Richard P. Wildes
Title: On Diverse Asynchronous Activity Anticipation
Abstract:
We investigate the joint anticipation of long-term activity labels and their corresponding times with the aim of improving both the naturalness and diversity of predictions. We address these matters using Conditional Adversarial Generative Networks for Discrete Sequences. Central to our approach is a reexamination of the unavoidable sample quality vs. diversity tradeoff of the recently emerged Gumbel-Softmax relaxation based GAN on discrete data. In particular, we ameliorate this trade-off with a simple but effective sample distance regularizer. Moreover, we provide a unified approach to inference of activity labels and their times so that a single integrated optimization succeeds for both. With this novel approach in hand, we demonstrate the effectiveness of the resulting discrete sequential GAN on multimodal activity anticipation. We evaluate the approach on three standard datasets and show that it outperforms previous approaches in terms of both accuracy and diversity, thereby yielding a new state-of-the-art in activity anticipation."



Paperid:1332
Authors:Razieh Kaviani Baghbaderani, Ying Qu, Hairong Qi, Craig Stutts
Title: Representative-Discriminative Learning for Open-set Land Cover Classification of Satellite Imagery
Abstract:
Land cover classification of satellite imagery is an important step toward analyzing the Earth's surface. Existing models assume a closed-set setting where both the training and testing classes belong to the same label set. However, due to the unique characteristics of satellite imagery with an extremely vast area of versatile cover materials, the training data are bound to be non-representative. In this paper, we study the problem of open-set land cover classification that identifies the samples belonging to unknown classes during testing, while maintaining performance on known classes. Although inherently a classification problem, both representative and discriminative aspects of data need to be exploited in order to better distinguish unknown classes from known. We propose a representative-discriminative open-set recognition (RDOSR) framework, which 1) projects data from the raw image space to the embedding feature space that facilitates differentiating similar classes, and further 2) enhances both the representative and discriminative capacity through transformation to a so-called abundance space. Experiments on multiple satellite benchmarks demonstrate the effectiveness of the proposed method. We also show the generality of the proposed approach by achieving promising results on open-set classification tasks using RGB images."



Paperid:1333
Authors:Ping Yu, Yang Zhao, Chunyuan Li, Junsong Yuan, Changyou Chen
Title: Structure-Aware Human-Action Generation
Abstract:
Generating long-range skeleton-based human actions has been a challenging problem, since small deviations in one frame can cause a malformed action sequence. Most existing methods borrow ideas from video generation and naively treat skeleton nodes/joints as pixels of images without considering the rich inter-frame and intra-frame structure information, leading to potentially distorted actions. Graph convolutional networks (GCNs) can leverage structure information to learn structure representations. However, adopting GCNs to tackle such continuous action sequences in both spatial and temporal space is challenging, as the action graph could be huge. To overcome this challenge, we propose a variant of GCNs that leverages the self-attention mechanism to prune a complete action graph in the temporal space. Our method can dynamically attend to past important frames and construct a sparse graph to apply in the GCN framework, well capturing the structure information in action sequences. Extensive experimental results demonstrate the superiority of our method on two standard human action datasets compared with existing methods."



Paperid:1334
Authors:Niamul Quader, Juwei Lu, Peng Dai, Wei Li
Title: Towards Efficient Coarse-to-Fine Networks for Action and Gesture Recognition
Abstract:
State-of-the-art approaches to video-based action and gesture recognition often employ two key concepts: First, they employ multistream processing; second, they use an ensemble of convolutional networks. We improve and extend both aspects. First, we systematically yield enhanced receptive fields for complementary feature extraction via coarse-to-fine decomposition of input imagery along the spatial and temporal dimensions, and adaptively focus on training important feature pathways using a reparameterized fully connected layer. Second, we develop a `use when needed' scheme with a `coarse-exit' strategy that allows selective use of expensive high-resolution processing in a data-dependent fashion to retain accuracy while reducing computation cost. Our C2F learning approach builds ensemble networks that outperform most competing methods in terms of both reduced computation cost and improved accuracy on the Something-Something V1, V2, and Jester datasets, while also remaining competitive on the Kinetics-400 dataset. Uniquely, our C2F ensemble networks can operate at varying computation budget constraints."



Paperid:1335
Authors:Bin Cheng, Inderjot Singh Saggu, Raunak Shah, Gaurav Bansal, Dinesh Bharadia
Title: S³Net: Semantic-Aware Self-supervised Depth Estimation with Monocular Videos and Synthetic Data
Abstract:
Solving depth estimation with monocular cameras enables the possibility of the widespread use of cameras as low-cost depth estimation sensors in applications such as autonomous driving and robotics. In order to learn such a scalable depth estimation model, we require large amounts of data and labels targeted towards specific use-cases. Acquiring these labels is expensive and often requires a calibrated research platform to collect data, which can be infeasible, especially in situations where the terrain is unknown. There are two popular approaches that do not require annotated depth maps: (i) using labelled synthetic and unlabeled real data in an adversarial framework to predict more accurate depth, and (ii) unsupervised models which exploit geometric structure across space and time in monocular video frames. Ideally, we would like to leverage the features provided by both approaches, as they complement each other; however, existing methods do not adequately exploit these additive benefits. We present a self-supervised framework which combines these complementary features: we use synthetic as well as real images for training while exploiting geometric and temporal constraints. Our novel consolidated architecture performs better than existing state-of-the-art literature. We present a unique way to train this self-supervised framework, and achieve over 15% improvement over previous supervised approaches with domain adaptation and 10% over previous self-supervised approaches."



Paperid:1336
Authors:Maunil R Vyas, Hemanth Venkateswara, Sethuraman Panchanathan
Title: Leveraging Seen and Unseen Semantic Relationships for Generative Zero-Shot Learning
Abstract:
Zero-shot learning (ZSL) addresses the unseen class recognition problem by leveraging semantic information to transfer knowledge from seen classes to unseen classes. Generative models synthesize the unseen visual features and convert ZSL into a classical supervised learning problem. These generative models are trained using the seen classes and are expected to implicitly transfer the knowledge from seen to unseen classes. However, their performance is stymied by overfitting, which leads to substandard performance on Generalized Zero-Shot Learning (GZSL). To address this concern, we propose the novel LsrGAN, a generative model that Leverages the Semantic Relationship between seen and unseen categories and explicitly performs knowledge transfer by incorporating a novel Semantic Regularized Loss (SR-Loss). The SR-Loss guides the LsrGAN to generate visual features that mirror the semantic relationships between seen and unseen classes. Experiments on seven benchmark datasets, including the challenging Wikipedia text-based CUB and NABirds splits and the attribute-based AWA, CUB, and SUN, demonstrate the superiority of the LsrGAN compared to previous state-of-the-art approaches under both ZSL and GZSL. Our code will be released after the review period."



Paperid:1337
Authors:Niamul Quader, Md Mafijul Islam Bhuiyan, Juwei Lu, Peng Dai, Wei Li
Title: Weight Excitation: Built-in Attention Mechanisms in Convolutional Neural Networks
Abstract:
We propose novel approaches for simultaneously identifying the important weights of a convolutional neural network (ConvNet) and providing more attention to the important weights during training. More formally, we identify two characteristics of a weight, its magnitude and its location, which can be linked with the importance of the weight. By targeting these characteristics of a weight during training, we develop two separate weight excitation (WE) mechanisms via weight reparameterization-based backpropagation modifications. We demonstrate significant improvements over popular baseline ConvNets on multiple computer vision applications using WE (e.g., 1.3% accuracy improvement over a ResNet50 baseline on ImageNet image classification). These improvements come at no extra computational cost or ConvNet structural change during inference. Additionally, including WE methods in a convolution block is straightforward, requiring only a few lines of extra code. Lastly, WE mechanisms can provide complementary benefits when used with external attention mechanisms such as the popular Squeeze-and-Excitation attention block."
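
An illustrative weight-reparameterization wrapper in the spirit of magnitude-based weight excitation; the specific excitation function below (a sigmoid over normalized magnitudes) is an assumption for the sketch, not the authors' exact formulation:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ExcitedConv2d(nn.Conv2d):
    def forward(self, x):
        w = self.weight
        # Normalize weight magnitudes per output channel and map them through a
        # sigmoid to obtain per-weight attention in (0, 1); backprop through this
        # reparameterization gives more gradient attention to important weights.
        mag = w.abs()
        norm = mag / (mag.amax(dim=(1, 2, 3), keepdim=True) + 1e-8)
        attention = torch.sigmoid(4.0 * (norm - 0.5))
        return F.conv2d(x, w * attention, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)

# Drop-in usage: replace nn.Conv2d(...) with ExcitedConv2d(...) in a ConvNet block.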



Paperid:1338
Authors:Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu
Title: UNITER: UNiversal Image-TExt Representation Learning
Abstract:
Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$."



Paperid:1339
Authors:Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, Jianfeng Gao
Title: Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
Abstract:
Large-scale pre-training methods of learning cross-modal representations on image-text pairs are becoming popular for vision-language tasks. While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute-force manner, in this paper we propose a new learning method, Oscar, which uses object tags detected in images as anchor points to significantly ease the learning of alignments. Our method is motivated by the observation that the salient objects in an image can be accurately detected, and are often mentioned in the paired text. We pre-train an Oscar model on the public corpus of 6.5 million text-image pairs, and fine-tune it on downstream tasks, creating new state-of-the-art results on six well-established vision-language understanding and generation tasks."



Paperid:1340
Authors:Yuge Huang, Pengcheng Shen, Ying Tai, Shaoxin Li, Xiaoming Liu, Jilin Li, Feiyue Huang, Rongrong Ji
Title: Improving Face Recognition from Hard Samples via Distribution Distillation Loss
Abstract:
Large facial variations are the main challenge in face recognition. To this end, previous variation-specific methods make full use of task-related priors to design special network losses, which are typically not general across different tasks and scenarios. In contrast, existing generic methods focus on improving the feature discriminability to minimize the intra-class distance while maximizing the inter-class distance, which performs well on easy samples but fails on hard samples. To improve the performance on those hard samples for general tasks, we propose a novel Distribution Distillation Loss to narrow the performance gap between easy and hard samples, which is simple, effective and generic for various types of facial variations. Specifically, we first adopt state-of-the-art classifiers such as ArcFace to construct two similarity distributions: a teacher distribution from easy samples and a student distribution from hard samples. Then, we propose a novel distribution-driven loss to constrain the student distribution to approximate the teacher distribution, which thus leads to a smaller overlap between the positive and negative pairs in the student distribution. We have conducted extensive experiments on both generic large-scale face benchmarks and benchmarks with diverse variations in race, resolution and pose. The quantitative results demonstrate the superiority of our method over strong baselines, e.g., ArcFace and CosFace."
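
A sketch of constraining the hard-sample (student) similarity distributions toward the easy-sample (teacher) ones. The Gaussian approximation and KL form used here are simplifying assumptions for illustration, not the paper's exact loss:

import torch

def gaussian_kl(mu_q, var_q, mu_p, var_p):
    # KL divergence between two univariate Gaussians N(mu_q, var_q) || N(mu_p, var_p).
    return 0.5 * (torch.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def distribution_distillation(student_pos, student_neg, teacher_pos, teacher_neg):
    # Each argument is a 1-D tensor of cosine similarities for positive or
    # negative face pairs; pull the student distributions toward the (detached)
    # teacher distributions.
    loss = 0.0
    for s, t in [(student_pos, teacher_pos), (student_neg, teacher_neg)]:
        loss = loss + gaussian_kl(s.mean(), s.var() + 1e-6,
                                  t.mean().detach(), t.var().detach() + 1e-6)
    return loss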



Paperid:1341
Authors:Jianqiao An, Yucheng Shi, Yahong Han, Meijun Sun, Qi Tian
Title: Extract and Merge: Superpixel Segmentation with Regional Attributes
Abstract:
For a certain object in an image, the relationship between its central region and the peripheral region is not well utilized in existing superpixel segmentation methods. In this work, we propose the concept of regional attribute, which indicates the location of a certain region in the object. Based on the regional attributes, we propose a novel superpixel method called Extract and Merge (EAM). In the extracting stage, we design square windows with a side length of a power of two, named power-window, to extract regional attributes by calculating boundary clearness of objects in the window. The larger windows are for the central regions and the smaller ones correspond to the peripheral regions. In the merging stage, power-windows are merged according to the defined attraction between them. Specifically, we build a graph model and propose an efficient method to make the large windows merge the small ones strategically, regarding power-windows as vertices and the adjacencies between them as edges. We demonstrate that our superpixels have fine boundaries and are superior to the respective state-of-the-art algorithms on multiple benchmarks."



Paperid:1342
Authors:Meng Chang, Qi Li, Huajun Feng, Zhihai Xu
Title: Spatial-Adaptive Network for Single Image Denoising
Abstract:
Previous works have shown that convolutional neural networks can achieve good performance in image denoising tasks. However, limited by the local rigid convolutional operation, these methods lead to oversmoothing artifacts. A deeper network structure could alleviate these problems, but at the cost of additional computational overhead. In this paper, we propose a novel spatial-adaptive denoising network (SADNet) for efficient single image blind noise removal. To adapt to changes in spatial textures and edges, we design a residual spatial-adaptive block. Deformable convolution is introduced to sample the spatially related features for weighting. An encoder-decoder structure with a context block is introduced to capture multiscale information. By conducting noise removal from coarse to fine, a high-quality noise-free image is obtained. We apply our method to both synthetic and real noisy image datasets. The experimental results demonstrate that our method outperforms the state-of-the-art denoising methods both quantitatively and visually."



Paperid:1343
Authors:Jiangxin Dong, Jinshan Pan
Title: Physics-based Feature Dehazing Networks
Abstract:
We propose a physics-based feature dehazing network for image dehazing. In contrast to most existing end-to-end trainable network-based dehazing methods, we explicitly consider the physics model of the haze process in the network design and remove haze in a deep feature space. We propose an effective feature dehazing unit (FDU), which is applied to the deep feature space to explore useful features for image dehazing based on the physics model. The FDU is embedded into an encoder and decoder architecture with residual learning, so that the proposed network can be trained in an end-to-end fashion and effectively help haze removal. The encoder and decoder modules are adopted for feature extraction and clear image reconstruction, respectively. The residual learning is applied to increase the accuracy and ease the training of deep neural networks. We analyze the effectiveness of the proposed network and demonstrate that it can effectively dehaze images with favorable performance against state-of-the-art methods."



Paperid:1344
Authors:Yash Patel, Tomáš Hodaň, Jiří Matas
Title: Learning Surrogates via Deep Embedding
Abstract:
This paper proposes a technique for training neural networks by minimizing surrogate losses that approximate the target evaluation metric, which may be non-differentiable. The surrogates are learned via a deep embedding where the Euclidean distance between the prediction and the ground truth corresponds to the value of the evaluation metric. The effectiveness of the proposed technique is demonstrated in a post-tuning setup, where a trained model is tuned on the learned surrogate. The scores on the evaluation metric are improved without any significant computational overhead. Without any bells and whistles, improvements are demonstrated on challenging and the practical tasks of scene-text recognition (training with the edit distance metric) and scene-text detection (training with the intersection over union metric for rotated bounding boxes). Relative improvements of up to $38\%$ (in the total edit distance) and $4.25\%$ (in the $F_{1}$ score) were achieved in the recognition and detection tasks respectively."
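
A minimal sketch of learning such a surrogate via a deep embedding, assuming a placeholder embedding network and a user-supplied (possibly non-differentiable) metric function; the architecture and training loop are illustrative, not the paper's implementation:

import torch
import torch.nn as nn

class SurrogateEmbedding(nn.Module):
    # Embeds predictions and ground truths so that their Euclidean distance
    # approximates the value of the evaluation metric.
    def __init__(self, in_dim, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, emb_dim))

    def surrogate(self, prediction, ground_truth):
        return (self.net(prediction) - self.net(ground_truth)).norm(dim=-1)

def fit_surrogate(embed, metric_fn, pairs, epochs=10, lr=1e-3):
    # pairs: iterable of (prediction, ground_truth) feature tensors.
    opt = torch.optim.Adam(embed.parameters(), lr=lr)
    for _ in range(epochs):
        for pred, gt in pairs:
            target = torch.tensor(float(metric_fn(pred, gt)))
            loss = (embed.surrogate(pred, gt) - target).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return embed

Once fitted, embed.surrogate(prediction, ground_truth) is differentiable and can be minimized in place of the original metric during post-tuning.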



Paperid:1345
Authors:Jibin Gao, Wei-Shi Zheng, Jia-Hui Pan, Chengying Gao, Yaowei Wang, Wei Zeng, Jianhuang Lai
Title: An Asymmetric Modeling for Action Assessment
Abstract:
Action assessment is a task of assessing the performance of an action. It is widely applicable to many real-world scenarios such as medical treatment and sporting events. However, existing methods for action assessment are mostly limited to individual actions, especially lacking modeling of the asymmetric relations among agents (e.g., between persons and objects); and this limitation undermines their ability to assess actions containing asymmetrically interactive motion patterns, since there always exists subordination between agents in many interactive actions. In this work, we model the asymmetric interactions among agents for action assessment. In particular, we propose an asymmetric interaction module (AIM), to explicitly model asymmetric interactions between intelligent agents within an action, where we group these agents into a primary one (e.g., human) and secondary ones (e.g., objects). We perform experiments on JIGSAWS dataset containing surgical actions, and additionally collect a new dataset, TASD-2, for interactive sporting actions. The experimental results on two interactive action datasets show the effectiveness of our model, and our method achieves state-of-the-art performance. The extended experiment on AQA-7 dataset also demonstrates the generalization capability of our framework to conventional action assessment."



Paperid:1346
Authors:Wenyu Sun, Chen Tang, Weigui Li, Zhuqing Yuan, Huazhong Yang, Yongpan Liu
Title: High-quality Single-model Deep Video Compression with Frame-Conv3D and Multi-frame Differential Modulation
Abstract:
Deep learning (DL) methods have revolutionized the paradigm of computer vision tasks and DL-based video compression is becoming a hot topic. This paper proposes a deep video compression method to simultaneously encode multiple frames with Frame-Conv3D and differential modulation. We first adopt Frame-Conv3D instead of traditional Channel-Conv3D for efficient multi-frame fusion. When generating the binary representation, the multi-frame differential modulation is utilized to alleviate the effect of quantization noise. By analyzing the forward and backward computing flow of the modulator, we identify that this technique can make full use of past frames' information to remove the redundancy between multiple frames, and thus achieves better performance. A dropout scheme combined with the differential modulator is proposed to enable bit rate optimization within a single model. Experimental results show that the proposed approach outperforms the H.264 and H.265 codecs in the region of low bit rate. Compared with recent DL-based methods, our model also achieves competitive performance. "



Paperid:1347
Authors:Tong He, Yifan Liu, Chunhua Shen, Xinlong Wang, Changming Sun
Title: Instance-Aware Embedding for Point Cloud Instance Segmentation
Abstract:
Although recent works have made significant progress in encoding meaningful context information for instance segmentation in 2D images, the works for 3D point cloud counterpart lag far behind. Conventional methods use radius search or other similar methods for aggregating local information. However, these methods are unaware of the instance context and fail to realize the boundary and geometric information of an instance, which are critical to separate adjacent objects. In this work, we study the influence of instance-aware knowledge by proposing an Instance-Aware Module (IAM). The proposed IAM learns discriminative instance embedding features in two-fold: (1) Instance contextual regions, covering the spatial extension of an instance, are implicitly learned and propagated in the decoding process. (2) Instance-dependent geometric knowledge is included in the embedding space, which is informative and critical to discriminate adjacent instances. Moreover, the proposed IAM is free from complex and time-consuming operations, showing superiority in both accuracy and efficiency over the previous methods. To validate the effectiveness of our proposed method, comprehensive experiments have been conducted on three popular benchmarks for instance segmentation: ScannetV2, S3DIS, and PartNet. The flexibility of our method allows it to handle both indoor scenes and CAD objects. We achieve state-of-the-art performance, outperforming previous methods substantially."



Paperid:1348
Authors:Lili Pan, Shijie Ai, Yazhou Ren, Zenglin Xu
Title: Self-Paced Deep Regression Forests with Consideration on Underrepresented Examples
Abstract:
Deep discriminative models (e.g., deep regression forests, deep neural decision forests) have recently achieved remarkable success in solving problems such as facial age estimation and head pose estimation. Most existing methods pursue robust and unbiased solutions either through learning discriminative features or by reweighting samples. We argue that what is more desirable is to learn gradually to discriminate, as humans do, and hence we resort to self-paced learning (SPL). Then, a natural question arises: can a self-paced regime lead deep discriminative models to achieve more robust and less biased solutions? To this end, this paper proposes a new deep discriminative model: self-paced deep regression forests with consideration on underrepresented examples (SPUDRFs). It tackles the fundamental ranking and selection problem in SPL from a new perspective: fairness. This paradigm is fundamental and can easily be combined with a variety of deep discriminative models (DDMs). Extensive experiments on two computer vision tasks, i.e., facial age estimation and head pose estimation, demonstrate the efficacy of SPUDRFs, where state-of-the-art performance is achieved."



Paperid:1349
Authors:Jianli Zhou, Chao Liang, Jun Chen
Title: Manifold Projection for Adversarial Defense on Face Recognition
Abstract:
Although deep convolutional neural network based face recognition system has achieved remarkable success, it is susceptible to adversarial images: carefully constructed imperceptible perturbations can easily mislead deep neural networks. A recent study has shown that in addition to regular off-manifold adversarial images, there are also adversarial images on the manifold. In this paper, we propose adversarial variational autoencoder (A-VAE), a novel framework to tackle both types of attacks. We hypothesize that both off-manifold and on-manifold attacks move the image away from the high probability region of image manifold. We utilize variational autoencoder (VAE) to estimate the lower bound of the log-likelihood of image and explore to project the input images back into the high probability regions of image manifold again. At inference time, our model synthesizes multiple similar realizations of a given image by random sampling, then the nearest neighbor of the given image is selected as the final input of the face recognition model. We also use adversarial training to enhance the robustness of our model against adversarial perturbations. As a preprocessing operation, our method is attack-agnostic and can adapt to a wide range of resolutions. The experimental results on LFW demonstrate that our method achieves state-of-the-art defense success rate against conventional off-manifold attacks such as FGSM, PGD, and C\&W under both grey-box and white-box settings, and even on-manifold attack."



Paperid:1350
Authors:Lele Cheng, Xiangzeng Zhou, Liming Zhao, Dangwei Li, Hong Shang, Yun Zheng, Pan Pan, Yinghui Xu
Title: Weakly Supervised Learning with Side Information for Noisy Labeled Images
Abstract:
In many real-world datasets, like WebVision, the performance of DNN-based classifiers is often limited by noisy labeled data. To tackle this problem, image-related side information, such as captions and tags, often reveals underlying relationships across images. In this paper, we present an efficient weakly-supervised learning approach using a Side Information Network (SINet), which aims to effectively carry out large-scale classification with severely noisy labels. The proposed SINet consists of a visual prototype module and a noise weighting module. The visual prototype module is designed to generate a compact representation for each category by introducing the side information. The noise weighting module aims to estimate the correctness of each noisy image and produce a confidence score for image ranking during the training procedure. The proposed SINet can largely alleviate the negative impact of noisy image labels and is beneficial for training a high-performance CNN-based classifier. Besides, we release a fine-grained product dataset called AliProducts, which contains more than 2.5 million noisy web images crawled from the internet using queries generated from 50,000 fine-grained semantic classes. Extensive experiments on several popular benchmarks (i.e., WebVision, ImageNet and Clothing-1M) and our proposed AliProducts achieve state-of-the-art performance. SINet won first place in the 5000-category classification task of the WebVision Challenge 2019, and outperforms other competitors by a large margin."



Paperid:1351
Authors:Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu , Zhiwei Yang
Title: Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision
Abstract:
Violence detection has been studied in computer vision for years. However, previous works are either superficial, e.g., classification of short clips in a single scenario, or undersupplied, e.g., a single modality or hand-crafted-feature-based multimodality. To address this problem, in this work we first release a large-scale, multi-scene dataset named XD-Violence with a total duration of 217 hours, containing 4754 untrimmed videos with audio signals and weak labels. Then we propose a neural network containing three parallel branches to capture different relations among video snippets and integrate features, where the holistic branch captures long-range dependencies using a similarity prior, the localized branch captures local positional relations using a proximity prior, and the score branch dynamically captures the closeness of predicted scores. Besides, our method also includes an approximator to meet the needs of online detection. Our method outperforms other state-of-the-art methods on our released dataset and another existing benchmark. Moreover, extensive experimental results also show the positive effect of multimodal (audio-visual) input and relationship modeling. The code and dataset will be released at https://roc-ng.github.io/XD-Violence/."



Paperid:1352
Authors:Rui Fan, Hengli Wang, Peide Cai, Ming Liu
Title: SNE-RoadSeg: Incorporating Surface Normal Information into Semantic Segmentation for Accurate Freespace Detection
Abstract:
Freespace detection is an essential component of visual perception for self-driving cars. The recent efforts made in data-fusion convolutional neural networks (CNNs) have significantly improved semantic driving scene segmentation. Freespace can be hypothesized as a ground plane, on which the points have similar surface normals. Hence, in this paper, we first introduce a novel module, named surface normal estimator (SNE), which can infer surface normal information from dense depth/disparity images with high accuracy and efficiency. Furthermore, we propose a data-fusion CNN architecture, referred to as RoadSeg, which can extract and fuse features from both RGB images and the inferred surface normal information for accurate freespace detection. For research purposes, we publish a large-scale synthetic freespace detection dataset, named Ready-to-Drive (R2D) road dataset, collected under different illumination and weather conditions. The experimental results demonstrate that our proposed SNE module can benefit all the state-of-the-art CNNs for freespace detection, and our SNE-RoadSeg achieves the best overall performance among different datasets."
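
For context, a generic baseline for inferring surface normals from a dense depth map via local gradients of the back-projected point cloud; this is a common construction shown for illustration, not the paper's SNE module:

import numpy as np

def normals_from_depth(depth, fx, fy, cx, cy):
    # depth: (H, W) metric depth map; fx, fy, cx, cy: pinhole intrinsics.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project pixels to camera-space 3D points.
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    points = np.stack([X, Y, depth], axis=-1)          # (H, W, 3)
    # Tangent vectors from finite differences, normal from their cross product.
    tangent_u = np.gradient(points, axis=1)
    tangent_v = np.gradient(points, axis=0)
    n = np.cross(tangent_u, tangent_v)
    n /= (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)
    return n                                           # unit normals per pixel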



Paperid:1353
Authors:Chengfeng Wen, Yang Guo, Xianfeng Gu
Title: Modeling the Space of Point Landmark Constrained Diffeomorphisms
Abstract:
Surface registration plays a fundamental role in shape analysis and geometric processing. Generally, there are three criteria for evaluating a surface mapping result: diffeomorphism, small distortion and feature alignment. In order to fulfill these requirements, this work proposes a novel model of the space of point landmark constrained diffeomorphisms. Based on Teichmüller theory, this mapping space is generated by the Beltrami coefficients, which are infinitesimally Teichmüller equivalent to $0$. These Beltrami coefficients are the solutions to a linear equation group. By using this theoretic model, optimal registrations can be achieved by iterative optimization with linear constraints in the diffeomorphism space, such as harmonic maps and Teichmüller maps, which minimize different types of distortion. The theoretic model is rigorous and has practical value. Our experimental results demonstrate the efficiency and efficacy of the proposed method."



Paperid:1354
Authors:Han-Ul Kim, Young Jun Koh, Chang-Su Kim
Title: PieNet: Personalized Image Enhancement Network
Abstract:
Image enhancement is an inherently subjective process since people have diverse preferences for image aesthetics. However, most enhancement techniques pay less attention to the personalization issue despite its importance. In this paper, we propose the first deep learning approach to personalized image enhancement, which can enhance new images for a new user, by asking him or her to select about 10$\sim$20 preferred images from a random set of images. First, we represent various users' preferences for enhancement as feature vectors in an embedding space, called preference vectors. We construct the embedding space based on metric learning. Then, we develop the personalized image enhancement network (PieNet) to enhance images adaptively using each user's preference vector. Experimental results demonstrate that the proposed algorithm is capable of achieving personalization successfully, as well as outperforming conventional general image enhancement algorithms significantly. The source codes and trained models are available at https://github.com/hukim1124/PieNet."



Paperid:1355
Authors:Arman Karimian, Ziqi Yang, Roberto Tron
Title: Rotational Outlier Identification in Pose Graphs Using Dual Decomposition
Abstract:
In the last few years, there has been an increasing trend to consider Structure from Motion (SfM, in computer vision) and Simultaneous Localization and Mapping (SLAM, in robotics) problems from the point of view of pose averaging (also known as global SfM, in computer vision) or Pose Graph Optimization (PGO, in robotics), where the motion of the camera is reconstructed by considering only relative rigid body transformations instead of including also 3-D points (as done in a full Bundle Adjustment). At a high level, the advantage of this approach is that modern solvers can effectively avoid most of the problems of local minima, and that it is easier to reason about outlier poses (caused by feature mismatches and repetitive structures in the images). In this paper, we contribute to the state of the art of the latter, by proposing a method to detect incorrect orientation measurements prior to pose graph optimization by checking the geometric consistency of rotation measurements. The novel aspects of our method are the use of Expectation-Maximization to fine-tune the covariance of the noise in inlier measurements, and a new approximate graph inference procedure, of independent interest, that is specifically designed to take advantage of evidence on cycles with better performance than standard approaches (Belief Propagation). The paper includes simulation and experimental results that evaluate the performance of our outlier detection and cycle-based inference algorithms on synthetic and real-world data."



Paperid:1356
Authors:Dipanjan Das, Sandika Biswas, Sanjana Sinha, Brojeshwar Bhowmick
Title: Speech-driven Facial Animation using Cascaded GANs for Learning of Motion and Texture
Abstract:
Speech-driven facial animation methods should produce accurate and realistic lip motions with natural expressions and realistic texture portraying target-specific facial characteristics. Moreover, the methods should adapt quickly to unknown faces and speech during inference. Current state-of-the-art methods fail to generate realistic animation from arbitrary speech on unknown faces due to their poor generalization over different facial characteristics, languages, and accents. Some of these failures can be attributed to the end-to-end learning of the complex relationship between the multiple modalities of speech and video. In this paper, we propose a novel strategy that partitions the problem and learns motion and texture separately. First, we train a GAN to learn lip motion on canonical landmarks using DeepSpeech features and to induce eye blinks, before transferring the motion to person-specific landmarks. Next, we use another GAN-based texture generator network to generate a high-fidelity face corresponding to the motion on the person-specific landmarks. We use meta-learning to make the texture generator GAN more flexible, so that it adapts to an unknown subject's facial traits during inference. Our method produces significantly better facial animation than state-of-the-art methods, generalizes well across different datasets, languages, and accents, and also works reliably in the presence of noise in the speech."
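
A structural sketch of the cascade described above may help fix ideas (all module internals, feature dimensions, and image sizes here are assumptions, not the paper's networks; the adversarial losses and meta-learning are omitted): stage one maps speech features to canonical landmark motion, the motion is transferred onto the target person's landmarks, and stage two renders a face frame per landmark set.

# Structural sketch only (placeholder modules, not the paper's cascaded GANs).
import torch
import torch.nn as nn

class MotionGenerator(nn.Module):        # speech features -> canonical landmark offsets
    def __init__(self, feat_dim=29, n_landmarks=68):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 128, batch_first=True)
        self.head = nn.Linear(128, n_landmarks * 2)

    def forward(self, speech_feats):                     # (B, T, feat_dim)
        h, _ = self.rnn(speech_feats)
        return self.head(h).view(*h.shape[:2], -1, 2)    # (B, T, 68, 2)

class TextureGenerator(nn.Module):       # person-specific landmarks -> face frame
    def __init__(self, n_landmarks=68):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_landmarks * 2, 256), nn.ReLU(),
                                 nn.Linear(256, 3 * 64 * 64), nn.Tanh())

    def forward(self, landmarks):                        # (B, 68, 2)
        return self.net(landmarks.flatten(1)).view(-1, 3, 64, 64)

motion_gen, texture_gen = MotionGenerator(), TextureGenerator()
speech = torch.randn(1, 16, 29)                          # e.g. per-frame DeepSpeech features
identity_landmarks = torch.randn(1, 68, 2)               # target person's neutral landmarks
canonical_motion = motion_gen(speech)                    # stage 1: motion in canonical space
person_landmarks = identity_landmarks.unsqueeze(1) + canonical_motion   # motion transfer
frames = torch.stack([texture_gen(lm) for lm in person_landmarks.unbind(1)], dim=1)  # stage 2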



Paperid:1357
Authors:Rakib Hyder, Zikui Cai, M. Salman Asif
Title: Solving Phase Retrieval with a Learned Reference
Abstract:
Fourier phase retrieval is a classical problem that deals with the recovery of an image from the amplitude measurements of its Fourier coefficients. Conventional methods solve this problem via iterative (alternating) minimization by leveraging some prior knowledge about the structure of the unknown image. The inherent shift and flip ambiguities in the Fourier measurements make this problem especially difficult, and most existing methods use several random restarts with different permutations. In this paper, we assume that a known (learned) reference is added to the signal before capturing the Fourier amplitude measurements. Our method is inspired by the principle of adding a reference signal in holography. To recover the signal, we implement an iterative phase retrieval method as an unrolled network. Then we use backpropagation to learn the reference that provides the best reconstruction for a fixed number of phase retrieval iterations. We performed a number of simulations on a variety of datasets under different conditions and found that our proposed method for phase retrieval via an unrolled network and a learned reference provides near-perfect recovery at a fixed (small) computational cost. We compared our method with standard Fourier phase retrieval methods and observed significant performance enhancement using the learned reference."
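
The unrolling-plus-learned-reference idea can be sketched as follows (a minimal illustration with assumed image sizes, step size, and loss; the authors' actual solver and training setup may differ): a fixed number of gradient steps on the amplitude-matching loss is unrolled, so the reconstruction error can be backpropagated through all iterations to the reference.

# Sketch of unrolled phase retrieval with a learnable reference (assumptions, not the
# authors' exact implementation): measurements are y = |F(x + u)| and u is trained by
# backpropagating the reconstruction error through the unrolled gradient iterations.
import torch

def unrolled_phase_retrieval(y, u, n_iters=20, step=0.05):
    """y: measured Fourier amplitudes of (x_true + u); u: the (learnable) reference."""
    x = torch.zeros_like(u).requires_grad_(True)
    for _ in range(n_iters):
        amplitude = torch.abs(torch.fft.fft2(x + u))
        data_loss = ((amplitude - y) ** 2).mean()
        (grad_x,) = torch.autograd.grad(data_loss, x, create_graph=True)
        x = x - step * grad_x            # keeps the dependence on u in the graph
    return x

# Learning the reference from simulated measurements of training images.
u = torch.randn(32, 32, requires_grad=True)
optimizer = torch.optim.Adam([u], lr=1e-2)
for _ in range(3):                                       # a few illustrative steps
    x_true = torch.rand(32, 32)
    y = torch.abs(torch.fft.fft2(x_true + u)).detach()   # simulated amplitude measurements
    x_hat = unrolled_phase_retrieval(y, u)
    recon_loss = ((x_hat - x_true) ** 2).mean()
    optimizer.zero_grad()
    recon_loss.backward()
    optimizer.step()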



Paperid:1358
Authors:Chengde Wan, Thomas Probst, Luc Van Gool, Angela Yao
Title: Dual Grid Net: Hand Mesh Vertex Regression from Single Depth Maps
Abstract:
We aim to recover the dense 3D surface of the hand from depth maps and propose a network that predicts mesh vertices, transformation matrices for every joint, and joint coordinates in a single forward pass. Using fully convolutional architectures, we first map depth image features to the mesh grid and then regress the mesh-grid coordinates into real-world 3D coordinates. The final mesh is found by sampling from the mesh grid and refitting it in closed form based on an articulated template mesh. When trained with supervision from sparse keypoints, our accuracy is comparable with the state of the art on the NYU dataset for keypoint localization, all while recovering mesh vertices and dense correspondences. Under multi-view training settings, our framework can also learn through self-supervision by minimizing a set of data-fitting terms and kinematic priors. Our approach is competitive with strongly supervised methods and showcases the potential of self-supervision for dense mesh estimation."
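
A loose sketch of the "coordinates on a mesh grid" idea follows (layer sizes, grid resolution, the random UV locations, and the sampling step are assumptions, not the paper's Dual Grid Net; the closed-form refit against the articulated template is omitted): a fully convolutional head predicts a 3-channel map of 3D coordinates, and vertex positions are read off by bilinearly sampling that map at fixed UV locations of the template mesh.

# Sketch only: per-pixel 3D coordinates over a mesh grid, sampled at template UVs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridCoordHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.coord = nn.Conv2d(64, 3, 1)            # per-cell (x, y, z) in camera space

    def forward(self, depth):                       # depth: (B, 1, H, W)
        return self.coord(self.backbone(depth))     # (B, 3, H, W) coordinate grid

model = GridCoordHead()
depth = torch.rand(1, 1, 64, 64)
coord_grid = model(depth)

# Fixed UV locations of the template mesh vertices in [-1, 1]^2 (random stand-ins here;
# 778 vertices as in a MANO-like hand mesh).
uv = torch.rand(1, 778, 1, 2) * 2 - 1
vertices = F.grid_sample(coord_grid, uv, align_corners=True)   # (B, 3, 778, 1)
vertices = vertices.squeeze(-1).permute(0, 2, 1)               # (B, 778, 3) mesh vertices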